apophenia-1.0+ds/AUTHORS
Ben Klemens (fluffmail@f-m.fm)

apophenia-1.0+ds/COPYING
GNU GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc. 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. 
These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. 
(Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. 
c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. 
You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. 
If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. 
If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. 
The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Also add information on how to contact you by electronic and paper mail. 
If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) year name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker. , 1 April 1989 Ty Coon, President of Vice This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.

apophenia-1.0+ds/ChangeLog
Since April 2014, this changelog includes only those changes that break existing code or are especially notable. For a full list of changes, see the Git history at https://github.com/b-k/Apophenia/commits/master

[Key, pre-March 2014:
--Addition or improvement
**Change that could require recoding existing code.
!!Big.
]

October 2014
** apop_model_stack --> apop_model_cross

August 2014
** default for apop_data_pack is .all_pages='y' (was 'n').
** remove apop_plot_lattice, apop_plot_triangle, apop_plot_line_and_scatter, apop_plot_qq.
Find them at https://github.com/b-k/apophenia/wiki/gnuplot_snippets .

March 2014
--Command-line tools print help should a user add a --help option.

February 2014
!!OpenMP for threading. All calls to apop_map and friends, apop_model_draws, and others auto-thread.
!!apop_rng_get_thread to get a thread-specific RNG, so you can thread random processes.

January 2014
**View macro reform:
    Apop_cols        Apop_rows        A contiguous set of columns or rows as an apop_data set (with names)
    Apop_col         Apop_row         One column or row as an apop_data set
    Apop_col_t       Apop_row_t       One column or row as an apop_data set, retrieved by row/col name
    Apop_col_v       Apop_row_v       One column or row as a gsl_vector
    Apop_col_tv      Apop_row_tv      One column or row as a gsl_vector, retrieved by row/col name
    Apop_matrix_col  Apop_matrix_row  One column or row as a gsl_matrix
**MLE methods are now strings instead of all-caps enums.
**All blank elements of a data->text grid point to the same NUL string.
--Add apop_model_metropolis; revise apop_update accordingly.
--apop_draw uses metropolis to draw from any model with a log likelihood/p and where data size>1.
**Replace all instances of output_file with output_name (GNU sed -i 's/output_file/output_name/g' *c)
--Consolidated headers
**apop.h no longer #includes time.h or unistd.h
**apop_draw returns zero on success; nonzero on failure.
**Removed the BIC-by-cells from estimation output. Added AIC_c. OLS now reports the ICs along with R^2.

December 2013
--append and replace options for apop_text_to_db
--apop_probit bug fix
**apop_plot_histograms now uses gnuplot's impulses, not boxes by default---handles missing zero bins better.
**MLE path trace lists the probs/loglikelihoods in the vector of the apop_data set it produces. path is apop_data**, from apop_data*.
**apop_data_transpose has an .inplace option, which is 'y' by default. Add .inplace='n' to existing uses.
--siman checks constraints for the starting point.
--Mixture models overhauled.
--cleaned up the command-line utilities
**removed apop_lookup.

November 2013
--rewrite apop_data_sort to allow sorting by multiple columns or text or names
--apop_pmf now has a CDF method.
--fixed up K-S tests.
--removed the Swig Python wrapper from this package.
**replaced char apop_opts.db_nan[101] with char *apop_opts.nan_string. More descriptive, easier to use.
**apop_name_find does plain case-insensitive search; no regexes.

October 2013
**all built-in models (apop_ols, apop_dirichlet, ...) are now apop_model* (ptr-to-struct), from apop_model (plain struct).
**apop_estimate and apop_copy take in an apop_model* instead of plain apop_model.
**printing no longer part of the apop_model struct; uses a vtable.

September 2013
**change vbase, m1base, m2base ==> vsize, msize1, msize2
**Estimate returns void (was apop_model*)
--vtable mechanism improvements
**Remove score, predict, and parameter_model from the apop_model object; use the vtable mechanism.
**Upgrade model p, ll, cdf, constraint to return long double (was double)
**consolidate vector_var and vector_weighted_var. same with cov, mean, weighted_skew, and weighted_pop. Users just have to replace apop_vector_weighted_var with apop_vector_var.
**removed deprecated.h entirely.
--apop_data_add_named_elmts puts new data in the vector, not the matrix, because it is intended for a list of scalars (==a vector). If you use apop_data_get(infodata, .rowname="statistic name") then you'll be able to retrieve the element either way.
**removed apop_line_to_data and apop_line_to_matrix. Use apop_data_fill and apop_data_falloc.

August 2013
--apop_map(_sum) properly threads data-row mappings. .inplace='v' to return NULL.
**Remove apop_settings_alloc, apop_settings_group_alloc
**Change Apop_row to return an apop_data set, not a vector (for which use Apop_matrix_row).
**Apop_settings_set sets model->error='s' on error, instead of returning.
**Add a .want_path='y' setting to your apop_mle_settings group, and I'll put a list of the points tried by the optimizer in an apop_data set named path (in the settings group); see documentation for details. Remove the former trace_path mechanism.
**removed apop_(vector|matrix)_increment. Use, e.g., *gsl_vector_ptr(v, 7) += 3; or (*gsl_vector_ptr(v, 7))++.
**Some mu and sigmas => μ and σ

June 2013
--replaced makefile in base directory with ./configure.
--version number now equals build date.
**name->title is a ptr; name->column => name->col
**removed apop_strip_dots; it's up to the user to give reasonable names for the db.

May 2013
--jacobian transformations
--Apop_model_copy_set to copy a model and add a settings group at once
--mixture models
--data-data composition
--added apop_model_draws
!!vtables, allowing for more functions with special cases for certain model(s) outside of the model object itself.

April 2013
--plugged some memory leaks
--default tolerance for MLE is much finer (1e-5).
--Added apop_text_fill
**Finally removed support for gsl_histograms, including the apop_histogram model. This cut has been promised for about four years now. Use the apop_pmf instead.
**apop_data_to_bins no longer modifies the input data in place. It now makes a copy and modifies the copy.
**Removed apop_crosstab_to_pmf. There's a version at https://github.com/b-k/Apophenia/wiki/Crosstab-to-PMF for matrices.
**Removed apop_vector_to_array. If your array has stride 1, use your_array->data; else write a for loop to copy out the data.
**Removed apop_array_to_matrix.

March 2013
**apop_text_paste now prints the pasted string at verbosity level 3 (formerly 2).

February 2013
--Starting_point in Bayesian updating no longer does anything, but it was never significant to begin with. Added some verbosity options to apop_update.
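
A sketch of the "write a for loop to copy out the data" advice from the April 2013 note on apop_vector_to_array. So that it compiles without GSL, it uses a hypothetical stand-in struct, strided_vec, holding only size, stride, and data fields in the manner of a gsl_vector; strided_to_array and strided_demo are likewise names invented for this illustration, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the size/stride/data fields of a gsl_vector,
   so that this sketch compiles without GSL. */
typedef struct {
    size_t size;    /* number of logical elements */
    size_t stride;  /* step between consecutive elements in data */
    double *data;   /* underlying storage */
} strided_vec;

/* Copy the logical elements into a newly allocated contiguous array.
   Returns NULL if the input is NULL or empty, or if allocation fails.
   The caller frees the result. */
double *strided_to_array(const strided_vec *v){
    if (!v || !v->size) return NULL;
    double *out = malloc(v->size * sizeof *out);
    if (!out) return NULL;
    for (size_t i = 0; i < v->size; i++)
        out[i] = v->data[i * v->stride];
    return out;
}

/* Usage: pull every second element out of a six-element buffer. */
void strided_demo(void){
    double buf[] = {1, -1, 2, -1, 3, -1};
    strided_vec v = {.size = 3, .stride = 2, .data = buf};
    double *a = strided_to_array(&v);
    assert(a && a[0] == 1 && a[1] == 2 && a[2] == 3);
    free(a);
}
```

With stride 1 the loop reduces to a plain copy, which is why the note says the data pointer can be used directly in that case.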

January 2013
--Logit regression much smarter about picking a starting point.
--Defaults for simulated annealing try 1600x fewer points. Prior settings were overkill.
--configure.ac checks for native asprintf and uses it if it is present
**Removed apop_db_merge and apop_db_merge_table. Get them from http://modelingwithdata.org/arch/00000141.html .
**Removed apop_matrix_fill; use apop_data_fill
**Removed apop_array_to_data.
**Removed apop_matrix_correlation; use apop_data_correlation
**Deprecated apop_rank_compress; use apop_data_to_dummies(..., .keep_first='y').
--apop_model_stack

December 2012
--Added an apop_pmf_settings group, eliminating a couple of hacks (e.g., see October 2012).
--faster read-in of text files
--Exponential model uses data in both the vector and matrix parts of the apop_data input
--transformation to generate mixtures (i.e., linear combinations) of models.
**apop_data_transpose now transposes the text element as well as the matrix (by default). Use apop_data_transpose(your_matrix, .transpose_text='n') to replicate the previous behavior.
--removed Autoconf pkg-config macros, because Autoconf no longer needs the help.
--writing apop_data to DB uses prepared queries where possible => much faster.
**It is now up to you to put apop_query("begin"); / apop_query("commit"); wrappers around writing of tables to the database.

November 2012
--apop_vector_unique_elements, apop_data_to_dummies, and apop_data_to_factors handle NaNs better; put them at the end of the sort order.
!!Finally added an .error element to apop_data and apop_model structs, thus simplifying error-checking.
--Apop_stopif macro, rendering the Apop_assert family largely obsolete (so if you're using them in your own work, consider them deprecated...).
--Where the assert macros used to abort() on errors, they now send signal(SIGTRAP), making debugging a little easier. Most host systems force an abort on SIGTRAP anyway.

October 2012
**Removed apop_strcmp.
If you still need it, this macro is basically equivalent:
    #define apop_strcmp(a, b) (((a)&&(b) && !strcmp((a), (b))) || (!(a) && !(b)))
--clean up of parameter-fixing model transformation.
--split off multiple imputation variance code.
**Removed apop_vector_grid_distance. Use apop_vector_distance(v1, v2, .metric='M');
--If apop_pmf.dsize==0, apop_pmf.draw returns a row number, not the data in that row. This will change shortly.
--Logit draw function, akin to apop_ols.draw. Both will change shortly.
--apop_data_to_factors now auto-allocates a matrix if need be (because it always auto-allocated a vector).

September 2012
--\0 in text files counts as white space
--fixed counting bug for text files with , sequences.

July 2012
--some MLE cleanup
--fixes to apop_rake
--apop_logit.score fixed

June 2012
--The sample kurtosis calculation is still more precise.

May 2012
--Autotools improvements. Use the standard 'make check' instead of the ad hoc 'make test'.
!!Set apop_opts.stop_on_warning='n' to never abort() on any type of error. E.g., GUIs that should never halt will use this. Default is still to halt on errors, because that's most useful for interactively developing numeric analyses.
--Use apop_opts.log_file=fopen("yourlog", "w") to divert the warnings/errors from stderr into yourlog.
--Some formerly void functions now return an int, to return an error code. E.g.,
    apop_opts.stop_on_warning='n';
    if (apop_data_set(data, row, col)) printf("Error! Nothing was set.\n");
--Fixed a memory leak in simulated annealing.
--Apop_data_row and apop_data_set_row handle row names

March 2012
--bug fix in apop_text_to_data when input file has no names.
--probit dlog likelihood isn't implemented for N>2; now acknowledging this.
--added apop_data_get_factor_names
--apop_(vector|matrix)_(map|apply) now accepts NULL input.

January 2012
**Removed apop_matrix_var_m, which nobody was using.
--bug fix in apop_vector_distance for Ln norms where n is odd.
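
A short check of the apop_strcmp replacement macro quoted in the October 2012 notes, using only the C standard library. The macro is copied from that entry; the apop_strcmp_demo wrapper and its sample strings are invented for this illustration. Note that, unlike strcmp, the macro is nonzero when the two strings match, and it treats two NULL pointers as equal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The replacement macro as quoted in the October 2012 entry. */
#define apop_strcmp(a, b) (((a)&&(b) && !strcmp((a), (b))) || (!(a) && !(b)))

/* Illustrative wrapper (not part of the library): exercise each case. */
void apop_strcmp_demo(void){
    const char *ols = "ols", *also_ols = "ols", *logit = "logit";
    const char *none = NULL;
    assert( apop_strcmp(ols, also_ols)); /* equal strings: nonzero */
    assert(!apop_strcmp(ols, logit));    /* different strings: zero */
    assert( apop_strcmp(none, none));    /* two NULLs count as equal */
    assert(!apop_strcmp(ols, none));     /* NULL vs. non-NULL: zero, and strcmp is never called */
}
```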
--reading data from text files rewritten; much more robust.
**A space is no longer a default delimiter. Use apop_opts.input_delimiters="| ,\t" to restore old default.
**apop_opts.db_nan is no longer a regex; I just do a case-insensitive comparison.

December 2011
--apop_model.textsize is now a size_t instead of an int.
--apop_update accepts likelihoods with no pre-set parameters

November 2011
--moved to Github; some changes to structure and documentation to accommodate.
**apop_ols.predict didn't do the OLS shuffle if the input has no vector; this was anomalous.

October 2011
--apop_text_paste added.
**apop_multinomial and apop_binomial overhauled. No longer accepting Bernoulli draws as input.
--standardization: make docs => make doc

September 2011
--`query turned up a blank table' warning turns up when apop_opts.verbose >=2. (used to be >=1)
--apop_t_test and apop_paired_t_test are quieter---no intermediate results until apop_opts.verbose >=2.
**apop_opts.db_name_column now has a blank default (instead of the SQL-specific and potentially surprise-inducing "rowname").

August 2011
--Bootstrap/Jackknife are better with text
**apop_data_memcpy used to reallocate memory for the text and names elements; use apop_data_copy if you want allocation done for you.

July 2011
--Apop_data_rm_rows now accepts a test function as well as a fixed list of rows to drop.

June 2011
--apop_crosstab_to_db handles missing labels and NaNs better.
**apop_matrix_to_db removed (as promised a few years ago). Use apop_matrix_print(yourdata, "tabname", .output_type='d').
**F-test defaults now match ANOVA tradition.
--documentation script doesn't use GNU extensions to awk; should now be POSIX-standard.

May 2011
**apop_map returns a data set with an allocated/filled vector when not called with .inplace='y'. Before, it had been making a full copy, which is idiosyncratic.
--apop_rake accepts a weights column.
--apop_anova uses variadic arguments for a marginally nicer interface (and better argument checking).
**apop_data_to_dummies tries to give nicer labels. You may have to recode things if you relied on the old labels.

March 2011
--Header files have been merged. A few long files are as easy to grep as a multitude of nearly one-line files. If you #include <apop.h> instead of the individual headers, then this shouldn't affect you. Due to redundancy, compilation with gcc takes 3% longer.

0.99

February 2011
!!The apop_PMF model has more support:
    --New supporting functions: apop_data_pmf_compress, apop_model_to_pmf
    --functions that took apop_histogram models now take apop_pmfs as well: apop_test_chi_squared_goodness_of_fit, apop_test_kolmogorov
    --Consider the apop_histogram to be deprecated. Only two associated functions were removed; see below.
**apop_histogram_plot is removed. Replace with:
    fprintf(apop_opts.output_pipe, "plot '-' using 1:3 with boxes\n");
    apop_model_print(hist);
    fprintf(apop_opts.output_pipe, "e\n");
**apop_histogram_print was a bad idea to begin with, because it basically replicates gsl_histogram_fprintf. Use apop_model_print(your_histogram), which calls gsl_histogram_fprintf, or call that function directly. The only difference: the GSL function prints [start of bin] [end of bin] [value] and apop_histogram_print showed [start of bin] [value]

December 2010
**apop_maximum_likelihood no longer calls apop_prep. If you want that, use apop_estimate.
--apop_text_to_db lets users specify types and keys.
--deleted some obsolete/deprecated items: apop_error, apop_multinomial_settings
--apop_data_split retains text when splitting by rows; still ignores it when splitting by columns.

November 2010
--apop_listwise_delete uses apop_opts.db_nan to check for missing data in the text part of the input data.
--apop_data_split handles names

September 2010
**Multinomial distribution sets N to be the length of the row (a single observation) rather than the size of the full data set.
Added apop_multinomial.parameter_model method for testing purposes. August 2010 ** What was apop_assert => apop_assert_c; what was apop_assert_s => apop_assert. Their arguments are slightly different, and the thing that was asserted no longer prints along with the message you chose. --verbosity defaults to 1. Queries print at verbosity >=2. --apop_data_to_db writes the weights. --Iterative proportional fitting, aka raking. **apop_text_add now frees the contents of cell in the text grid that you are about to overwrite, thus preventing memory leaks without effort from the user. If your existing code has other pointers to the string in that text cell, you'll have to replace the now string-freeing apop_text_add with asprintf(&(your_dataset->text[row][col]), "your string"). July 2010 ** apop_regex now gets all matches when you pull substrings. Each row of the text grid is a match, and if you have multiple substrings, each match's substrings will be along the columns. May require recoding because the substrings used to be along the rows; just switch the indices. June 2010 ** Removed the apop_rank settings group, and thus all the code related to it. It was just the wrong place to do this. Added apop_data_rank_expand to convert rank data to what the various models typically expect. This is another step for some users and could be a problem if the counts get into the billions, but it still makes more sense than rewriting every model twice. May 2010 **apop_data_prune_columns_base now takes in a list of strings terminated by a NULL, not by a zero-length string. --apop_data_get_row lets you pull a view from a data set. [this was briefly the apop_data_row] ==>apop_data_set_rows, apop_data_rm_rows ==> apop_data_listwise_delete is fifty lines shorter. --apop_parts_wanted_settings: fixes some infinite loops (est needs parameter models -> p.m. bootstraps for variances -> bootstrap runs estimate -> repeat) and allows just-the-parameters estimation when you want it. 
--cleaned up build system, including an added RPM spec file April 2010 --apop_t_distribution now has three parameters: mean, std dev, df. That is, it is based on un-normalized data. **apop_random_int and apop_random_double removed for not being particularly useful. March 2010 **The apop_predict special case for when all data is non-missing was a bit too special, and has been eliminated---you now have to specify the first column as NaN yourself. E.g., Apop_col(data, -1, to_nan); gsl_vector_set_all(to_nan, GSL_NAN); This will make things more predictable, and save you if(!has_nans)... else... kind of statements. **Removed the prepared element of the apop_model. **apop_model_prep ==> apop_prep for consistency with other apop_model dispatch functions. apop_model_prep left for now as an alias in deprecated.h !!apop_parameter_model: a method for getting the distribution of a parameter. **Moved OLS-family test stats (pval, qval, whatever) to a page of your_estimate->info. It won't be there for long either. --settings macros let you use lowercase, thus entirely ignoring that they're macros. **apop_settings_rm_group function, which you were probably not using, changed to apop_settings_remove_group; apop_settings_get_group function => apop_settings_get_grp. Having a macro and a function that differed by a question of case was a bad idea to begin with. February 2010 !!Overhauling the output from estimations; pardon our dust. --Added CDF method to the apop_model, including apop_cdf dispatch method and default via random draws. **Definitively removed the residual, covariance, and llikelihood elements from the apop_model struct. The first two will be pages appended to the data and parameters, respectively, and the last will be in the Info page appended to the parameters. **Renamed apop_ls_settings (least squares) to apop_lm_settings (linear model) "s/apop_ls/apop_lm/g" should work.
**Sundry lists of scalars, like the R^2 table and the estimation routine's info table, put the data in column zero, not column -1. In the next bullet point you'll see how this simplifies retrieval. **Added an info element to the apop_model--> more shuffling of auxiliary info. --Find results like the log likelihood or AIC via, e.g., apop_data_get(your_model->info, .rowname="log likelihood"); **Find the predicted/residuals via apop_data_get_page(your_model->info, "Predicted"); This means that the input data set is read-only again. --Find the parameter covariances via apop_data_get_page(your_model->parameters, "Covariance"); **apop_estimate_coefficient_of_determination takes in the model again. Just replace est->parameters in your argument with est. apop_ols calls this fn automatically now [apop_data_get(your_model->info, .rowname="R sq")], so you probably aren't even calling it anymore. **apop_data_add_named_elmt now writes to the zeroth element of the matrix, not the vector. So instead of apop_data_get(data, .rowname="R squared", -1), just go with apop_data_get(data, .rowname="R squared"). This affects many of the elements of the info-type matrices. --apop_data_pack, apop_data_unpack, apop_ml_impute, apop_map offer an option to use all pages. 0.23 January 2010 **expected_value element of the model renamed predict; made coherent across models. !!apop_data set now has a ->more pointer to an additional apop_data set, e.g., for data + covariances or predictions + confidence intervals. --apop_ml_imputation renamed to apop_ml_impute. "#define apop_ml_imputation apop_ml_impute" retains noun-form name, but consider it deprecated. !!apop_estimate now copies, preps, then estimates. Estimate method of apop_model struct can thus assume the copy/prep step has been done; probably should not do these itself. As a side-effect, apop_maximum_likelihood's second argument is now an apop_model* (used to be apop_model).
--apop_regex and apop_strcmp, for easier searching through your info pages. --minor rewording of COPYING2. **Because the Predicted table is now part of the parameter set, not the model, apop_estimate_coefficient_of_determination now takes in the parameter set, not the model. Just replace est in your argument with est->parameters. **apop_multinomial_probit folded into apop_probit, where it should've been all along. Regex for the fix: "s/apop_multinomial_probit/apop_probit/g" December 2009 --apop_strcmp --apop_loess model: 3,500 lines of code from the netlib archive, lovingly restored. November 2009 --apop_rm_columns bug fixed by Birger Baksaas. --apop_text_to_db attaches numeric affinity to sqlite3 columns, making numeric comparisons easier. --apop_histogram_model_reset's first argument is now "base" instead of "template", as a concession to C++ users. September 2009 --Many minor changes, mostly regarding adding optional arguments. --Dirichlet model --Output functions now take a consistent set of specs regarding where they will write. You no longer have to use the global apop_opts settings if you don't want to. August 2009 --apop_map and apop_map_sum. Reworks the apop_(map|apply) system to be more flexible but a little more complex. --apop_(data|matrix|vector)_fill is now more robust---no more int vs float issues. **Removed apop_count_cols --Default univariate RNG, if you don't have one: Adaptive rejection Markov chain sampling **.use_covar and other such settings now take 'y' or 'n', not 0 or 1. --new macro Apop_settings_set = Apop_settings_add, but makes more human sense --numeric covariance, formerly maligned, now works. July 2009 --multivariate gamma, log-gamma. --t, F, chi^2, Wishart distributions, for description [and Bayesian updating] --apop_matrix_to_positive_semidefinite and apop_matrix_is_positive_semidefinite --bug fixes --Apop_model_add_group replaces Apop_settings_add_group, and is much easier to work with.
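The August 2009 note above about apop_(data|matrix|vector)_fill and int-vs-float trouble points at a general C hazard: a variadic function that reads doubles misbehaves when handed a bare int. The sketch below shows one common way around it, routing the arguments through a compound literal so that each one is converted to double. It is an illustration only, not a description of Apophenia's implementation; fill_base and fill are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical helper (not an Apophenia function): copy n doubles into out.
   The caller must make sure out has room for n elements. */
static void fill_base(double *out, size_t n, const double *in) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
}

/* Hypothetical wrapper macro. The compound literal (double[]){...} converts
   every argument to double, so ints and floating-point values can be mixed.
   The element count is taken from the size of the literal; the operand of
   sizeof is not evaluated, so each argument is evaluated only once. */
#define fill(out, ...)                                          \
    fill_base((out),                                            \
              sizeof((double[]){__VA_ARGS__}) / sizeof(double), \
              (double[]){__VA_ARGS__})
```

With this, fill(v, 1, 2, 3.5) and fill(v, 1.0, 2.0, 3.5) store the same three values in v.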
June 2009 --More variadicized functions --notably, apop_estimate is much more useful --apop_opts.version. --apop_(vector|matrix|data)_stack have an inplace option, making stacking inside a for loop easier. --apop_test convenience function --more autoconf macros ==> some compilation hacks now done right May 2009 --mysql functions slightly cleaned up --apop_opts.db_user and apop_opts.db_pass for mysql. !!Functions that take lots of basically optional inputs, like apop_text_to_db, now use some designated initializer magic to let the user rearrange or omit inputs. **apop_dot also now allows designated initializers, which breaks (only) calls of the form apop_dot(a_vector, a_matrix, 't'). Replace with apop_dot(a_vector, a_matrix, 0, 't') or apop_dot(a_vector, a_matrix, .form2='t') --With optional inputs, some functions now handle RNGs for the lazy user ==>added apop_opts.rng_seed --apop_vector_distance is much more versatile **Removed apop_matrix_summarize. Too much like apop_data_summarize. Just replace every instance of apop_matrix_summarize(m) with apop_data_summarize(apop_matrix_to_data(m)). April 2009 --sample moments are now mega-accurate---possibly the most unbiased estimators in code today. !!Python interface via swig March 2009 --apop_matrix_realloc, apop_vector_realloc --sqlite queries no longer rely on a temp table ==> faster --fixed bugs in apop_table_exists making queries fail in Cygwin. Jan/Feb 2009 --Added more tests; some cleanup in test.c --Binomial distribution looks in both the data set's vector and matrix December 2008 --When writing x=infinity to a db table, I now write 'inf' to the db, instead of breaking. SQLite has no standard here. October 2008 --bug fixes to new apop_data_show --bug fixes to apop_bernoulli.p --apop_update tweaks September 2008 --Documentation overhaul --apop_data_show is much more screen-friendly. Keep using apop_data_print for more machine-readable and less fixed-width output. 
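The May 2009 notes above mention functions that accept designated initializers, as in apop_dot(a_vector, a_matrix, .form2='t'). A minimal sketch of how such an interface can be built in C99 follows: a macro wraps its arguments in a struct literal, and any omitted field arrives as zero, so the function can substitute a default. The names power, power_base, and power_args are hypothetical, and this is not claimed to be how Apophenia itself does it.

```c
/* Hypothetical argument bundle; the field names are illustrative only. */
typedef struct {
    double base;
    int    exponent;   /* non-negative; 0 is treated as "not given" */
} power_args;

/* Raise base to a non-negative integer exponent, defaulting to squaring. */
static double power_base(power_args in) {
    int e = in.exponent ? in.exponent : 2;   /* omitted exponent defaults to 2 */
    double out = 1;
    for (int i = 0; i < e; i++)
        out *= in.base;
    return out;
}

/* The macro turns its arguments into a struct literal, so callers may pass
   them positionally, by name, or in any order when named. */
#define power(...) power_base((power_args){__VA_ARGS__})
```

Here power(3) gives 9, and power(.exponent=3, .base=2) and power(2, 3) both give 8. One cost of this design is that an explicit zero cannot be told apart from an omitted argument.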
**apop_plot_histogram now takes in a vector, bin count, and name of output. This is what it did in the first half of the year. The current version of apop_plot_histogram, which acts on a histogram model, is renamed to apop_histogram_plot. August 2008 --Constraints in ML work better. **Overhaul of some discrete choice models --Added tests for the probit and logit. --fixed a bug revealed by the tests **the first choice has a fixed value of zero. **You'll probably need to call Apop_category_settings_add before estimating your model, unless the outcome choice variable is the 0th column of the matrix. --more to come, e.g., multinomial probit will be merged with ordered probit. -Adding a settings group of a given type when that group already exists used to induce an error; now the old type is replaced with a clean default. --bug fix for apop_test_fisher_exact on non-square matrices --apop_settings_add and company do more work in functions and less in macros. --removed the settings_ct element of the apop_model; using a sentinel at the end of the array instead. --Slightly improved reading of text files. --Bootstrap/jackknife act on models with parameters in both matrix and vector form. July 2008 --Guts of apop_plot_histogram now use the apop_histogram model. ** Also, it no longer normalizes the histogram to integrate to one by default. You need to explicitly request this via apop_histogram_normalize --apop_plot finally deleted. --apop_histogram_plot deleted; use apop_plot_histogram. --Added apop_vector_skew_sample, apop_vector_kurtosis_sample. May 2008 --apop_settings_rm_group added. --mysql interface has the beginnings of support for multiple semicolon-separated queries in one call. --apop_histogram_refill_with_model ==> apop_histogram_model_reset; apop_histogram_refill_with_vector ==> apop_histogram_vector_reset. April 2008 --apop_dot handles names. --apop_t_test now behaves correctly when one vector is of length 1. --some improvements/fixes when dealing with mySQL. 
--apop_sv_decomp renamed to apop_matrix_pca. Minor changes so that it correctly works as such. --apop_text_to_(db|data) handles column names like it used to, which works better. Also a few other fixes for odd situations. March 2008 --Various improvements in reading in from text. -- a, "b, c", d will now correctly read in as three elements: a; then "b, c"; then d -- a,,b,c reads as a, NAN, b, c. February 2008 --Some of the header references didn't work for a fresh install. --bug fixes, esp. with apop_test_kolmogorov --added convenience fn apop_data_transpose January 2008 --Apop_assert, which streamlines the use of apop_error (thus shrinking the code base by 2%). --apop_OLS now has a log likelihood (also shuffled some of the code around) --bug fix in apop_binomial.p. **More name reform: apop_correlation_matrix --> apop_matrix_correlation; apop_data_correlation_matrix --> apop_data_correlation; apop_covariance_matrix --> apop_matrix_covariance; apop_data_covariance_matrix --> apop_data_covariance. --apop_count_(rows|cols)_in_text are now static functions and removed from the documentation. --Removed apop_random_beta, which had been set as deprecated earlier. Use apop_beta_from_mean_var. **Removed apop_vector_isnan---just use apop_vector_map_sum(your_vector, isnan) **Removed apop_vector_finite---just use apop_vector_bounded(your_vector, INFINITY) --For your convenience, added Apop_settings_alloc() macro. **apop_histogram_params ==> apop_histogram_settings **apop_kernel_density_params ==> apop_kernel_density_settings [The settings/params distinction is in some ways arbitrary anyway.] --bug fix in apop_mle.c: wasn't copying output parameters to the estimated model in some cases. --As part of name reform, all function names are being switched to lower-case throughout, so apop_ANOVA ==> apop_anova. Am keeping the old forms via macros. Notice also the non-yelling macro capitalizations above, such as Apop_assert. !!**Revised the settings for the apop_model. 
model_settings and method_settings are out, replaced by a much more organized single list of settings. --added the apop_lookup command-line program. **Renamed the apop_multinomial_logit model the apop_logit, because the binary logit is a special case that requires no special handling. **Reversed the signs on the probit coefficients, to better conform to the norm. December 2007 --added apop_vector_moving_average --apop_model now has prep and print methods. **apop_p, apop_log_likelihood, apop_score now take a pointer to an apop_model, not the model itself. --apop_multinomial_probit model --When a data set has matrix and vector, apop_dot accepts a 'v' to use the vector. --apop_plot_lattice produces a more attractive (and standard-form) plot. --apop_data_print works much better now. **apop_model no longer requires your data input to be const. It probably will be const, but it's not the interface's place to dictate that. **apop_data_unpack no longer allocates a new data set, but writes to an input data set assumed to be of the right size. --added apop_ANOVA to produce one- or two-way ANOVA tables from the database. **apop_test_ANOVA renamed to apop_test_ANOVA_independence to create a little more cognitive distance. --apop_data_text_to_factors --APOP_COL_T and APOP_ROW_T macros, to pull a column or row by name --apop_beta_from_mean_var produces a beta-distributed model with the right (alpha, beta) parameters. **Thus, apop_random_beta is now marked as deprecated. **apop_x_prime_sigma_x removed, on grounds of being silly. If you want it back, see model/apop_multivariate_normal.c, where it is now a static function. **apop_qq_plot --> apop_plot_qq --Multinomial logit model (and the probit now has names). **Revised the bin-syncing methods. apop_vectors_to_histogram and apop_model_to_histogram are now out; apop_histogram_refill_with_vector and apop_histogram_refill_with_model are in. **Also removed apop_model_test_goodness_of_fit as redundant. 
Just produce a histogram, use the above refill functions, and send your two histograms to apop_histograms_test_goodness_of_fit. If you do this often, you can write a convenience function to do that as quickly as I could. **apop_vector_replace and apop_matrix_replace are redundant---just use apop_(vector|matrix)_apply. --The covariance matrix is now produced via the derivative of the score function at the parameter. I follow Efron and Hinkley in using the estimated information matrix---the value of the information matrix at the estimated value of the score---not the expected information matrix that is the integral over all possible data. November 2007 --added apop_model_copy_set_string to get a copy of a model whose model_settings is just a string. **Thanks to this, folded all of the _rank versions of models into their base models. Set model_settings to "r" to use the rank version. --The default MLE method is now the Nelder-Mead Simplex algorithm, instead of the Fletcher-Reeves conjugate gradient. This is more conservative. --apop_(vector|matrix|matrix_all)_map_sum to get the sum of a function applied to a vector. E.g., find count of NaNs with apop_vector_map_sum(v, isnan); --apop_logit bug fix. October 2007 --apop_estimate now defaults to using MLEs, meaning you don't have to explicitly specify an estimate method for MLE models. --apop_crosstab_to_db reads both the matrix and text elements of the input apop_data set. --apop_system convenience function, to make C feel more like a scripting language. --Added some SQLite functions for mySQL compatibility: var_samp, var_pop, stddev_samp, stddev_pop, std. --Probit patched to not NaN for very unlikely parameter/data combinations. **apop_estimate_restart takes two models, rather than one model and some haphazard settings. --apop_plot_query no longer forces you to use the -d and -q switches to specify the database and query. 
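The November 2007 entry above counts NaNs with apop_vector_map_sum(v, isnan). The same map-then-sum idea can be sketched on a plain array without the library. map_sum and is_nan below are hypothetical stand-ins; the wrapper is there because standard C only guarantees isnan as a macro, which cannot be passed as a function pointer.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in for a map-sum: apply fn to each element of v
   and return the sum of the results. */
static double map_sum(const double *v, size_t n, double (*fn)(double)) {
    double total = 0;
    for (size_t i = 0; i < n; i++)
        total += fn(v[i]);
    return total;
}

/* Wrap the isnan macro in a real function that returns 1 for NaN, else 0. */
static double is_nan(double x) {
    return isnan(x) ? 1.0 : 0.0;
}
```

Calling map_sum(v, n, is_nan) then returns the number of NaN entries among the first n elements of v.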
**The two places that use regular expressions: apop_opts.db_nan and the search for a name via apop_data_get/set... use case-insensitive EREs. Before I'd been using BREs, which nobody likes. --Ctrl-C (SIGINT) stops the MLE searches, prints output, and continues the program. Especially useful for simulated annealing. [GDB tip: use the command: signal SIGINT ] --apop_mle_fix_params debugging: gradients work now. --apop_test_ANOVA added, to test the null hypothesis that all cells of a crosstab are equally likely. **apop_multivariate_normal_prob removed. Use the apop_multivariate_normal model and its .log_likelihood, .p, .draw, et cetera. **sed -i -e "s/apop_OLS_params/apop_ls_settings/g" -e "s/apop_mle_params/apop_mle_settings/g" *.c *.h September 2007 --The optimization methods now have an enumerated type. **apop_opts.mle_trace_path is now the trace_path element of the apop_mle_params struct. Also, it works much better. --apop_histogram_normalize function --improvements to apop_kernel_density and apop_histogram_print August 2007 --removed apop_model_template. Just copy one of the existing models. --apop_data_ptr_ti, apop_data_ptr_ii, apop_data_ptr_it --And you can never have too many bug fixes July 2007 --apop_binomial model takes two types of input now: a two-column form with hit count and miss count, and a list of binary hits or misses. --apop_lognormal model --bug fixes on Information matrix calculation. June 2007 [subversion ate this part; sorry.] May 2007 --apop_query_to_mixed_data **apop_produce_dummies now makes dummy variables from both data and text. This means that there's another parameter you need to set to 'd' or 't' to indicate what you want dummified. !!**Merged the apop_params and apop_model structures, leaving everything in just one struct. That's about all the merging left. --apop_text_alloc and apop_text_add to make text manipulation a little easier. --apop_matrix_apply_all and apop_matrix_map_all operate on all items in a matrix.
**small tweak to apop_vector_normalize interface. --apop_matrix_inverse and apop_matrix_determinant, because the apop_det_and_inv interface is sort of ugly. --APOP_SUBMATRIX macro --MLEs put the expected score in the ->more element of the returned apop_model. If people find this useful, we can maybe put a proper expected_score element in the model. --More consts in function headers. You can decide whether this is actually useful. 0.19 April 2007 !!**Eliminated the apop_estimate and apop_ep structures, replacing them with the apop_params structure. The apop_params + apop_model pair form a closure representing a parametrized model. Expect the uses element to go away soon; after that, things should be stable. Parameters for individual methods now have their own space; try apop_ml_params_alloc and apop_OLS_params_alloc, for example. If you are just doing things like apop_estimate_show(apop_OLS.estimate(data,NULL)); then don't worry, but if you are doing a lot with the input parameters, then have a look at Chapter 6 of _Modeling with Data_. --apop_histogram model **Gradually rewriting the histogram functions from before to make use of that model. E.g., eliminated apop_vector_to_cmf. **apop_line_to_data fixed to use both vector and matrix terms. Now requires arguments: (indata, vsize, m1size, m2size). --MLE now approximates the Information matrix using data gathered during the MLE search. This is wrong but cheap; right but expensive procedures forthcoming. [Hint: Simulated annealing gathers more info.] --apop_(data|matrix|vector)_fill functions, which are a touch fragile, but very useful when used with care. **apop_data_(get|set)_(tn|nt) changed to (ti|it), because n could stand for name or number, while i stands for index, and is often used for integers. --apop_names now have a title element, so you can give your data structures a title. **apop_params_alloc takes an apop_model, not an &apop_model. It's more natural that way. 
**Finally eradicated every last vestige of inventories: apop_params no longer has a .uses element. Instead, apop_specific_model_params may have a want_cov, want_expected_value, want_whatever element if the element is optional. And really, the parameters themselves should never be optional. What was I thinking. **apop_model_fix_params now sets up and returns an apop_mle_params object, thus resolving the problem that the MLE params need a model input, and the model_fix_params model needed an MLE params input. 0.18 March 2007 **apop_text_to_db now assumes column names unless you specify -nc. --If you set parameter_ct==0 in your model definition, the MLEs will assign parameter_ct == the number of columns in your data set. --Missing data functions: apop_listwise_delete and apop_ml_imputation **The constraint element of the apop_model now takes a void* parameter, like it should have all this time. **apop_jackknife (1) renamed to apop_jackknife_cov and (2) now actually works. **Entirely eliminated the apop_inventory structure. Its sole utility is inside the apop_ep struct. **Changed the RNG interface for the sake of allowing multidimensional draws. [Not that I have any functions that do that right now.] --Bayes-oriented MCMC algorithm: apop_mcmc_update !!Bayesian model generator: apop_update **apop_model parameters is now an apop_model. See the documentation of the model for all the changes. **apop_data_alloc now takes three arguments: vsize, msize1, msize2. To update, just put a 0, at the head of the arg list. !!An absolutely fabulous apop_linear_constraint function. --Produce a model with some parameters fixed via apop_model_fix_params. --apop_beta model February 2007 **Apop_sv_decomposition has a slightly nicer interface. **data->categories was too much to type, and too specific. The apop_data struct now has a data->text element, and textsize[0] and [1]. The categories element is linked to this, but is now deprecated.
January 2007 **Root finding hooked into the max likelihood fns. 0.17 December 2006 **Apop_model struct has lost the fdf object, which was annoying, and now has the p function. --mySQL support. **apop_query_to_chars now returns an apop_data structure, so you don't have to go back and gather column names and dimensions. **apop_name_get (and the apop_get_tt family) now use regular expressions instead of SQL's LIKE operator. This is _much_ faster. --the apop_distribution model. November 2006 --The preeminently useful APOP_COL and APOP_ROW --apop_data_calloc --apop_vector_(apply|map) debugged. **apop_estimation_params is just too darn long; reduced to apop_ep. October 2006 --apop_text_to_db now reads from STDIN. **deleted apop_query_db; use apop_query. --Kolmogorov-Smirnov test. --apop_t_test returns GSL_NAN when given a one-element vector instead of hanging. --no more soft links in the tgz file==>may work better on Windows machines. --apop_(vector|matrix)_(map|apply) will apply a function to every row of a matrix or every element of a vector. The map functions return a gsl_vector. --bug fix in apop_test_goodness_of_fit. September 2006 **Removed apop command-line server thing. It was interesting, but that's the best that could be said of it. --Added functions for weighted data: weighted least squares, weighted moments. --apop_vector_percentiles now allows for averaging instead of rounding. August 2006 **apop_log_likelihood and friends now demand that data be apop_data*, rather than void*. Too many things broke when users gave non-apop_data data. --bug fixes July 2006 --Many more checks for NULL ==> more robust code and easier debugging. --Bug fixes. **apop_data_split, and apop_data_stack has been revised to handle the idea that a vector is the -1st element of the matrix. I.e., check your code if you're trying to merge matrices without merging the vectors. --Lattice plots. --Convenience t-tests from inside model estimations are fixed. --apop_query_to_vector. 
--apop_opts.output_type == 'p' to print to apop_opts.output_pipe **apop_..._print and apop_..._show now work out whether elements are integers (if (val == (int) val)...), and print accordingly. This means that apop_..._print_int and apop_..._show_int are basically obsolete, and have been removed. --Apop_OLS now allows weights. --Test library now includes a few NIST certified tests. June 2006 --added preprocessor cruft to let the library work for C++ --Jackknife revised May 2006 **The apop_model no longer includes an inventory. I leave it to the estimate function to do its own allocation. April 2006 **apop_matrix_normalize and apop_vector_normalize had different numbers for the same normalizations. Was that ever dumb. Also, I've switched to chars instead of ints to signify this stuff, for better mnemonics without resorting to the APOP_ENUM_YOU_HAVE_TO_LOOK_UP_EVERY_TIME_BECAUSE_ITS_SO_LONG sort of thing. If you were using apop_matrix_normalize(data, 0) before, you need to change that to using apop_matrix_normalize(data, 'm'). Thus, apop_vector_normalize now has one more normalization, for a total of four for both. --Added apop_rng_alloc convenience fn. --Added apop_strip_dots to keep inputs to the database healthy. --apop_name_find uses LIKE instead of strcmp. --a fn to calculate the generalized harmonic. --A whole section on histograms and goodness-of-fit tests. --apop_data_set fns to go with the apop_data_get fns. --apop_data now includes a vector type --apop_estimate.parameters is now an apop_data type. --apop_estimate.names is thus obsolete. March 2006 **apop_inventory is now a subset of apop_estimation_params. Implications: --added apop_estimation_params_alloc() to ensure that inventory is set right. --the model.estimate(data, inv, params) method is now model.estimate(data, params) model.estimate(data, NULL) still does what the user expects it to. This makes structural sense, but will lightly break any existing code. 
fix: change apop_inventory *inv = apop_inventory_set(1); model.estimate(data, inv, NULL); to apop_estimation_params *ep = apop_estimation_params_alloc(); model.estimate(data, ep); and in any apop_estimates, change any use of est->uses to est.estimation_params.uses. **Next apop_estimate reform: y_values and residuals combined into one apop_data table with actual, predicted, residual columns. --obviated the need for a 'dependent' element in apop_names; removed that. If you need the name, it's now your_est->dependent->names->colnames[0]. **your_estimate->covariance is now an apop_data set instead of a gsl_matrix. **the data element of the apop_matrix structure is now named matrix. So instead of data_set->data, use data_set->matrix, and instead of estimate->data->data->data, you can use estimate->data->matrix->data. --The command-line utility has been revisited, and can do a few more things, like OLS. --Simulated annealing --added convenience fns apop_vector_distance(pt1,pt2) and apop_vector_grid_distance(pt1,pt2) **Apop_data_memcpy no longer malloc()s for you, for comparability with the world's other memcpy fns. If you want mallocing, use apop_data_copy. --apop_test_fisher_exact(). Cut 'n' pasted from R, who cut 'n' pasted it from somebody else. Despite being the same code, it runs fifty (50) times faster from Apophenia. February 2006 --sort-of-adaptive MLE: use apop_estimate_restart to execute a new MLE search beginning where the last one ended, perhaps using a new method or rescaled parameters. --This needed convenience functions to check for divergence, thus added apop_vector_finite, apop_vector_bounded, apop_vector_isnan. **apop_db_to_crosstab now returns an apop_data set instead of a gsl_matrix. Also, it finally works with column headers that aren't numeric. **stats like apop_mean are now apop_vector_mean, following the proper pkg-noun-verb naming scheme. --Textbook is much improved. --apop_vector_to_pdf convenience fn. 
--Some of the fns that used to be of the form apop_get_something(input, &output); are now of the more natural out = apop_get_something(input); This includes apop_array_to_vector and apop_array_to_matrix --bootstrapping works, and works with apop_models. --apop_poisson model 0.15 January 2006 Added an apop_opts structure for options. Allowed the following changes: --apop_verbose is now apop_opts.verbose. Try this on your existing code: perl -pi.bak -e 's/apop_verbose/apop_opts.verbose/g' *.c *.h --the output functions now output into three formats: on screen, to file, to db; see chapter five of the manual. --F tests --R squared. 0.14 December 2005 The apop_data structure, which is just a shell for a gsl_matrix and an apop_name. Was just sick of sending names following around my tables. Lets us keep both numerical and categorical data in one place; kind of like R's data frame. --Added linear model objects: OLS, GLS. This means that what had been the apop_OLS function is now the apop_estimate_OLS function, and where it used to take in a gsl_matrix and an apop_name, now it takes an apop_data structure and a NULL. So you'll have to modify your code accordingly. --A function to generate dummy variables, useful in conjunction with --Functions to stack matrices and apop_data sets. Even an apop_partitioned_OLS function, that will only practically work for small data sets. --pow(.,.) in the database. I can't believe I dealt with SQL this long w/o it. 0.13 December 2005 --The apop_model object. This was a big deal that deserves more than just one line; see the manuals. 0.12 mid November 2005 --Bar charts (assuming you've got Gnuplot > 4.1) --percentiles (in case you haven't got it) **redid MLE system so you can pick among the many options now available. As a part of this: --better handling of constraints. --numerical gradients. --numerical Hessians.
0.12 early November 2005, post-hiatus --You now have three maximum likelihood estimators to choose from: the GSL's no-gradient, the GSL's with-gradient, and Mr. WN's autocalculated gradient. --If you haven't seen it before, the apop_distribution structure is increasingly well-supported. It allows the user to specify the features of the Max. Likelihood model in a consistent manner which facilitates things like comparing two models. --I'd still suggest taking the Waring and Yule distributions with care; everything else seems to check out. 0.12 September 2005 **The distributions are now objects, which just provides a neat way of grouping together the half-dozen functions which are associated with any one distribution. 0.11 September 2005 --command-line server is much improved. I actually do work with it. --Documentation is now via doxygen. --asst bug fixes. --Have started to take plotting (via gnuplot) seriously --a limited test suite. Try: make test. 0.10 August 2005 --This version includes a server to park itself in memory and receive data processing requests. The intent is that one can then do analysis from the command line or a Perl/Python/Whatever script. The client/server works in the sense of handling a handful of requests without segfaulting, but remains in proof-of-concept stage. --Added apop_merge_db for joining databases, both via C and command line --Run t tests from the cmd line or the database. 0.09 July 2005 --Flattened the relatively complex vasprintf subsystem from GNU, so if you've been having trouble compiling on non-GNU systems, try again. Added two little command-line programs. Also, added more little functions which aren't very interesting, like t-tests; maybe you'll stumble upon them.
0.08 May 2005 --OLS/GLS/MLE now properly support the apop_estimate structure --Column names 0.07 April 2005 --uses the apop_estimate structure to return heaps of data from regressions & MLEs --uses the apop_model structure 0.06 --var(x), skew(x), kurtosis(x) added to SQL understood by Apophenia. 0.05 --added a little crosstab utility --queries now accept printf-type arguments. ==>GNU vasprintf was added. ==>updated to work with autoconf 1.7 apophenia-1.0+ds/INSTALL000066400000000000000000000366101262736346100147710ustar00rootroot00000000000000Installation Instructions ************************* Copyright (C) 1994-1996, 1999-2002, 2004-2013 Free Software Foundation, Inc. Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without warranty of any kind. Basic Installation ================== Briefly, the shell command `./configure && make && make install' should configure, build, and install this package. The following more-detailed instructions are generic; see the `README' file for instructions specific to this package. Some packages provide this `INSTALL' file but do not implement all of the features documented below. The lack of an optional feature in a given package is not necessarily a bug. More recommendations for GNU packages can be found in *note Makefile Conventions: (standards)Makefile Conventions. The `configure' shell script attempts to guess correct values for various system-dependent variables used during compilation. It uses those values to create a `Makefile' in each directory of the package. It may also create one or more `.h' files containing system-dependent definitions. Finally, it creates a shell script `config.status' that you can run in the future to recreate the current configuration, and a file `config.log' containing compiler output (useful mainly for debugging `configure'). 
It can also use an optional file (typically called `config.cache' and enabled with `--cache-file=config.cache' or simply `-C') that saves the results of its tests to speed up reconfiguring. Caching is disabled by default to prevent problems with accidental use of stale cache files. If you need to do unusual things to compile the package, please try to figure out how `configure' could check whether to do them, and mail diffs or instructions to the address given in the `README' so they can be considered for the next release. If you are using the cache, and at some point `config.cache' contains results you don't want to keep, you may remove or edit it. The file `configure.ac' (or `configure.in') is used to create `configure' by a program called `autoconf'. You need `configure.ac' if you want to change it or regenerate `configure' using a newer version of `autoconf'. The simplest way to compile this package is: 1. `cd' to the directory containing the package's source code and type `./configure' to configure the package for your system. Running `configure' might take a while. While running, it prints some messages telling which features it is checking for. 2. Type `make' to compile the package. 3. Optionally, type `make check' to run any self-tests that come with the package, generally using the just-built uninstalled binaries. 4. Type `make install' to install the programs and any data files and documentation. When installing into a prefix owned by root, it is recommended that the package be configured and built as a regular user, and only the `make install' phase executed with root privileges. 5. Optionally, type `make installcheck' to repeat any self-tests, but this time using the binaries in their final installed location. This target does not install anything. Running this target as a regular user, particularly if the prior `make install' required root privileges, verifies that the installation completed correctly. 6. 
You can remove the program binaries and object files from the source code directory by typing `make clean'. To also remove the files that `configure' created (so you can compile the package for a different kind of computer), type `make distclean'. There is also a `make maintainer-clean' target, but that is intended mainly for the package's developers. If you use it, you may have to get all sorts of other programs in order to regenerate files that came with the distribution.

  7. Often, you can also type `make uninstall' to remove the installed files again. In practice, not all packages have tested that uninstallation works correctly, even though it is required by the GNU Coding Standards.

  8. Some packages, particularly those that use Automake, provide `make distcheck', which can be used by developers to test that all other targets like `make install' and `make uninstall' work correctly. This target is generally not run by end users.

Compilers and Options
=====================

Some systems require unusual options for compilation or linking that the `configure' script does not know about. Run `./configure --help' for details on some of the pertinent environment variables.

You can give `configure' initial values for configuration parameters by setting variables in the command line or in the environment. Here is an example:

     ./configure CC=c99 CFLAGS=-g LIBS=-lposix

*Note Defining Variables::, for more details.

Compiling For Multiple Architectures
====================================

You can compile the package for more than one kind of computer at the same time, by placing the object files for each architecture in their own directory. To do this, you can use GNU `make'. `cd' to the directory where you want the object files and executables to go and run the `configure' script. `configure' automatically checks for the source code in the directory that `configure' is in and in `..'. This is known as a "VPATH" build.
With a non-GNU `make', it is safer to compile the package for one architecture at a time in the source code directory. After you have installed the package for one architecture, use `make distclean' before reconfiguring for another architecture. On MacOS X 10.5 and later systems, you can create libraries and executables that work on multiple system types--known as "fat" or "universal" binaries--by specifying multiple `-arch' options to the compiler but only a single `-arch' option to the preprocessor. Like this: ./configure CC="gcc -arch i386 -arch x86_64 -arch ppc -arch ppc64" \ CXX="g++ -arch i386 -arch x86_64 -arch ppc -arch ppc64" \ CPP="gcc -E" CXXCPP="g++ -E" This is not guaranteed to produce working output in all cases, you may have to build one architecture at a time and combine the results using the `lipo' tool if you have problems. Installation Names ================== By default, `make install' installs the package's commands under `/usr/local/bin', include files under `/usr/local/include', etc. You can specify an installation prefix other than `/usr/local' by giving `configure' the option `--prefix=PREFIX', where PREFIX must be an absolute file name. You can specify separate installation prefixes for architecture-specific files and architecture-independent files. If you pass the option `--exec-prefix=PREFIX' to `configure', the package uses PREFIX as the prefix for installing programs and libraries. Documentation and other data files still use the regular prefix. In addition, if you use an unusual directory layout you can give options like `--bindir=DIR' to specify different values for particular kinds of files. Run `configure --help' for a list of the directories you can set and what kinds of files go in them. In general, the default for these options is expressed in terms of `${prefix}', so that specifying just `--prefix' will affect all of the other directory specifications that were not explicitly provided. 
The most portable way to affect installation locations is to pass the correct locations to `configure'; however, many packages provide one or both of the following shortcuts of passing variable assignments to the `make install' command line to change installation locations without having to reconfigure or recompile. The first method involves providing an override variable for each affected directory. For example, `make install prefix=/alternate/directory' will choose an alternate location for all directory configuration variables that were expressed in terms of `${prefix}'. Any directories that were specified during `configure', but not in terms of `${prefix}', must each be overridden at install time for the entire installation to be relocated. The approach of makefile variable overrides for each directory variable is required by the GNU Coding Standards, and ideally causes no recompilation. However, some platforms have known limitations with the semantics of shared libraries that end up requiring recompilation when using this method, particularly noticeable in packages that use GNU Libtool. The second method involves providing the `DESTDIR' variable. For example, `make install DESTDIR=/alternate/directory' will prepend `/alternate/directory' before all installation names. The approach of `DESTDIR' overrides is not required by the GNU Coding Standards, and does not work on platforms that have drive letters. On the other hand, it does better at avoiding recompilation issues, and works well even when some directory options were not specified in terms of `${prefix}' at `configure' time. Optional Features ================= If the package supports it, you can cause programs to be installed with an extra prefix or suffix on their names by giving `configure' the option `--program-prefix=PREFIX' or `--program-suffix=SUFFIX'. Some packages pay attention to `--enable-FEATURE' options to `configure', where FEATURE indicates an optional part of the package. 
They may also pay attention to `--with-PACKAGE' options, where PACKAGE is something like `gnu-as' or `x' (for the X Window System). The `README' should mention any `--enable-' and `--with-' options that the package recognizes. For packages that use the X Window System, `configure' can usually find the X include and library files automatically, but if it doesn't, you can use the `configure' options `--x-includes=DIR' and `--x-libraries=DIR' to specify their locations. Some packages offer the ability to configure how verbose the execution of `make' will be. For these packages, running `./configure --enable-silent-rules' sets the default to minimal output, which can be overridden with `make V=1'; while running `./configure --disable-silent-rules' sets the default to verbose, which can be overridden with `make V=0'. Particular systems ================== On HP-UX, the default C compiler is not ANSI C compatible. If GNU CC is not installed, it is recommended to use the following options in order to use an ANSI C compiler: ./configure CC="cc -Ae -D_XOPEN_SOURCE=500" and if that doesn't work, install pre-built binaries of GCC for HP-UX. HP-UX `make' updates targets which have the same time stamps as their prerequisites, which makes it generally unusable when shipped generated files such as `configure' are involved. Use GNU `make' instead. On OSF/1 a.k.a. Tru64, some versions of the default C compiler cannot parse its `' header file. The option `-nodtk' can be used as a workaround. If GNU CC is not installed, it is therefore recommended to try ./configure CC="cc" and if that doesn't work, try ./configure CC="cc -nodtk" On Solaris, don't put `/usr/ucb' early in your `PATH'. This directory contains several dysfunctional programs; working variants of these programs are available in `/usr/bin'. So, if you need `/usr/ucb' in your `PATH', put it _after_ `/usr/bin'. On Haiku, software installed for all users goes in `/boot/common', not `/usr/local'. 
It is recommended to use the following options: ./configure --prefix=/boot/common Specifying the System Type ========================== There may be some features `configure' cannot figure out automatically, but needs to determine by the type of machine the package will run on. Usually, assuming the package is built to be run on the _same_ architectures, `configure' can figure that out, but if it prints a message saying it cannot guess the machine type, give it the `--build=TYPE' option. TYPE can either be a short name for the system type, such as `sun4', or a canonical name which has the form: CPU-COMPANY-SYSTEM where SYSTEM can have one of these forms: OS KERNEL-OS See the file `config.sub' for the possible values of each field. If `config.sub' isn't included in this package, then this package doesn't need to know the machine type. If you are _building_ compiler tools for cross-compiling, you should use the option `--target=TYPE' to select the type of system they will produce code for. If you want to _use_ a cross compiler, that generates code for a platform different from the build platform, you should specify the "host" platform (i.e., that on which the generated programs will eventually be run) with `--host=TYPE'. Sharing Defaults ================ If you want to set default values for `configure' scripts to share, you can create a site shell script called `config.site' that gives default values for variables like `CC', `cache_file', and `prefix'. `configure' looks for `PREFIX/share/config.site' if it exists, then `PREFIX/etc/config.site' if it exists. Or, you can set the `CONFIG_SITE' environment variable to the location of the site script. A warning: not all `configure' scripts look for a site script. Defining Variables ================== Variables not defined in a site shell script can be set in the environment passed to `configure'. However, some packages may run configure again during the build, and the customized values of these variables may be lost. 
In order to avoid this problem, you should set them in the `configure' command line, using `VAR=value'. For example: ./configure CC=/usr/local2/bin/gcc causes the specified `gcc' to be used as the C compiler (unless it is overridden in the site shell script). Unfortunately, this technique does not work for `CONFIG_SHELL' due to an Autoconf limitation. Until the limitation is lifted, you can use this workaround: CONFIG_SHELL=/bin/bash ./configure CONFIG_SHELL=/bin/bash `configure' Invocation ====================== `configure' recognizes the following options to control how it operates. `--help' `-h' Print a summary of all of the options to `configure', and exit. `--help=short' `--help=recursive' Print a summary of the options unique to this package's `configure', and exit. The `short' variant lists options used only in the top level, while the `recursive' variant lists options also present in any nested packages. `--version' `-V' Print the version of Autoconf used to generate the `configure' script, and exit. `--cache-file=FILE' Enable the cache: use and save the results of the tests in FILE, traditionally `config.cache'. FILE defaults to `/dev/null' to disable caching. `--config-cache' `-C' Alias for `--cache-file=config.cache'. `--quiet' `--silent' `-q' Do not print messages saying which checks are being made. To suppress all normal output, redirect it to `/dev/null' (any error messages will still be shown). `--srcdir=DIR' Look for the package's source code in directory DIR. Usually `configure' can determine that directory automatically. `--prefix=DIR' Use DIR as the installation prefix. *note Installation Names:: for more details, including other options available for fine-tuning the installation locations. `--no-create' `-n' Run the configure checks, but stop before creating any output files. `configure' also accepts some other, not widely useful, options. Run `configure --help' for more details. 
apophenia-1.0+ds/Makefile.am000066400000000000000000000036041262736346100157710ustar00rootroot00000000000000
ACLOCAL_AMFLAGS = -I m4

AUTOMAKE_OPTIONS = \
	dist-xz \
	dist-bzip2 \
	dist-zip

AM_DISTCHECK_CONFIGURE_FLAGS ?= \
	--disable-maintainer-mode \
	--enable-extended-tests

AM_CFLAGS = -g -Wall -O3

## Library versioning (C:R:A == current:revision:age)
## 0.999b 0:0:0
## 0.999c 1:0:0
## 0.999e 2:0:0
LIBAPOPHENIA_LT_VERSION = 2:0:0

SUBDIRS = transform model . cmd eg tests docs

include_HEADERS = apop.h

pkgconfigdir = $(libdir)/pkgconfig
pkgconfig_DATA= apophenia.pc

lib_LTLIBRARIES = libapophenia.la

libapophenia_la_LD_VERSION_SCRIPT=
if HAVE_LD_VERSION_SCRIPT
libapophenia_la_LD_VERSION_SCRIPT+= -Wl,--version-script=$(top_srcdir)/apophenia.map
endif

SUBLIBS = \
	libapopkernel.la \
	transform/libapoptransform.la \
	model/libapopmodel.la

libapophenia_la_SOURCES = \
	asprintf.c

noinst_LTLIBRARIES = libapopkernel.la

noinst_HEADERS = apop_internal.h

libapopkernel_la_SOURCES = \
	apop_arms.c \
	apop_asst.c \
	apop_bootstrap.c \
	apop_conversions.c \
	apop_data.c \
	apop_db.c \
	apop_fexact.c \
	apop_hist.c \
	apop_linear_algebra.c \
	apop_linear_constraint.c \
	apop_mapply.c \
	apop_mcmc.c \
	apop_missing_data.c \
	apop_mle.c apop_model.c \
	apop_name.c \
	apop_output.c \
	apop_rake.c \
	apop_regression.c \
	apop_settings.c \
	apop_sort.c \
	apop_stats.c \
	apop_tests.c \
	apop_update.c \
	apop_vtables.c

apop_db_INCLUDES = \
	apop_db_mysql.c \
	apop_db_sqlite.c

apop_db.c: $(apop_db_INCLUDES)

libapopkernel_la_CFLAGS = \
	$(PTHREAD_CFLAGS) \
	$(OPENMP_CFLAGS) \
	$(MYSQL_CFLAGS) \
	$(SQLITE3_CFLAGS) \
	$(GSL_CFLAGS)

libapophenia_la_LDFLAGS = \
	-version-info $(LIBAPOPHENIA_LT_VERSION) \
	$(libapophenia_la_LD_VERSION_SCRIPT)

libapophenia_la_LIBADD = \
	$(SUBLIBS) \
	$(MYSQL_LDFLAGS) \
	$(SQLITE3_LDFLAGS) \
	$(GSL_LIBS) \
	$(PTHREAD_LIBS) \
	$(LIBM)

EXTRA_DIST = \
	rpm.spec \
	apophenia.pc.in \
	apophenia.map

EXTRA_DIST += \
	$(apop_db_INCLUDES)

## compatibility
doc:
	-$(MAKE) -C docs doc
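The C:R:A triple passed to libtool's -version-info in the Makefile.am above is not the file suffix itself. As a rough illustration only, the sketch below computes the suffix libtool derives on Linux-style systems, (current - age).age.revision; other platforms use other schemes. The helper name lt_linux_suffix is hypothetical and is not part of Apophenia or libtool.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: format the Linux-style shared-library suffix that
   libtool derives from -version-info current:revision:age.
   Returns 0 on success, -1 on invalid input or a too-small buffer. */
static int lt_linux_suffix(int current, int revision, int age,
                           char *out, size_t len){
    /* age counts how many earlier interfaces are still supported,
       so it can never exceed current. */
    if (current < 0 || revision < 0 || age < 0 || age > current) return -1;
    int n = snprintf(out, len, "%d.%d.%d", current - age, age, revision);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Under that assumption, LIBAPOPHENIA_LT_VERSION = 2:0:0 gives the suffix 2.0.0, i.e. a file named along the lines of libapophenia.so.2.0.0 on such a system.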
apophenia-1.0+ds/NEWS000066400000000000000000001362431262736346100144420ustar00rootroot00000000000000
Since April 2014, this changelog includes only those changes that break existing code or are especially notable. For a full list of changes, see the Git history at https://github.com/b-k/Apophenia/commits/master

[Key, pre-March 2014:
--Addition or improvement
**Change that could require recoding existing code.
!!Big.
]

October 2014
** apop_model_stack --> apop_model_cross

August 2014
** default for apop_data_pack is .all_pages='y' (was 'n').
** remove apop_plot_lattice, apop_plot_triangle, apop_plot_line_and_scatter, apop_plot_qq. Find them at https://github.com/b-k/apophenia/wiki/gnuplot_snippets .

March 2014
--Command-line tools print help should a user add a --help option.

February 2014
!!OpenMP for threading. All calls to apop_map and friends, apop_model_draws, and others auto-thread.
!!apop_rng_get_thread to get a thread-specific RNG, so you can thread random processes.

January 2014
**View macro reform:
    Apop_cols        Apop_rows        A contiguous set of columns or rows as an apop_data set (with names)
    Apop_col         Apop_row         One column or row as an apop_data set
    Apop_col_t       Apop_row_t       One column or row as an apop_data set, retrieved by row/col name
    Apop_col_v       Apop_row_v       One column or row as a gsl_vector
    Apop_col_tv      Apop_row_tv      One column or row as a gsl_vector, retrieved by row/col name
    Apop_matrix_col  Apop_matrix_row  One column or row as a gsl_matrix
**MLE methods are now strings instead of all-caps enums.
**All blank elements of a data->text grid point to the same NUL string.
--Add apop_model_metropolis; revise apop_update accordingly.
--apop_draw uses metropolis to draw from any model with a log likelihood/p and where data size>1.
**Replace all instances of output_file with output_name (GNU sed -i 's/output_file/output_name/g' *c)
--Consolidated headers
**apop.h no longer #includes time.h or unistd.h
**apop_draw returns zero on success; nonzero on failure.
**Removed the BIC-by-cells from estimation output. Added AIC_c. OLS now reports the ICs along with R^2. December 2013 --append and replace options for apop_text_to_db --apop_probit bug fix **apop_plot_histograms now uses gnuplot's impulses, not boxes by default---handles missing zero bins better. **MLE path trace lists the probs/loglikelihoods in the vector of the apop_data set it produces. path is apop_data**, from apop_data*. **apop_data_transpose has an .inplace option, which is 'y' by default. Add .inplace='n' to existing uses. --siman checks constraints for the starting point. --Mixture models overhauled. --cleaned up the command-line utilities **removed apop_lookup. November 2013 --rewrite apop_data_sort to allow sorting by multiple columns or text or names --apop_pmf now has a CDF method. --fixed up K-S tests. --removed the Swig Python wrapper from this package. **replaced char apop_opts.db_nan[101] with char *apop_opts.nan_string. More descriptive, easier to use. **apop_name_find does plain case-insensitive search; no regexes. October 2013 **all built-in models (apop_ols, apop_dirichlet, ...) are now apop_model* (ptr-to-struct), from apop_model (plain struct). **apop_estimate and apop_copy take in an apop_model* instead of plain apop_model. **printing no longer part of the apop_model struct; uses a vtable. September 2013 **change vbase, m1base, m2base ==> vsize, msize1, msize2 **Estimate returns void (was apop_model*) --vtable mechanism improvements **Remove score, predict, and parameter_model from the apop_model object; use the vtable mechanism. **Upgrade model p, ll, cdf, constraint to return long double (was double) **consolidate vector_var and vector_weighted_var. same with cov, mean, weighted_skew, and weighted_pop. Users just have to replace apop_vector_weighted_var with apop_vector_var. **removed deprecated.h entirely. --apop_data_add_named_elmts puts new data in the vector, not the matrix, because it is intended for a list of scalars (==a vector). 
If you use apop_data_get(infodata, .rowname="statistic name") then you'll be able to retrieve the element either way. **removed apop_line_to_data and apop_line_to_matrix. Use apop_data_fill and apop_data_falloc. August 2013 --apop_map(_sum) properly threads data-row mappings. .inplace='v' to return NULL. **Remove apop_settings_alloc, apop_settings_group_alloc **Change Apop_row to return an apop_data set, not a vector (for which use Apop_matrix_row). **Apop_settings_set sets model->error='s' on error, instead of returning. **Add a .want_path='y' setting to your apop_mle_settings group, and I'll put a list of the points tried by the optimizer in an apop_data set named path (in the settings group); see documentation for details. Remove the former trace_path mechanism. **removed apop_(vector|matrix)_increment. Use, e.g., *gsl_vector_ptr(v, 7) += 3; or (*gsl_vector_ptr(v, 7))++. **Some mu and sigmas => μ and σ **Removed apop_settings_alloc, apop_settings_group_alloc **Change Apop_row to return an apop_data set, not a vector (for which use Apop_matrix_row). June 2013 --replaced makefile in base directory with ./configure. --version number now equals build date. **name->title is a ptr; name->column => name->col **removed apop_strip_dots; it's up to the user to give reasonable names for the db. May 2013 --jacobian transformations --Apop_model_copy_set to copy a model and add a settings group at once --mixture models --data-data composition --added apop_model_draws !!vtables, allowing for more functions with special cases for certain model(s) outside of the model object itself. April 2013 --plugged some memory leaks --default tolerance for MLE is much finer (1e-5). --Added apop_text_fill **Finally removed support for gsl_histograms, including the apop_histogram model. This cut has been promised for about four years now. Use the apop_pmf instead. **apop_data_to_bins no longer modifies the input data in place. It now makes a copy and modifies the copy. 
**Removed apop_crosstab_to_pmf. There's a version at https://github.com/b-k/Apophenia/wiki/Crosstab-to-PMF for matrices. **Removed apop_vector_to_array. If your array has stride 1, use your_array->data; else write a for loop to copy out the data. **Removed apop_array_to_matrix. March 2013 **apop_text_paste now prints the pasted string at verbosity level 3 (formerly 2). February 2013 --Starting_point in Bayesian updating no longer does anything, but it was never significant to begin with. Added some verbosity options to apop_update. January 2013 --Logit regression much smarter about picking a starting point. --Defaults for simulated annealing try 1600x fewer points. Prior settings were overkill. --configure.ac checks for native asprintf and uses it if it is present **Removed apop_db_merge and apop_db_merge_table. Get them from http://modelingwithdata.org/arch/00000141.html . **Removed apop_matrix_fill; use apop_data_fill **Removed apop_array_to_data. **Removed apop_matrix_correlation; use apop_data_correlation **Deprecated apop_rank_compress; use apop_data_to_dummies(..., .keep_first='y'). --apop_model_stack December 2012 --Added an apop_pmf_settings group, eliminating a couple of hacks (e.g., see October 2012). --faster read-in of text files --Exponential model uses data in both the vector and matrix parts of the apop_data input --transformation to generate mixtures (i.e., linear combinations) of models. **apop_data_transpose now transposes the text element as well as the matrix (by default). Use apop_data_transpose(your_matrix, .transpose_text='n') to replicate the previous behavior. --removed Autoconf pkg-config macros, because Autoconf no longer needs the help. --writing apop_data to DB uses prepared queries where possible => much faster. **It is now up to you to put apop_query("begin"); / apop_query("commit"); wrappers around writing of tables to the database. 
November 2012 --apop_vector_unique_elements, apop_data_to_dummies, and apop_data_to_factors handle NaNs better; put them at the end of the sort order. !!Finally added an .error element to apop_data and apop_model structs, thus simplifying error-checking. --Apop_stopif macro, rendering the Apop_assert family largely obsolete (so if you're using them in your own work, consider them deprecated...). --Where the assert macros used to abort() on errors, they now send signal(SIGTRAP), making debugging a little easier. Most host systems force an abort on SIGTRAP anyway. October 2012 **Removed apop_strcmp. If you still need it, this macro is basically equivalent: #define apop_strcmp(a, b) (((a)&&(b) && !strcmp((a), (b))) || (!(a) && !(b))) --clean up of parameter-fixing model transformation. --split off multiple imputation variance code. **Removed apop_vector_grid_distance. Use apop_vector_distance(v1, v2, .metric='M'); --If apop_pmf.dsize==0, apop_pmf.draw returns a row number, not the data in that row. This will change shortly. --Logit draw function, akin to apop_ols.draw. Both will change shortly. --apop_data_to_factors now auto-allocates a matrix if need be (because it always auto-allocated a vector). September 2012 --\0 in text files counts as white space --fixed counting bug for text files with , sequences. July 2012 --some MLE cleanup --fixes to apop_rake --apop_logit.score fixed June 2012 --The sample kurtosis calculation is still more precise. May 2012 --Autotools improvements. Use the standard 'make check' instead of the ad hoc 'make test'. !!Set apop_opts.stop_on_warning='n' to never abort() on any type of error. E.g., GUIs that should never halt will use this. Default is still to halt on errors, because that's most useful for interactively developing numeric analyses. --Use apop_opts.log_file=fopen("yourlog", "w") to divert the warnings/errors from stderr into yourlog. --Some formerly void functions now return an int, to return an error code. 
  E.g.,
    apop_opts.stop_on_warning='n';
    if (apop_data_set(data, row, col)) printf("Error! Nothing was set.\n");
--Fixed a memory leak in simulated annealing.
--Apop_data_row and apop_data_set_row handle row names

March 2012
--bug fix in apop_text_to_data when input file has no names.
--probit dlog likelihood isn't implemented for N>2; now acknowledging this.
--added apop_data_get_factor_names
--apop_(vector|matrix)_(map|apply) now accepts NULL input.

January 2012
**Removed apop_matrix_var_m, which nobody was using.
--bug fix in apop_vector_distance for Ln norms where n is odd.
--reading data from text files rewritten; much more robust.
**A space is no longer a default delimiter. Use apop_opts.input_delimiters="| ,\t" to restore old default.
**apop_opts.db_nan is no longer a regex; I just do a case-insensitive comparison.

December 2011
--apop_model.textsize is now a size_t instead of an int.
--apop_update accepts likelihoods with no pre-set parameters

November 2011
--moved to Github; some changes to structure and documentation to accommodate.
**apop_ols.predict didn't do the OLS shuffle if the input has no vector; this was anomalous.

October 2011
--apop_text_paste added.
**apop_multinomial and apop_binomial overhauled. No longer accepting Bernoulli draws as input.
--standardization: make docs => make doc

September 2011
--`query turned up a blank table' warning turns up when apop_opts.verbose >=2. (used to be >=1)
--apop_t_test and apop_paired_t_test are quieter---no intermediate results until apop_opts.verbose >=2.
**apop_opts.db_name_column now has a blank default (instead of the SQL-specific and potentially surprise-inducing "rowname").

August 2011
--Bootstrap/Jackknife are better with text
**apop_data_memcpy used to reallocate memory for the text and names elements; use apop_data_copy if you want allocation done for you.

July 2011
--Apop_data_rm_rows now accepts a test function as well as a fixed list of rows to drop.
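For code that still relies on apop_strcmp (removed in October 2012, above), the replacement macro quoted in that entry can be dropped into a local header. The sketch below wraps it in a small function to show its behavior; the wrapper name same_string is hypothetical and used only for illustration. Note that the sense is the opposite of strcmp: the macro is nonzero when the strings match, and it also treats two NULL pointers as a match.

```c
#include <string.h>

/* The macro exactly as given in the October 2012 entry: true when both
   strings are non-NULL and equal, or when both are NULL. */
#define apop_strcmp(a, b) (((a)&&(b) && !strcmp((a), (b))) || (!(a) && !(b)))

/* Hypothetical wrapper, for illustration only: returns 1 on a match and
   0 otherwise. strcmp is never reached with a NULL argument because the
   && operators short-circuit first. */
static int same_string(const char *a, const char *b){
    return apop_strcmp(a, b);
}
```

With this in place, same_string("ols", "ols") and same_string(NULL, NULL) both return 1, while same_string("ols", "gls") and same_string(NULL, "ols") return 0.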
June 2011 --apop_crosstab_to_db handles missing labels and NaNs better. **apop_matrix_to_db removed (as promised a few years ago). Use apop_matrix_print(yourdata, "tabname", .output_type='d'). **F-test defaults now match ANOVA tradition. --documentation script doesn't use GNU extensions to awk; should now be POSIX-standard. May 2011 **apop_map returns a data set with an allocated/filled vector when not called with .inplace='y'. Before, it had been making a full copy, which is idiosyncratic. --apop_rake accepts a weights column. --apop_anova uses variadic arguments for a marginally nicer interface (and better argument checking). **apop_data_to_dummies tries to give nicer labels. You may have to recode things if you relied on the old labels. March 2011 --Header files have been merged. A few long files is as easy to grep as a multitude of nearly one-line files. If you #include instead of the individual headers, then this shouldn't affect you. Due to redundancy, compilation with gcc takes 3% longer. 0.99 February 2011 !!The apop_PMF model has more support: --New supporting functions: apop_data_pmf_compress, apop_model_to_pmf --functions that took apop_histogram models now take apop_pmfs as well: apop_test_chi_squared_goodness_of_fit, apop_test_kolmogorov --Consider the apop_histogram to be deprecated. Only two associated functions were removed; see below. **apop_histogram_plot is removed. Replace with: fprintf(apop_opts.output_pipe, "plot '-' using 1:3 with boxes\n"); apop_model_print(hist); fprintf(apop_opts.output_pipe, "e\n"); **apop_histogram_print was a bad idea to begin with, because it basically replicates gsl_histogram_fprintf. Use apop_model_print(your_histogram), which calls gsl_histogram_fprintf, or call that function directly. The only difference: the GSL function prints [start of bin] [end of bin] [value] and apop_histogram print showed [start of bin] [value] December 2010 **apop_maximum_likelihood no longer calls apop_prep. 
If you want that, use apop_estimate. --apop_text_to_db lets users specify types and keys. --deleted some obsolete/deprecated items: apop_error, apop_multinomial_settings --apop_data_split retains text when splitting by rows; still ignores it when splitting by columns. November 2010 --apop_listwise_delete uses apop_opts.db_nan to check for missing data in the text part of the input data. --apop_data_split handles names September 2010 **Multinomial distribution sets N to be the length of the row (a single observation) rather than the size of the full data set. Added apop_multinomial.parameter_model method for testing purposes. August 2010 ** What was apop_assert => apop_assert_c; what was apop_assert_s => apop_assert. Their arguments are slightly different, and the thing that was asserted no longer prints along with the message you chose. --verbosity defaults to 1. Queries print at verbosity >=2. --apop_data_to_db writes the weights. --Iterative proportional fitting, aka raking. **apop_text_add now frees the contents of cell in the text grid that you are about to overwrite, thus preventing memory leaks without effort from the user. If your existing code has other pointers to the string in that text cell, you'll have to replace the now string-freeing apop_text_add with asprintf(&(your_dataset->text[row][col]), "your string"). July 2010 ** apop_regex now gets all matches when you pull substrings. Each row of the text grid is a match, and if you have multiple substrings, each match's substrings will be along the columns. May require recoding because the substrings used to be along the rows; just switch the indices. June 2010 ** Removed the apop_rank settings group, and thus all the code related to it. It was just the wrong place to do this. Added apop_data_rank_expand to convert rank data to what the various models typically expect. 
This is another step for some users and could be a problem if the counts get into the billions, but it still makes more sense than rewriting every model twice. May 2010 **apop_data_prune_columns_base now takes in a list of strings terminated by a NULL, not by a zero-length string. --apop_data_get_row lets you pull a view from a data set. [this was briefly the apop_data_row] ==>apop_data_set_rows, apop_data_rm_rows ==> apop_data_listwise_delete is fifty lines shorter. --apop_parts_wanted_settings: fixes some infinite loops (est needs parameter models -> p.m. bootstraps for variances -> bootstrap runs estimate -> repeat) and allows just-the-parameters estimation when you want it. --cleaned up build system, including an added RPM spec file April 2010 --apop_t_distribution now has three parameters: mean, std dev, df. That is, it is based on un-normalized data. **apop_random_int and apop_random_double removed for not being particularly useful. March 2010 **The apop_predict special case for when all data is non-missing was a bit too special, and has been eliminated---you now have to specify the first column as NaN yourself. E.g., Apop_col(data, -1, to_nan); gsl_vector_set_all(to_nan, GSL_NAN); This will make things more predictable, and save you if(!has_nans)... else... kind of statements. **Removed the prepared element of the apop_model. **apop_model_prep ==> apop_prep for consistency with other apop_model dispatch functions. apop_model_prep left for now as an alias in deprecated.h !!apop_parameter_model: a method for getting the distribution of a parameter. **Moved OLS-family test stats (pval, qval, whatever) to a page of your_estimate->info. It won't be there for long either. --settings macros let you use lowercase, thus entirely ignoring that they're macros. **apop_settings_rm_group function, which you were probably not using, changed to apop_settings_remove_group; apop_settings_get_group function => apop_settings_get_grp. 
Having a macro and a function that differed by a question of case was a bad idea to begin with. February 2010 !!Overhauling the output from estimations; pardon our dust. --Added CDF method to the apop_model, including apop_cdf dispatch method and default via random draws. **Definitively removed the residual, covariance, and llikelihood elements from the apop_model struct. The first two will be pages appended to the data and parameters, respectively, and the last will be in the Info page appended to the parameters. **Renamed apop_ls_settings (least square) to apop_lm_settings (linear model) "s/apop_ls/apop_lm/g" should work. **Sundry lists of scalars, like the R^2 table and the estimation routine's info table put the data in column zero, not column -1. In the next bullet point you'll see how this simplifies retrieval. **Added an info element to the apop_model--> more shuffling of auxiliary info. --Find results like the log likelihood or AIC via, e.g., apop_data_get(your_model->info, .rowname="log likelihood"); **Find the predicted/residuals via apop_data_get_page(your_model->info, "Predicted"); This means that the input data set is read-only again. --Find the parameter covariances via apop_data_get_page(your_model->parameters, "Covariance"); **apop_estimate_coefficient_of_determination takes in the model again. Just replace est->parameters in your argument with est. apop_ols calls this fn automatically now [apop_data_get(your_model->info, .rowname="R sq")], so you probably aren't even calling it anymore. **apop_data_add_named_elmt now writes to the zeroth element of the matrix, not the vector. So instead of apop_data_get(data, .rowname="R squared", -1), just go with apop_data_get(data, .rowname="R squared"). This affects many of the elements of the info-type matrices. --apop_data_pack, apop_data_unpack, apop_ml_impute, apop_map offer an option to use all pages. 0.23 January 2010 **expected_value element of the model renamed predict; made coherent across models.
!!apop_data set now has a ->more pointer to an additional apop_data set, e.g., for data + covariances or predictions + confidence intervals. --apop_ml_imputation renamed to apop_ml_impute. "#define apop_ml_imputation apop_ml_impute" retains noun-form name, but consider it deprecated. !!apop_estimate now copies, preps, then estimates. Estimate method of apop_model struct can thus assume the copy/prep step has been done; probably should not do these itself. As a side-effect, apop_maximum_likelihood's second argument is now a apop_model* (used to be apop_model). --apop_regex and apop_strcmp, for easier searching through your info pages. --minor rewording of COPYING2. **Because the Predicted table is now part of the parameter set, not the model, apop_estimate_coefficient_of_determination now takes in the parameter set, not the model. Just replace est in your argument with est->parameters. **apop_multinomial_probit folded into apop_probit, where it should've been all along. Regex for the fix: "s/apop_multinomial_probit/apop_probit/g" December 2009 --apop_strcmp --apop_loess model: 3,500 lines of code from the netlib archive, lovingly restored. November 2009 --apop_rm_columns bug fixed by Birger Baksaas. --apop_text_to_db attaches numeric affinity to sqlite3 columns, making numeric comparisons easier. --apop_histogram_model_reset's first argument is now "base" instead of "template", as a concession to C++ users. September 2009 --Many minor changes, mostly regarding adding optional arguments. --Dirichlet model --Output functions now take a consistent set of specs regarding to where they will write. You no longer have to use the global apop_opts settings if you don't want to. August 2009 --apop_map and apop_map_sum. Reworks the apop_(map|apply) system to be more flexible but a little more complex. -apop_(data|matrix|vector)_fill is now more robust---no more int vs float issues. 
**Removed apop_count_cols --Default univariate RNG, if you don't have one: Adaptive rejection markov chain sampling **.use_covar and other such settings now take 'y' or 'n', not 0 or 1. --new macro Apop_settings_set = Apop_settings_add, but makes more human sense --numeric covariance, formerly maligned, now works. July 2009 --multivariate gamma, log-gamma. --t, F, chi^2, Wishart distributions, for description [and Bayesian updating] --apop_matrix_to_positive_semidefinite and apop_matrix_is_positive_semidefinite --bug fixes --Apop_model_add_group replaces Apop_settings_add_group, and is much more easy to work with. June 2009 --More variadicized functions --notably, apop_estimate is much more useful --apop_opts.version. --apop_(vector|matrix|data)_stack have an inplace option, making stacking inside a for loop easier. --apop_test convenience function --more autoconf macros ==> some compilation hacks now done right May 2009 --mysql functions slightly cleaned up --apop_opts.db_user and apop_opts.db_pass for mysql. !!Functions that take lots of basically optional inputs, like apop_text_to_db, now use some designated initializer magic to let the user rearrange or omit inputs. **apop_dot also now allows designated initializers, which breaks (only) calls of the form apop_dot(a_vector, a_matrix, 't'). Replace with apop_dot(a_vector, a_matrix, 0, 't') or apop_dot(a_vector, a_matrix, .form2='t') --With optional inputs, some functions now handle RNGs for the lazy user ==>added apop_opts.rng_seed --apop_vector_distance is much more versatile **Removed apop_matrix_summarize. Too much like apop_data_summarize. Just replace every instance of apop_matrix_summarize(m) with apop_data_summarize(apop_matrix_to_data(m)). April 2009 --sample moments are now mega-accurate---possibly the most unbiased estimators in code today. 
!!Python interface via swig March 2009 --apop_matrix_realloc, apop_vector_realloc --sqlite queries no longer rely on a temp table ==> faster --fixed bugs in apop_table_exists making queries fail in Cygwin. Jan/Feb 2009 --Added more tests; some cleanup in test.c --Binomial distribution looks in both the data set's vector and matrix December 2008 --When writing x=infinity to a db table, I now write 'inf' to the db, instead of breaking. SQLite has no standard here. October 2008 --bug fixes to new apop_data_show --bug fixes to apop_bernoulli.p --apop_update tweaks September 2008 --Documentation overhaul --apop_data_show is much more screen-friendly. Keep using apop_data_print for more machine-readable and less fixed-width output. **apop_plot_histogram now takes in a vector, bin count, and name of output. This is what it did in the first half of the year. The current version of apop_plot_histogram, which acts on a histogram model, is renamed to apop_histogram_plot. August 2008 --Constraints in ML work better. **Overhaul of some discrete choice models --Added tests for the probit and logit. --fixed a bug revealed by the tests **the first choice has a fixed value of zero. **You'll probably need to call Apop_category_settings_add before estimating your model, unless the outcome choice variable is the 0th column of the matrix. --more to come, e.g., multinomial probit will be merged with ordered probit. -Adding a settings group of a given type when that group already exists used to induce an error; now the old type is replaced with a clean default. --bug fix for apop_test_fisher_exact on non-square matrices --apop_settings_add and company do more work in functions and less in macros. --removed the settings_ct element of the apop_model; using a sentinel at the end of the array instead. --Slightly improved reading of text files. --Bootstrap/jackknife act on models with parameters in both matrix and vector form. 
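The May 2009 entry above mentions functions that use designated initializers so the caller can rearrange or omit inputs. What follows is a minimal sketch of that general C99 technique, not Apophenia's own code: the names scale_args, scale_base, and scale are hypothetical and exist only for this illustration. It assumes a compiler that supports C99 compound literals and variadic macros.

```c
/* Hypothetical argument struct: each field is one optional, named input. */
typedef struct {
    double value;   /* the number to scale */
    double factor;  /* treated as 1.0 when omitted (left at zero) */
    double offset;  /* treated as 0.0 when omitted */
} scale_args;

/* The function that does the work. Callers go through the macro below. */
static double scale_base(scale_args in) {
    /* A zero field means "not given", so substitute the default. */
    double factor = in.factor ? in.factor : 1.0;
    return in.value * factor + in.offset;
}

/* Wrap the argument list in a compound literal. Fields may then be given
   positionally, by name in any order, or left out (at least one argument
   is needed, because an empty initializer list is not valid C99). */
#define scale(...) scale_base((scale_args){__VA_ARGS__})
```

With this in place, scale(3.0, 2.0), scale(.factor = 2.0, .value = 3.0), and scale(3.0, .offset = 0.5) are all valid calls. The cost of the design is that an explicit zero cannot be told apart from an omitted field; the apop_varad_var macro in apop.h.in uses the same zero-means-unset test.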
July 2008 --Guts of apop_plot_histogram now use the apop_histogram model. ** Also, it no longer normalizes the histogram to integrate to one by default. You need to explicitly request this via apop_histogram_normalize --apop_plot finally deleted. --apop_histogram_plot deleted; use apop_plot_histogram. --Added apop_vector_skew_sample, apop_vector_kurtosis_sample. May 2008 --apop_settings_rm_group added. --mysql interface has the beginnings of support for multiple semicolon-separated queries in one call. --apop_histogram_refill_with_model ==> apop_histogram_model_reset; apop_histogram_refill_with_vector ==> apop_histogram_vector_reset. April 2008 --apop_dot handles names. --apop_t_test now behaves correctly when one vector is of length 1. --some improvements/fixes when dealing with mySQL. --apop_sv_decomp renamed to apop_matrix_pca. Minor changes so that it correctly works as such. --apop_text_to_(db|data) handles column names like it used to, which works better. Also a few other fixes for odd situations. March 2008 --Various improvements in reading in from text. -- a, "b, c", d will now correctly read in as three elements: a; then "b, c"; then d -- a,,b,c reads as a, NAN, b, c. February 2008 --Some of the header references didn't work for a fresh install. --bug fixes, esp. with apop_test_kolmogorov --added convenience fn apop_data_transpose January 2008 --Apop_assert, which streamlines the use of apop_error (thus shrinking the code base by 2%). --apop_OLS now has a log likelihood (also shuffled some of the code around) --bug fix in apop_binomial.p. **More name reform: apop_correlation_matrix --> apop_matrix_correlation; apop_data_correlation_matrix --> apop_data_correlation; apop_covariance_matrix --> apop_matrix_covariance; apop_data_covariance_matrix --> apop_data_covariance. --apop_count_(rows|cols)_in_text are now static functions and removed from the documentation. --Removed apop_random_beta, which had been set as deprecated earlier. 
Use apop_beta_from_mean_var. **Removed apop_vector_isnan---just use apop_vector_map_sum(your_vector, isnan) **Removed apop_vector_finite---just use apop_vector_bounded(your_vector, INFINITY) --For your convenience, added Apop_settings_alloc() macro. **apop_histogram_params ==> apop_histogram_settings **apop_kernel_density_params ==> apop_kernel_density_settings [The settings/params distinction is in some ways arbitrary anyway.] --bug fix in apop_mle.c: wasn't copying output parameters to the estimated model in some cases. --As part of name reform, all function names are being switched to lower-case throughout, so apop_ANOVA ==> apop_anova. Am keeping the old forms via macros. Notice also the non-yelling macro capitalizations above, such as Apop_assert. !!**Revised the settings for the apop_model. model_settings and method_settings are out, replaced by a much more organized single list of settings. --added the apop_lookup command-line program. **Renamed the apop_multinomial_logit model the apop_logit, because the binary logit is a special case that requires no special handling. **Reversed the signs on the probit coefficients, to better conform to the norm. December 2007 --added apop_vector_moving_average --apop_model now has prep and print methods. **apop_p, apop_log_likelihood, apop_score now take a pointer to an apop_model, not the model itself. --apop_multinomial_probit model --When a data set has matrix and vector, apop_dot accepts a 'v' to use the vector. --apop_plot_lattice produces a more attractive (and standard-form) plot. --apop_data_print works much better now. **apop_model no longer requires your data input to be const. It probably will be const, but it's not the interface's place to dictate that. **apop_data_unpack no longer allocates a new data set, but writes to an input data set assumed to be of the right size. --added apop_ANOVA to produce one- or two-way ANOVA tables from the database. 
**apop_test_ANOVA renamed to apop_test_ANOVA_independence to create a little more cognitive distance. --apop_data_text_to_factors --APOP_COL_T and APOP_ROW_T macros, to pull a column or row by name --apop_beta_from_mean_var produces a beta-distributed model with the right (alpha, beta) parameters. **Thus, apop_random_beta is now marked as deprecated. **apop_x_prime_sigma_x removed, on grounds of being silly. If you want it back, see model/apop_multivariate_normal.c, where it is now a static function. **apop_qq_plot --> apop_plot_qq --Multinomial logit model (and the probit now has names). **Revised the bin-syncing methods. apop_vectors_to_histogram and apop_model_to_histogram are now out; apop_histogram_refill_with_vector and apop_histogram_refill_with_model are in. **Also removed apop_model_test_goodness_of_fit as redundant. Just produce a histogram, use the above refill functions, and send your two histograms to apop_histograms_test_goodness_of_fit. If you do this often, you can write a convenience function to do that as quickly as I could. **apop_vector_replace and apop_matrix_replace are redundant---just use apop_(vector|matrix)_apply. --The covariance matrix is now produced via the derivative of the score function at the parameter. I follow Efron and Hinkley in using the estimated information matrix---the value of the information matrix at the estimated value of the score---not the expected information matrix that is the integral over all possible data. November 2007 --added apop_model_copy_set_string to get a copy of a model whose model_settings is just a string. **Thanks to this, folded all of the _rank versions of models into their base models. Set model_settings to "r" to use the rank version. --The default MLE method is now the Nelder-Mead Simplex algorithm, instead of the Fletcher-Reeves conjugate gradient. This is more conservative. --apop_(vector|matrix|matrix_all)_map_sum to get the sum of a function applied to a vector. 
E.g., find count of NaNs with apop_vector_map_sum(v, isnan); --apop_logit bug fix. October 2007 --apop_estimate now defaults to using MLEs, meaning you don't have to explicitly specify an estimate method for MLE models. --apop_crosstab_to_db reads both the matrix and text elements of the input apop_data set. --apop_system convenience function, to make C feel more like a scripting language. --Added some SQLite functions for mySQL compatibility: var_samp, var_pop, stddev_samp, stddev_pop, std. --Probit patched to not NaN for very unlikely parameter/data combinations. **apop_estimate_restart takes two models, rather than one model and some haphazard settings. --apop_plot_query no longer forces you to use the -d and -q switches to specify the database and query. **The two places that use regular expressions: apop_opts.db_nan and the search for a name via apop_data_get/set... use case-insensitive EREs. Before I'd been using BREs, which nobody likes. -- stops the MLE searches, prints output, and continues the program. Especially useful for simulated annealing. [GDB tip: use the command: signal SIGINT ] --apop_mle_fix_params debugging: gradients work now. --apop_test_ANOVA added, to test the null hypothesis that all cells of a crosstab are equally likely. **apop_multivariate_normal_prob removed. Use the apop_multivariate_normal model and its .log_likelihood, .p, .draw, et cetera. **sed -i -e "s/apop_OLS_params/apop_ls_settings/g" -e "s/apop_mle_params/apop_mle_settings/g" *.c *.h September 2007 --The optimization methods now have an enumerated type. **apop_opts.mle_trace_path is now the trace_path element of the apop_mle_params struct. Also, it works much better. --apop_histogram_normalize function --improvements to apop_kernel_density and apop_histogram_print August 2007 --removed apop_model_template. Just copy one of the existing models. 
--apop_data_ptr_ti, apop_data_ptr_ii, apop_data_ptr_it --And you can never have too many bug fixes July 2007 --apop_binomial model takes two types of input now: a two-column form with hit count and miss count, and a list of binary hits or misses. --apop_lognormal model --bug fixes on Information matrix calculation. June 2007 [subversion ate this part; sorry.] May 2007 --apop_query_to_mixed_data **apop_produce_dummies now makes dummy variables from both data and text. This means that there's another parameter you need to set to 'd' or 't' to indicate what you want dummified. !!**Merged the apop_params and apop_model structures, leaving everything in just one struct. That's about all the merging left. --apop_text_alloc and apop_text_add to make text manipulation a little easier. --apop_matrix_apply_all and apop_matrix_map_all operate on all items in a matrix. **small tweak to apop_vector_normalize interface. --apop_matrix_inverse and apop_matrix_determinant, because the apop_det_and_inv interface is sort of ugly. --APOP_SUBMATRIX macro --MLEs put the expected score in the ->more element of the returned apop_model. If people find this useful, we can maybe put a proper expected_score element in the model. --More consts in function headers. You can decide whether this is actually useful. 0.19 April 2007 !!**Eliminated the apop_estimate and apop_ep structures, replacing them with the apop_params structure. The apop_params + apop_model pair form a closure representing a parametrized model. Expect the uses element to go away soon; after that, things should be stable. Parameters for individual methods now have their own space; try apop_ml_params_alloc and apop_OLS_params_alloc, for example. If you are just doing things like apop_estimate_show(apop_OLS.estimate(data,NULL)); then don't worry, but if you are doing a lot with the input parameters, then have a look at Chapter 6 of _Modeling with Data_. 
--apop_histogram model **Gradually rewriting the histogram functions from before to make use of that model. E.g., eliminated apop_vector_to_cmf. **apop_line_to_data fixed to use both vector and matrix terms. Now requires arguments: (indata, vsize, m1size, m2size). --MLE now approximates the Information matrix using data gathered during the MLE search. This is wrong but cheap; right but expensive procedures forthcoming. [Hint: Simulated annealing gathers more info.] --apop_(data|matrix|vector)_fill functions, which are a touch fragile, but very useful when used with care. **apop_data_(get|set)_(tn|nt) changed to (ti|it), because n could stand for name or number, while i stands for index, and is often used for integers. --apop_names now have a title element, so you can give your data structures a title. **apop_params_alloc takes an apop_model, not an &apop_model. It's more natural that way. **Finally eradicated every last vestige of inventories: apop_params no longer has a .uses element. Instead, apop_specific_model_params may have a want_cov, want_expected_value, want_whatever element if the element is optional. And really, the parameters themselves should never be optional. What was I thinking. **apop_model_fix_params now sets up and returns an apop_mle_params object, thus resolving the problem that the MLE params needs a model input, and the model_fix_params model needed an MLE params input. 0.18 March 2007 **apop_text_to_db now assumes column names unless you specify -nc. --If you set parameter_ct==0 in your model definition, the MLEs will assign parameter_ct == the number of columns in your data set. --Missing data functions: apop_listwise_delete and apop_ml_imputation **The constraint element of the apop_model now takes a void* parameter, like it should have all this time. **apop_jackknife (1) renamed to apop_jackknife_cov and (2) now actually works. **Entirely eliminated the apop_inventory structure. Its sole utility is inside the apop_ep struct.
**Changed the RNG interface for the sake of allowing multidimensional draws. [Not that I have any functions that do that right now.] --Bayes-oriented MCMC algorithm: apop_mcmc_update !!Bayesian model generator: apop_update **apop_model parameters is now an apop_model. See the documentation of the model for all the changes. **apop_data_alloc now takes three arguments: vsize, msize1, msize2. To update, just put a 0, at the head of the arg list. !!An absolutely fabulous apop_linear_constraint function. --Produce a model with some parameters fixed via apop_model_fix_params. --apop_beta model February 2007 **Apop_sv_decomposition has a slightly nicer interface. **data->categories was too much to type, and too specific. The apop_data struct now has a data->text element, and textsize[0] and [1]. The categories element is linked to this, but is now deprecated. January 2007 **Root finding hooked into the max likelihood fns. 0.17 December 2006 **Apop_model struct has lost the fdf object, which was annoying, and now has the p function. --mySQL support. **apop_query_to_chars now returns an apop_data structure, so you don't have to go back and gather column names and dimensions. **apop_name_get (and the apop_get_tt family) now use regular expressions instead of SQL's LIKE operator. This is _much_ faster. --the apop_distribution model. November 2006 --The preeminently useful APOP_COL and APOP_ROW --apop_data_calloc --apop_vector_(apply|map) debugged. **apop_estimation_params is just too darn long; reduced to apop_ep. October 2006 --apop_text_to_db now reads from STDIN. **deleted apop_query_db; use apop_query. --Kolmogorov-Smirnov test. --apop_t_test returns GSL_NAN when given a one-element vector instead of hanging. --no more soft links in the tgz file==>may work better on Windows machines. --apop_(vector|matrix)_(map|apply) will apply a function to every row of a matrix or every element of a vector. The map functions return a gsl_vector. --bug fix in apop_test_goodness_of_fit.
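The October 2006 entry above introduces map and apply functions, and notes that the map forms return a vector. A minimal sketch of that distinction follows, using plain C arrays instead of GSL vectors so that it compiles without any library. The names vec_apply, vec_map, halve, and square are hypothetical, not Apophenia functions, and the assumption that apply works in place through a pointer is this sketch's reading of the entry, not something the entry states.

```c
#include <stdlib.h>

/* apply: hand fn a pointer to each element, so fn may change it in place. */
static void vec_apply(double *v, size_t n, void (*fn)(double *)) {
    for (size_t i = 0; i < n; i++)
        fn(&v[i]);
}

/* map: leave the input untouched and return a newly allocated array of
   results. The caller frees it. Returns NULL if n is 0 or malloc fails. */
static double *vec_map(const double *v, size_t n, double (*fn)(double)) {
    if (n == 0) return NULL;
    double *out = malloc(n * sizeof *out);
    if (!out) return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = fn(v[i]);
    return out;
}

/* Two small example callbacks, one for each calling convention. */
static void halve(double *x)    { *x /= 2.0; }
static double square(double x)  { return x * x; }
```

Calling vec_map(v, n, square) yields a fresh array of squares and leaves v as it was, while vec_apply(v, n, halve) overwrites v itself.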
September 2006 **Removed apop command-line server thing. It was interesting, but that's the best that could be said of it. --Added functions for weighted data: weighted least squares, weighted moments. --apop_vector_percentiles now allows for averaging instead of rounding. August 2006 **apop_log_likelihood and friends now demand that data be apop_data*, rather than void*. Too many things broke when users gave non-apop_data data. --bug fixes July 2006 --Many more checks for NULL ==> more robust code and easier debugging. --Bug fixes. **apop_data_split, and apop_data_stack has been revised to handle the idea that a vector is the -1st element of the matrix. I.e., check your code if you're trying to merge matrices without merging the vectors. --Lattice plots. --Convenience t-tests from inside model estimations are fixed. --apop_query_to_vector. --apop_opts.output_type == 'p' to print to apop_opts.output_pipe **apop_..._print and apop_..._show now work out whether elements are integers (if (val == (int) val)...), and print accordingly. This means that apop_..._print_int and apop_..._show_int are basically obsolete, and have been removed. --Apop_OLS now allows weights. --Test library now includes a few NIST certified tests. June 2006 --added preprocessor cruft to let the library work for C++ --Jackknife revised May 2006 **The apop_model no longer includes an inventory. I leave it to the estimate function to do its own allocation. April 2006 **apop_matrix_normalize and apop_vector_normalize had different numbers for the same normalizations. Was that ever dumb. Also, I've switched to chars instead of ints to signify this stuff, for better mnemonics without resorting to the APOP_ENUM_YOU_HAVE_TO_LOOK_UP_EVERY_TIME_BECAUSE_ITS_SO_LONG sort of thing. If you were using apop_matrix_normalize(data, 0) before, you need to change that to using apop_matrix_normalize(data, 'm'). Thus, apop_vector_normalize now has one more normalization, for a total of four for both. 
--Added apop_rng_alloc convenience fn. --Added apop_strip_dots to keep inputs to the database healthy. --apop_name_find uses LIKE instead of strcmp. --a fn to calculate the generalized harmonic. --A whole section on histograms and goodness-of-fit tests. --apop_data_set fns to go with the apop_data_get fns. --apop_data now includes a vector type --apop_estimate.parameters is now an apop_data type. --apop_estimate.names is thus obsolete. March 2006 **apop_inventory is now a subset of apop_estimation_params. Implications: --added apop_estimation_params_alloc() to ensure that inventory is set right. --the model.estimate(data, inv, params) method is now model.estimate(data, params) model.estimate(data, NULL) still does what the user expects it to. This makes structural sense, but will lightly break any existing code. fix: change apop_inventory *inv = apop_inventory_set(1); model.estimate(data, inv, NULL); to apop_estimation_params *ep = apop_estimation_params_alloc(); model.estimate(data, ep); and in any apop_estimates, change any use of est->uses to est.estimation_params.uses. **Next apop_estimate reform: y_values and residuals combined into one apop_data table with actual, predicted, residual columns. --obviated the need for a 'dependent' element in apop_names; removed that. If you need the name, it's now your_est->dependent->names->colnames[0]. **your_estimate->covariance is now an apop_data set instead of a gsl_matrix. **the data element of the apop_matrix structure is now named matrix. So instead of data_set->data, use data_set->matrix, and instead of estimate->data->data->data, you can use estimate->data->matrix->data. --The command-line utility has been revisited, and can do a few more things, like OLS. --Simulated annealing --added convenience fns apop_vector_distance(pt1,pt2) and apop_vector_grid_distance(pt1,pt2) **Apop_data_memcpy no longer malloc()s for you, for comparability with the world's other memcpy fns. If you want mallocing, use apop_data_copy. 
--apop_test_fisher_exact(). Cut 'n' pasted from R, who cut 'n' pasted it from somebody else. Despite being the same code, it runs fifty (50) times faster from Apophenia. February 2006 --sort-of-adaptive MLE: use apop_estimate_restart to execute a new MLE search beginning where the last one ended, perhaps using a new method or rescaled parameters. --This needed convenience functions to check for divergence, thus added apop_vector_finite, apop_vector_bounded, apop_vector_isnan. **apop_db_to_crosstab now returns an apop_data set instead of a gsl_matrix. Also, it finally works with column headers that aren't numeric. **stats like apop_mean are now apop_vector_mean, following the proper pkg-noun-verb naming scheme. --Textbook is much improved. --apop_vector_to_pdf convenience fn. --Some of the fns that used to be of the form apop_get_something(input, &output); are now of the more natural out = apop_get_something(input); This includes apop_array_to_vector and apop_array_to_matrix --bootstrapping works, and works with apop_models. --apop_poisson model 0.15 January 2006 Added an apop_opts structure for options. Allowed the following changes: --apop_verbose is now apop_opts.verbose. Try this on your existing code: perl -pi.bak -e 's/apop_verbose/apop_opts.verbose/g' *.c *.h --the output functions now output into three formats: on screen, to file, to db; see chapter five of the manual. --F tests --R squared. 0.14 December 2005 The apop_data structure, which is just a shell for a gsl_matrix and an apop_name. Was just sick of sending names following around my tables. Lets us keep both numerical and categorical data in one place; kind of like R's data frame. --Added linear model objects: OLS, GLS. This means that what had been the apop_OLS function is now the apop_estimate_OLS function, and where it used to take in a gsl_matrix and an apop_name, now it takes an apop_data structure and a NULL. So you'll have to modify your code accordingly.
--A function to generate dummy variables, useful in conjunction with --Functions to stack matrices and apop_data sets. Even an apop_partitioned_OLS function, that will only practically work for small data sets. --pow(.,.) in the database. I can't believe I dealt with SQL this long w/o it. 0.13 December 2005 --The apop_model object. This was a big deal that deserves more than just one line; see the manuals. 0.12 mid November 2005 --Bar charts (assuming you've got Gnuplot > 4.1) --percentiles (in case you haven't got it) **redid MLE system so you can pick among the many options now available. As a part of this: --better handling of constraints. --numerical gradients. --numerical Hessians. 0.12 early November 2005, post-hiatus --You now have three maximum likelihood estimators to choose from: the GSL's no-gradient, the GSL's with-gradient, and Mr. WN's autocalculated gradient. --If you haven't seen it before, the apop_distribution structure is increasingly well-supported. It allows the user to specify the features of the Max. Likelihood model in a consistent manner which facilitates things like comparing two models. --I'd still suggest taking the Waring and Yule distributions with care; everything else seems to check out. 0.12 September 2005 **The distributions are now objects, which just provides a neat way of grouping together the half-dozen functions which are associated with any one distribution. 0.11 September 2005 --command-line server is much improved. I actually do work with it. --Documentation is now via doxygen. --asst bug fixes. --Have started to take plotting (via gnuplot) seriously --a limited test suite. Try: make test . 0.10 August 2005 --This version includes a server to park itself in memory and receive data processing requests. The intent is that one can then do analysis from the command line or a Perl/Python/Whatever script.
The client/server works in the sense of handling a handful of requests without segfaulting, but remains in proof-of-concept stage. --Added apop_merge_db for joining databases, both via C and command line --Run t tests from the cmd line or the database. 0.09 July 2005 --Flattened the relatively complex vasprintf subsystem from GNU, so if you've been having trouble compiling on non-GNU systems, try again. Added two little command-line programs. Also, added more little functions which aren't very interesting, like t-tests; maybe you'll stumble upon them. 0.08 May 2005 --OLS/GLS/MLE now properly support the apop_estimate structure --Column names 0.07 April 2005 --uses the apop_estimate structure to return heaps of data from regressions & MLEs --uses the apop_model structure 0.06 --var(x), skew(x), kurtosis(x) added to SQL understood by Apophenia. 0.05 --added a little crosstab utility --queries now accept printf-type arguments. ==>GNU vasprintf was added. ==>updated to work with autoconf 1.7

apophenia-1.0+ds/README

Apophenia is an open statistical library for working with data sets and statistical or simulation models. It provides functions on the same level as those of the typical stats package (such as OLS, probit, or singular value decomposition) but gives the user more flexibility to be creative in model-building. Being in C, it is often an order of magnitude faster when searching for optima or running MCMC chains. The core functions are written in C, but experience has shown them to be easy to bind to Python/Julia/Perl/Ruby/&c.

http://apophenia.info/gentle.html provides an overview of the basics of using the library. If you want to know more about the package, see the web site, http://apophenia.info, or have a look at the textbook from Princeton University Press that coevolved with Apophenia, downloadable from http://modelingwithdata.org .
The quick summary for installation: ∙ The library depends on the GNU Scientific Library and SQLite3. If you are using a system with a package manager of some sort, there is certainly a package for them. Be sure to include both the main package and the lib-, -dev, or -devel package. Sample package manager calls: sudo apt-get install make gcc libgsl0-dev libsqlite3-dev or sudo yum install make gcc gsl-devel libsqlite3x-devel or sudo pacman -S make gcc gsl sqlite ∙ The prebuilt package, which has only basic prerequisites (no Autotools or m4), can be downloaded from another Git branch: #Download the zip file, via wget or your preferred downloading method: wget https://github.com/b-k/Apophenia/archive/pkg.zip #unzip and build unzip pkg.zip cd Apophenia-pkg ./configure make sudo make install Or check out the branch via git: git clone https://github.com/b-k/Apophenia.git cd Apophenia git checkout pkg ./configure make sudo make install ∙ This master branch of the git repository requires Autotools, so it can build the package. Try (apt-get || yum install) autoconf automake libtool. If you have Autotools installed, then from this branch you can run: ./configure cd apophenia-1.0 make sudo make install ∙ Find detailed setup instructions and some troubleshooting notes at http://apophenia.info/setup.html . Thanks for your interest. I do hope that Apophenia helps you learn more from your data. --BK PS: Lawyers, please note that a file named COPYING in the install/ directory describes how this package is licensed under GPLv2. apophenia-1.0+ds/apop.h.in000066400000000000000000003307061262736346100154600ustar00rootroot00000000000000 /** \file */ /* Copyright (c) 2005--2014 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ /* Here are the headers for all of apophenia's functions, typedefs, static variables and macros. All of these begin with the apop_ (or Apop_ or APOP_) prefix. There used to be a series of sub-headers, but they never provided any serious benefit.
Please use your text editor's word-search feature to find any elements you may be looking for. About a third of the file is comments and doxygen documentation, so syntax highlighting that distinguishes code from comments will also help to make this more navigable.*/ /** \defgroup all_public Public functions, structs, and types \addtogroup all_public @{ */ #pragma once #ifdef __cplusplus extern "C" { #endif /** \cond doxy_ignore */ #ifndef _GNU_SOURCE #define _GNU_SOURCE //for asprintf #endif #include #include //raise(SIGTRAP) #include #include #include //////Optional arguments /* A more script-like means of sending arguments to a function. These macros are intended as internal. If you are interested in using this mechanism in out-of-Apophenia work, grep docs/documentation.h for optionaldetails to find notes on how these are used (Doxygen doesn't use that page). */ #define apop_varad_head(type, name) type variadic_##name(variadic_type_##name varad_in) #define apop_varad_declare(type, name, ...) \ typedef struct { \ __VA_ARGS__ ; \ } variadic_type_##name; \ apop_varad_head(type, name); #define apop_varad_var(name, value) name = varad_in.name ? varad_in.name : (value); #define apop_varad_link(name,...) variadic_##name((variadic_type_##name) {__VA_ARGS__}) /** \endcond */ //End of Doxygen ignore. //////The types and functions that act on them /** This structure holds the names of the components of the \ref apop_data set. You may never have to worry about it directly, because most operations on \ref apop_data sets will take care of the names for you. */ typedef struct{ char *title; char * vector; char ** col; char ** row; char ** text; int colct, rowct, textct; } apop_name; /** The \ref apop_data structure represents a data set.
See \ref dataoverview.*/ typedef struct apop_data{ gsl_vector *vector; gsl_matrix *matrix; apop_name *names; char ***text; size_t textsize[2]; gsl_vector *weights; struct apop_data *more; char error; } apop_data; /* Settings groups. For internal use only; see apop_settings.c and settings.h for related machinery. */ typedef struct { char name[101]; unsigned long name_hash; void *setting_group; void *copy; void *free; } apop_settings_type; /** A statistical model. See \ref modelsec for details. */ typedef struct apop_model apop_model; /** The elements of the \ref apop_model type, representing a statistical model. See \ref modelsec and \ref modeldetails for use and details. */ struct apop_model{ char name[101]; int vsize, msize1, msize2, dsize; apop_data *data; apop_data *parameters; apop_data *info; void (*estimate)(apop_data * data, apop_model *params); long double (*p)(apop_data *d, apop_model *params); long double (*log_likelihood)(apop_data *d, apop_model *params); long double (*cdf)(apop_data *d, apop_model *params); long double (*constraint)(apop_data *data, apop_model *params); int (*draw)(double *out, gsl_rng* r, apop_model *params); void (*prep)(apop_data *data, apop_model *params); apop_settings_type *settings; void *more; size_t more_size; char error; }; /** The global options. */ typedef struct{ int verbose; /**< Set this to zero for silent mode, one for errors and warnings. default = 0. */ char stop_on_warning; /**< See \ref debugging . */ char output_delimiter[100]; /**< The separator between elements of output tables. The default is "\t", but for LaTeX, use "&\t", or use "|" to get pipe-delimited output. */ char input_delimiters[100]; /**< Deprecated. Please use per-function inputs to \ref apop_text_to_db and \ref apop_text_to_data. Default = "|,\t" */ char *db_name_column; /**< If not NULL or "", the name of the column in your tables that holds row names.*/ char *nan_string; /**< The string used to indicate NaN. Default: "NaN. 
Comparisons are case-insensitive.*/ char db_engine; /**< If this is 'm', use mySQL, else use SQLite. */ char db_user[101]; /**< Username for database login. Max 100 chars. */ char db_pass[101]; /**< Password for database login. Max 100 chars. */ FILE *log_file; /**< The file handle for the log. Defaults to \c stderr, but change it with, e.g., apop_opts.log_file = fopen("outlog", "w"); */ #define Autoconf_no_atomics @Autoconf_no_atomics@ #if __STDC_VERSION__ > 201100L && !defined(__STDC_NO_ATOMICS__) && Autoconf_no_atomics==0 _Atomic(int) rng_seed; #else int rng_seed; #endif float version; } apop_opts_type; extern apop_opts_type apop_opts; apop_name * apop_name_alloc(void); int apop_name_add(apop_name * n, char const *add_me, char type); void apop_name_free(apop_name * free_me); void apop_name_print(apop_name * n); #ifdef APOP_NO_VARIADIC void apop_name_stack(apop_name * n1, apop_name *nadd, char type1, char typeadd) ; #else void apop_name_stack_base(apop_name * n1, apop_name *nadd, char type1, char typeadd) ; apop_varad_declare(void, apop_name_stack, apop_name * n1; apop_name *nadd; char type1; char typeadd); #define apop_name_stack(...) apop_varad_link(apop_name_stack, __VA_ARGS__) #endif apop_name * apop_name_copy(apop_name *in); int apop_name_find(const apop_name *n, const char *findme, const char type); void apop_data_add_names_base(apop_data *d, const char type, char const ** names); /** Add a list of names to a data set. \li Use this with a list of names that you type in yourself, like \code apop_data_add_names(mydata, 'c', "age", "sex", "height"); \endcode Notice the lack of curly braces around the list. \li You may have an array of names, probably autogenerated, that you would like to add. 
In this case, make certain that the last element of the array is \c NULL, and call the base function: \code char const *colnames[] = {"age", "sex", "height", NULL}; apop_data_add_names_base(mydata, 'c', colnames); \endcode But if you forget the \c NULL marker, this has good odds of segfaulting. You may prefer to use a \c for loop that inserts each name in turn using \ref apop_name_add. \see \ref apop_name_add, although \ref apop_data_add_names will be more useful in most cases. */ #define apop_data_add_names(dataset, type, ...) apop_data_add_names_base((dataset), (type), (char const*[]) {__VA_ARGS__, NULL}) /** Free an \ref apop_data structure. \li As with \c free(), it is safe to send in a \c NULL pointer (in which case the function does nothing). \li If the \c more pointer is not \c NULL, I will free the pointed-to data set first. If you don't want to free data sets down the chain, set more=NULL before calling this. \li This is actually a macro (that calls \ref apop_data_free_base). It sets \c freeme to \c NULL when it's done, because there's nothing safe you can do with the freed location, and you can later safely test conditions like if (data) .... */ #define apop_data_free(freeme) (apop_data_free_base(freeme) ? 0 : ((freeme)= NULL)) char apop_data_free_base(apop_data *freeme); #ifdef APOP_NO_VARIADIC apop_data * apop_data_alloc(const size_t size1, const size_t size2, const int size3) ; #else apop_data * apop_data_alloc_base(const size_t size1, const size_t size2, const int size3) ; apop_varad_declare(apop_data *, apop_data_alloc, const size_t size1; const size_t size2; const int size3); #define apop_data_alloc(...)
apop_varad_link(apop_data_alloc, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC apop_data * apop_data_calloc(const size_t size1, const size_t size2, const int size3) ; #else apop_data * apop_data_calloc_base(const size_t size1, const size_t size2, const int size3) ; apop_varad_declare(apop_data *, apop_data_calloc, const size_t size1; const size_t size2; const int size3); #define apop_data_calloc(...) apop_varad_link(apop_data_calloc, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC apop_data * apop_data_stack(apop_data *m1, apop_data * m2, char posn, char inplace) ; #else apop_data * apop_data_stack_base(apop_data *m1, apop_data * m2, char posn, char inplace) ; apop_varad_declare(apop_data *, apop_data_stack, apop_data *m1; apop_data * m2; char posn; char inplace); #define apop_data_stack(...) apop_varad_link(apop_data_stack, __VA_ARGS__) #endif apop_data ** apop_data_split(apop_data *in, int splitpoint, char r_or_c); apop_data * apop_data_copy(const apop_data *in); void apop_data_rm_columns(apop_data *d, int *drop); void apop_data_memcpy(apop_data *out, const apop_data *in); #ifdef APOP_NO_VARIADIC double * apop_data_ptr(apop_data *data, int row, int col, const char *rowname, const char *colname, const char *page) ; #else double * apop_data_ptr_base(apop_data *data, int row, int col, const char *rowname, const char *colname, const char *page) ; apop_varad_declare(double *, apop_data_ptr, apop_data *data; int row; int col; const char *rowname; const char *colname; const char *page); #define apop_data_ptr(...) 
apop_varad_link(apop_data_ptr, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_data_get(const apop_data *data, size_t row, int col, const char *rowname, const char *colname, const char *page) ; #else double apop_data_get_base(const apop_data *data, size_t row, int col, const char *rowname, const char *colname, const char *page) ; apop_varad_declare(double, apop_data_get, const apop_data *data; size_t row; int col; const char *rowname; const char *colname; const char *page); #define apop_data_get(...) apop_varad_link(apop_data_get, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC int apop_data_set(apop_data *data, size_t row, int col, const double val, const char *rowname, const char * colname, const char *page) ; #else int apop_data_set_base(apop_data *data, size_t row, int col, const double val, const char *rowname, const char * colname, const char *page) ; apop_varad_declare(int, apop_data_set, apop_data *data; size_t row; int col; const double val; const char *rowname; const char * colname; const char *page); #define apop_data_set(...) apop_varad_link(apop_data_set, __VA_ARGS__) #endif void apop_data_add_named_elmt(apop_data *d, char *name, double val); int apop_text_set(apop_data *in, const size_t row, const size_t col, const char *fmt, ...); apop_data * apop_text_alloc(apop_data *in, const size_t row, const size_t col); void apop_text_free(char ***freeme, int rows, int cols); #ifdef APOP_NO_VARIADIC apop_data * apop_data_transpose(apop_data *in, char transpose_text, char inplace) ; #else apop_data * apop_data_transpose_base(apop_data *in, char transpose_text, char inplace) ; apop_varad_declare(apop_data *, apop_data_transpose, apop_data *in; char transpose_text; char inplace); #define apop_data_transpose(...) apop_varad_link(apop_data_transpose, __VA_ARGS__) #endif gsl_matrix * apop_matrix_realloc(gsl_matrix *m, size_t newheight, size_t newwidth); gsl_vector * apop_vector_realloc(gsl_vector *v, size_t newheight); #define apop_data_prune_columns(in, ...) 
apop_data_prune_columns_base((in), (char *[]) {__VA_ARGS__, NULL}) apop_data* apop_data_prune_columns_base(apop_data *d, char **colnames); #ifdef APOP_NO_VARIADIC apop_data * apop_data_get_page(const apop_data * data, const char * title, const char match) ; #else apop_data * apop_data_get_page_base(const apop_data * data, const char * title, const char match) ; apop_varad_declare(apop_data *, apop_data_get_page, const apop_data * data; const char * title; const char match); #define apop_data_get_page(...) apop_varad_link(apop_data_get_page, __VA_ARGS__) #endif apop_data * apop_data_add_page(apop_data * dataset, apop_data *newpage,const char *title); #ifdef APOP_NO_VARIADIC apop_data* apop_data_rm_page(apop_data * data, const char *title, const char free_p) ; #else apop_data* apop_data_rm_page_base(apop_data * data, const char *title, const char free_p) ; apop_varad_declare(apop_data*, apop_data_rm_page, apop_data * data; const char *title; const char free_p); #define apop_data_rm_page(...) apop_varad_link(apop_data_rm_page, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC apop_data * apop_data_rm_rows(apop_data *in, int *drop, int (*do_drop)(apop_data* , void*), void* drop_parameter) ; #else apop_data * apop_data_rm_rows_base(apop_data *in, int *drop, int (*do_drop)(apop_data* , void*), void* drop_parameter) ; apop_varad_declare(apop_data *, apop_data_rm_rows, apop_data *in; int *drop; int (*do_drop)(apop_data* , void*); void* drop_parameter); #define apop_data_rm_rows(...) apop_varad_link(apop_data_rm_rows, __VA_ARGS__) #endif //in apop_asst.c: #ifdef APOP_NO_VARIADIC apop_data * apop_model_draws(apop_model *model, int count, apop_data *draws) ; #else apop_data * apop_model_draws_base(apop_model *model, int count, apop_data *draws) ; apop_varad_declare(apop_data *, apop_model_draws, apop_model *model; int count; apop_data *draws); #define apop_model_draws(...) 
apop_varad_link(apop_model_draws, __VA_ARGS__) #endif /* Convenience functions to convert among vectors (gsl_vector), matrices (gsl_matrix), arrays (double **), and database tables */ //From vector gsl_vector *apop_vector_copy(const gsl_vector *in); #ifdef APOP_NO_VARIADIC gsl_matrix * apop_vector_to_matrix(const gsl_vector *in, char row_col) ; #else gsl_matrix * apop_vector_to_matrix_base(const gsl_vector *in, char row_col) ; apop_varad_declare(gsl_matrix *, apop_vector_to_matrix, const gsl_vector *in; char row_col); #define apop_vector_to_matrix(...) apop_varad_link(apop_vector_to_matrix, __VA_ARGS__) #endif //From matrix gsl_matrix *apop_matrix_copy(const gsl_matrix *in); #ifdef APOP_NO_VARIADIC apop_data *apop_db_to_crosstab(char const*tabname, char const*row, char const*col, char const*data, char is_aggregate) ; #else apop_data * apop_db_to_crosstab_base(char const*tabname, char const*row, char const*col, char const*data, char is_aggregate) ; apop_varad_declare(apop_data *, apop_db_to_crosstab, char const*tabname; char const*row; char const*col; char const*data; char is_aggregate); #define apop_db_to_crosstab(...) apop_varad_link(apop_db_to_crosstab, __VA_ARGS__) #endif //From array #ifdef APOP_NO_VARIADIC gsl_vector * apop_array_to_vector(double *in, int size) ; #else gsl_vector * apop_array_to_vector_base(double *in, int size) ; apop_varad_declare(gsl_vector *, apop_array_to_vector, double *in; int size); #define apop_array_to_vector(...) 
apop_varad_link(apop_array_to_vector, __VA_ARGS__) #endif /** \cond doxy_ignore */ //Deprecated #define apop_text_add apop_text_set #define apop_line_to_vector apop_array_to_vector /** \endcond */ //From text #ifdef APOP_NO_VARIADIC apop_data * apop_text_to_data(char const *text_file, int has_row_names, int has_col_names, int const *field_ends, char const *delimiters) ; #else apop_data * apop_text_to_data_base(char const *text_file, int has_row_names, int has_col_names, int const *field_ends, char const *delimiters) ; apop_varad_declare(apop_data *, apop_text_to_data, char const *text_file; int has_row_names; int has_col_names; int const *field_ends; char const *delimiters); #define apop_text_to_data(...) apop_varad_link(apop_text_to_data, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC int apop_text_to_db(char const *text_file, char *tabname, int has_row_names, int has_col_names, char **field_names, int const *field_ends, apop_data *field_params, char *table_params, char const *delimiters, char if_table_exists) ; #else int apop_text_to_db_base(char const *text_file, char *tabname, int has_row_names, int has_col_names, char **field_names, int const *field_ends, apop_data *field_params, char *table_params, char const *delimiters, char if_table_exists) ; apop_varad_declare(int, apop_text_to_db, char const *text_file; char *tabname; int has_row_names; int has_col_names; char **field_names; int const *field_ends; apop_data *field_params; char *table_params; char const *delimiters; char if_table_exists); #define apop_text_to_db(...) apop_varad_link(apop_text_to_db, __VA_ARGS__) #endif //rank data apop_data *apop_data_rank_expand (apop_data *in); #ifdef APOP_NO_VARIADIC apop_data *apop_data_rank_compress (apop_data *in, int min_bins) ; #else apop_data * apop_data_rank_compress_base(apop_data *in, int min_bins) ; apop_varad_declare(apop_data *, apop_data_rank_compress, apop_data *in; int min_bins); #define apop_data_rank_compress(...) 
apop_varad_link(apop_data_rank_compress, __VA_ARGS__) #endif //From crosstabs void apop_crosstab_to_db(apop_data *in, char *tabname, char *row_col_name, char *col_col_name, char *data_col_name); //packing data into a vector #ifdef APOP_NO_VARIADIC gsl_vector * apop_data_pack(const apop_data *in, gsl_vector *out, char more_pages, char use_info_pages) ; #else gsl_vector * apop_data_pack_base(const apop_data *in, gsl_vector *out, char more_pages, char use_info_pages) ; apop_varad_declare(gsl_vector *, apop_data_pack, const apop_data *in; gsl_vector *out; char more_pages; char use_info_pages); #define apop_data_pack(...) apop_varad_link(apop_data_pack, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC void apop_data_unpack(const gsl_vector *in, apop_data *d, char use_info_pages) ; #else void apop_data_unpack_base(const gsl_vector *in, apop_data *d, char use_info_pages) ; apop_varad_declare(void, apop_data_unpack, const gsl_vector *in; apop_data *d; char use_info_pages); #define apop_data_unpack(...) apop_varad_link(apop_data_unpack, __VA_ARGS__) #endif #define apop_vector_fill(avfin, ...) apop_vector_fill_base((avfin), (double []) {__VA_ARGS__}) #define apop_data_fill(adfin, ...) apop_data_fill_base((adfin), (double []) {__VA_ARGS__}) #define apop_text_fill(dataset, ...) apop_text_fill_base((dataset), (char* []) {__VA_ARGS__, NULL}) #define apop_data_falloc(sizes, ...) 
apop_data_fill(apop_data_alloc sizes, __VA_ARGS__) apop_data *apop_data_fill_base(apop_data *in, double []); gsl_vector *apop_vector_fill_base(gsl_vector *in, double []); apop_data *apop_text_fill_base(apop_data *data, char* text[]); //// Models and model support functions extern apop_model *apop_beta; extern apop_model *apop_bernoulli; extern apop_model *apop_binomial; extern apop_model *apop_chi_squared; extern apop_model *apop_dirichlet; extern apop_model *apop_exponential; extern apop_model *apop_f_distribution; extern apop_model *apop_gamma; extern apop_model *apop_improper_uniform; extern apop_model *apop_iv; extern apop_model *apop_kernel_density; extern apop_model *apop_loess; extern apop_model *apop_logit; extern apop_model *apop_lognormal; extern apop_model *apop_multinomial; extern apop_model *apop_multivariate_normal; extern apop_model *apop_normal; extern apop_model *apop_ols; extern apop_model *apop_pmf; extern apop_model *apop_poisson; extern apop_model *apop_probit; extern apop_model *apop_t_distribution; extern apop_model *apop_uniform; //extern apop_model *apop_wishart; extern apop_model *apop_wls; extern apop_model *apop_yule; extern apop_model *apop_zipf; //model transformations extern apop_model *apop_coordinate_transform; extern apop_model *apop_composition; extern apop_model *apop_dconstrain; extern apop_model *apop_mixture; extern apop_model *apop_cross; /** Alias for the \ref apop_normal distribution, qv. */ #define apop_gaussian apop_normal #define apop_OLS apop_ols #define apop_PMF apop_pmf #define apop_F_distribution apop_f_distribution #define apop_IV apop_iv void apop_model_free (apop_model * free_me); #ifdef APOP_NO_VARIADIC void apop_model_print (apop_model * model, FILE *output_pipe) ; #else void apop_model_print_base(apop_model * model, FILE *output_pipe) ; apop_varad_declare(void, apop_model_print, apop_model * model; FILE *output_pipe); #define apop_model_print(...) 
apop_varad_link(apop_model_print, __VA_ARGS__) #endif void apop_model_show (apop_model * print_me); //deprecated apop_model * apop_model_copy(apop_model *in); //in apop_model.c apop_model * apop_model_clear(apop_data * data, apop_model *model); apop_model * apop_estimate(apop_data *d, apop_model *m); void apop_score(apop_data *d, gsl_vector *out, apop_model *m); double apop_log_likelihood(apop_data *d, apop_model *m); double apop_p(apop_data *d, apop_model *m); double apop_cdf(apop_data *d, apop_model *m); int apop_draw(double *out, gsl_rng *r, apop_model *m); void apop_prep(apop_data *d, apop_model *m); apop_model *apop_parameter_model(apop_data *d, apop_model *m); apop_data * apop_predict(apop_data *d, apop_model *m); apop_model *apop_beta_from_mean_var(double m, double v); //in apop_beta.c #define apop_model_set_parameters(in, ...) apop_model_set_parameters_base((in), (double []) {__VA_ARGS__}) apop_model *apop_model_set_parameters_base(apop_model *in, double ap[]); //apop_mixture.c /** Produce a model as a linear combination of other models. See the documentation for the \ref apop_mixture model. \param ... A list of models, either all parameterized or all unparameterized. See examples in the \ref apop_mixture documentation. */ #define apop_model_mixture(...) apop_model_mixture_base((apop_model *[]){__VA_ARGS__, NULL}) apop_model *apop_model_mixture_base(apop_model **inlist); //transform/apop_cross.c. apop_model *apop_model_cross_base(apop_model *mlist[]); #define apop_model_cross(...) 
apop_model_cross_base((apop_model *[]){__VA_ARGS__, NULL}) ////More functions //The variadic versions, with lots of options to input extra parameters to the //function being mapped/applied #ifdef APOP_NO_VARIADIC apop_data * apop_map(apop_data *in, double (*fn_d)(double), double (*fn_v)(gsl_vector*), double (*fn_r)(apop_data *), double (*fn_dp)(double, void *), double (*fn_vp)(gsl_vector*, void *), double (*fn_rp)(apop_data *, void *), double (*fn_dpi)(double, void *, int), double (*fn_vpi)(gsl_vector*, void *, int), double (*fn_rpi)(apop_data*, void *, int), double (*fn_di)(double, int), double (*fn_vi)(gsl_vector*, int), double (*fn_ri)(apop_data*, int), void *param, int inplace, char part, int all_pages) ; #else apop_data * apop_map_base(apop_data *in, double (*fn_d)(double), double (*fn_v)(gsl_vector*), double (*fn_r)(apop_data *), double (*fn_dp)(double, void *), double (*fn_vp)(gsl_vector*, void *), double (*fn_rp)(apop_data *, void *), double (*fn_dpi)(double, void *, int), double (*fn_vpi)(gsl_vector*, void *, int), double (*fn_rpi)(apop_data*, void *, int), double (*fn_di)(double, int), double (*fn_vi)(gsl_vector*, int), double (*fn_ri)(apop_data*, int), void *param, int inplace, char part, int all_pages) ; apop_varad_declare(apop_data *, apop_map, apop_data *in; double (*fn_d)(double); double (*fn_v)(gsl_vector*); double (*fn_r)(apop_data *); double (*fn_dp)(double, void *); double (*fn_vp)(gsl_vector*, void *); double (*fn_rp)(apop_data *, void *); double (*fn_dpi)(double, void *, int); double (*fn_vpi)(gsl_vector*, void *, int); double (*fn_rpi)(apop_data*, void *, int); double (*fn_di)(double, int); double (*fn_vi)(gsl_vector*, int); double (*fn_ri)(apop_data*, int); void *param; int inplace; char part; int all_pages); #define apop_map(...) 
apop_varad_link(apop_map, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_map_sum(apop_data *in, double (*fn_d)(double), double (*fn_v)(gsl_vector*), double (*fn_r)(apop_data *), double (*fn_dp)(double, void *), double (*fn_vp)(gsl_vector*, void *), double (*fn_rp)(apop_data *, void *), double (*fn_dpi)(double, void *, int), double (*fn_vpi)(gsl_vector*, void *, int), double (*fn_rpi)(apop_data*, void *, int), double (*fn_di)(double, int), double (*fn_vi)(gsl_vector*, int), double (*fn_ri)(apop_data*, int), void *param, char part, int all_pages) ; #else double apop_map_sum_base(apop_data *in, double (*fn_d)(double), double (*fn_v)(gsl_vector*), double (*fn_r)(apop_data *), double (*fn_dp)(double, void *), double (*fn_vp)(gsl_vector*, void *), double (*fn_rp)(apop_data *, void *), double (*fn_dpi)(double, void *, int), double (*fn_vpi)(gsl_vector*, void *, int), double (*fn_rpi)(apop_data*, void *, int), double (*fn_di)(double, int), double (*fn_vi)(gsl_vector*, int), double (*fn_ri)(apop_data*, int), void *param, char part, int all_pages) ; apop_varad_declare(double, apop_map_sum, apop_data *in; double (*fn_d)(double); double (*fn_v)(gsl_vector*); double (*fn_r)(apop_data *); double (*fn_dp)(double, void *); double (*fn_vp)(gsl_vector*, void *); double (*fn_rp)(apop_data *, void *); double (*fn_dpi)(double, void *, int); double (*fn_vpi)(gsl_vector*, void *, int); double (*fn_rpi)(apop_data*, void *, int); double (*fn_di)(double, int); double (*fn_vi)(gsl_vector*, int); double (*fn_ri)(apop_data*, int); void *param; char part; int all_pages); #define apop_map_sum(...) apop_varad_link(apop_map_sum, __VA_ARGS__) #endif //the specific-to-a-type versions, quicker and easier when appropriate. 
gsl_vector *apop_matrix_map(const gsl_matrix *m, double (*fn)(gsl_vector*)); gsl_vector *apop_vector_map(const gsl_vector *v, double (*fn)(double)); void apop_matrix_apply(gsl_matrix *m, void (*fn)(gsl_vector*)); void apop_vector_apply(gsl_vector *v, void (*fn)(double*)); gsl_matrix * apop_matrix_map_all(const gsl_matrix *in, double (*fn)(double)); void apop_matrix_apply_all(gsl_matrix *in, void (*fn)(double *)); double apop_vector_map_sum(const gsl_vector *in, double(*fn)(double)); double apop_matrix_map_sum(const gsl_matrix *in, double (*fn)(gsl_vector*)); double apop_matrix_map_all_sum(const gsl_matrix *in, double (*fn)(double)); // Some output routines #ifdef APOP_NO_VARIADIC void apop_matrix_print(const gsl_matrix *data, char const *output_name, FILE *output_pipe, char output_type, char output_append) ; #else void apop_matrix_print_base(const gsl_matrix *data, char const *output_name, FILE *output_pipe, char output_type, char output_append) ; apop_varad_declare(void, apop_matrix_print, const gsl_matrix *data; char const *output_name; FILE *output_pipe; char output_type; char output_append); #define apop_matrix_print(...) apop_varad_link(apop_matrix_print, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC void apop_vector_print(gsl_vector *data, char const *output_name, FILE *output_pipe, char output_type, char output_append) ; #else void apop_vector_print_base(gsl_vector *data, char const *output_name, FILE *output_pipe, char output_type, char output_append) ; apop_varad_declare(void, apop_vector_print, gsl_vector *data; char const *output_name; FILE *output_pipe; char output_type; char output_append); #define apop_vector_print(...) 
apop_varad_link(apop_vector_print, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC void apop_data_print(const apop_data *data, char const *output_name, FILE *output_pipe, char output_type, char output_append) ; #else void apop_data_print_base(const apop_data *data, char const *output_name, FILE *output_pipe, char output_type, char output_append) ; apop_varad_declare(void, apop_data_print, const apop_data *data; char const *output_name; FILE *output_pipe; char output_type; char output_append); #define apop_data_print(...) apop_varad_link(apop_data_print, __VA_ARGS__) #endif void apop_matrix_show(const gsl_matrix *data); void apop_vector_show(const gsl_vector *data); void apop_data_show(const apop_data *data); //statistics #ifdef APOP_NO_VARIADIC double apop_vector_mean(gsl_vector const *v, gsl_vector const *weights); #else double apop_vector_mean_base(gsl_vector const *v, gsl_vector const *weights); apop_varad_declare(double, apop_vector_mean, gsl_vector const *v; gsl_vector const *weights); #define apop_vector_mean(...) apop_varad_link(apop_vector_mean, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_vector_var(gsl_vector const *v, gsl_vector const *weights); #else double apop_vector_var_base(gsl_vector const *v, gsl_vector const *weights); apop_varad_declare(double, apop_vector_var, gsl_vector const *v; gsl_vector const *weights); #define apop_vector_var(...) apop_varad_link(apop_vector_var, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_vector_skew_pop(gsl_vector const *v, gsl_vector const *weights); #else double apop_vector_skew_pop_base(gsl_vector const *v, gsl_vector const *weights); apop_varad_declare(double, apop_vector_skew_pop, gsl_vector const *v; gsl_vector const *weights); #define apop_vector_skew_pop(...) 
apop_varad_link(apop_vector_skew_pop, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_vector_kurtosis_pop(gsl_vector const *v, gsl_vector const *weights); #else double apop_vector_kurtosis_pop_base(gsl_vector const *v, gsl_vector const *weights); apop_varad_declare(double, apop_vector_kurtosis_pop, gsl_vector const *v; gsl_vector const *weights); #define apop_vector_kurtosis_pop(...) apop_varad_link(apop_vector_kurtosis_pop, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_vector_cov(gsl_vector const *v1, gsl_vector const *v2, gsl_vector const *weights); #else double apop_vector_cov_base(gsl_vector const *v1, gsl_vector const *v2, gsl_vector const *weights); apop_varad_declare(double, apop_vector_cov, gsl_vector const *v1; gsl_vector const *v2; gsl_vector const *weights); #define apop_vector_cov(...) apop_varad_link(apop_vector_cov, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_vector_distance(const gsl_vector *ina, const gsl_vector *inb, const char metric, const double norm) ; #else double apop_vector_distance_base(const gsl_vector *ina, const gsl_vector *inb, const char metric, const double norm) ; apop_varad_declare(double, apop_vector_distance, const gsl_vector *ina; const gsl_vector *inb; const char metric; const double norm); #define apop_vector_distance(...) apop_varad_link(apop_vector_distance, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC void apop_vector_normalize(gsl_vector *in, gsl_vector **out, const char normalization_type) ; #else void apop_vector_normalize_base(gsl_vector *in, gsl_vector **out, const char normalization_type) ; apop_varad_declare(void, apop_vector_normalize, gsl_vector *in; gsl_vector **out; const char normalization_type); #define apop_vector_normalize(...) 
apop_varad_link(apop_vector_normalize, __VA_ARGS__) #endif apop_data * apop_data_covariance(const apop_data *in); apop_data * apop_data_correlation(const apop_data *in); long double apop_vector_entropy(gsl_vector *in); long double apop_matrix_sum(const gsl_matrix *m); double apop_matrix_mean(const gsl_matrix *data); void apop_matrix_mean_and_var(const gsl_matrix *data, double *mean, double *var); apop_data * apop_data_summarize(apop_data *data); #ifdef APOP_NO_VARIADIC double * apop_vector_percentiles(gsl_vector *data, char rounding) ; #else double * apop_vector_percentiles_base(gsl_vector *data, char rounding) ; apop_varad_declare(double *, apop_vector_percentiles, gsl_vector *data; char rounding); #define apop_vector_percentiles(...) apop_varad_link(apop_vector_percentiles, __VA_ARGS__) #endif apop_data *apop_test_fisher_exact(apop_data *intab); //in apop_fisher.c //from apop_t_f_chi.c: #ifdef APOP_NO_VARIADIC int apop_matrix_is_positive_semidefinite(gsl_matrix *m, char semi) ; #else int apop_matrix_is_positive_semidefinite_base(gsl_matrix *m, char semi) ; apop_varad_declare(int, apop_matrix_is_positive_semidefinite, gsl_matrix *m; char semi); #define apop_matrix_is_positive_semidefinite(...) apop_varad_link(apop_matrix_is_positive_semidefinite, __VA_ARGS__) #endif double apop_matrix_to_positive_semidefinite(gsl_matrix *m); long double apop_multivariate_gamma(double a, int p); long double apop_multivariate_lngamma(double a, int p); //apop_tests.c apop_data * apop_t_test(gsl_vector *a, gsl_vector *b); apop_data * apop_paired_t_test(gsl_vector *a, gsl_vector *b); #ifdef APOP_NO_VARIADIC apop_data* apop_anova(char *table, char *data, char *grouping1, char *grouping2) ; #else apop_data* apop_anova_base(char *table, char *data, char *grouping1, char *grouping2) ; apop_varad_declare(apop_data*, apop_anova, char *table; char *data; char *grouping1; char *grouping2); #define apop_anova(...) 
apop_varad_link(apop_anova, __VA_ARGS__) #endif #define apop_ANOVA apop_anova #ifdef APOP_NO_VARIADIC apop_data * apop_f_test (apop_model *est, apop_data *contrast) ; #else apop_data * apop_f_test_base(apop_model *est, apop_data *contrast) ; apop_varad_declare(apop_data *, apop_f_test, apop_model *est; apop_data *contrast); #define apop_f_test(...) apop_varad_link(apop_f_test, __VA_ARGS__) #endif #define apop_F_test apop_f_test //from the regression code: #define apop_estimate_r_squared(in) apop_estimate_coefficient_of_determination(in) apop_data * apop_text_unique_elements(const apop_data *d, size_t col); gsl_vector * apop_vector_unique_elements(const gsl_vector *v); #ifdef APOP_NO_VARIADIC apop_data * apop_data_to_factors(apop_data *data, char intype, int incol, int outcol) ; #else apop_data * apop_data_to_factors_base(apop_data *data, char intype, int incol, int outcol) ; apop_varad_declare(apop_data *, apop_data_to_factors, apop_data *data; char intype; int incol; int outcol); #define apop_data_to_factors(...) apop_varad_link(apop_data_to_factors, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC apop_data * apop_data_get_factor_names(apop_data *data, int col, char type) ; #else apop_data * apop_data_get_factor_names_base(apop_data *data, int col, char type) ; apop_varad_declare(apop_data *, apop_data_get_factor_names, apop_data *data; int col; char type); #define apop_data_get_factor_names(...) apop_varad_link(apop_data_get_factor_names, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC apop_data * apop_data_to_dummies(apop_data *d, int col, char type, int keep_first, char append, char remove) ; #else apop_data * apop_data_to_dummies_base(apop_data *d, int col, char type, int keep_first, char append, char remove) ; apop_varad_declare(apop_data *, apop_data_to_dummies, apop_data *d; int col; char type; int keep_first; char append; char remove); #define apop_data_to_dummies(...) 
apop_varad_link(apop_data_to_dummies, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC long double apop_model_entropy(apop_model *in, int draws) ; #else long double apop_model_entropy_base(apop_model *in, int draws) ; apop_varad_declare(long double, apop_model_entropy, apop_model *in; int draws); #define apop_model_entropy(...) apop_varad_link(apop_model_entropy, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC long double apop_kl_divergence(apop_model *from, apop_model *to, int draw_ct, gsl_rng *rng) ; #else long double apop_kl_divergence_base(apop_model *from, apop_model *to, int draw_ct, gsl_rng *rng) ; apop_varad_declare(long double, apop_kl_divergence, apop_model *from; apop_model *to; int draw_ct; gsl_rng *rng); #define apop_kl_divergence(...) apop_varad_link(apop_kl_divergence, __VA_ARGS__) #endif apop_data *apop_estimate_coefficient_of_determination (apop_model *); void apop_estimate_parameter_tests (apop_model *est); //Bootstrapping & RNG apop_data * apop_jackknife_cov(apop_data *data, apop_model *model); #ifdef APOP_NO_VARIADIC apop_data * apop_bootstrap_cov(apop_data *data, apop_model *model, gsl_rng* rng, int iterations, char keep_boots, char ignore_nans, apop_data **boot_store) ; #else apop_data * apop_bootstrap_cov_base(apop_data *data, apop_model *model, gsl_rng* rng, int iterations, char keep_boots, char ignore_nans, apop_data **boot_store) ; apop_varad_declare(apop_data *, apop_bootstrap_cov, apop_data *data; apop_model *model; gsl_rng* rng; int iterations; char keep_boots; char ignore_nans; apop_data **boot_store); #define apop_bootstrap_cov(...) apop_varad_link(apop_bootstrap_cov, __VA_ARGS__) #endif gsl_rng *apop_rng_alloc(int seed); double apop_rng_GHgB3(gsl_rng * r, double* a); //in apop_asst.c #define apop_rng_get_thread(thread_in) apop_rng_get_thread_base(#thread_in[0]=='\0' ? 
-1: (thread_in+0))
gsl_rng *apop_rng_get_thread_base(int thread);

int apop_arms_draw (double *out, gsl_rng *r, apop_model *m);

// maximum likelihood estimation related functions
#ifdef APOP_NO_VARIADIC
gsl_vector * apop_numerical_gradient(apop_data * data, apop_model* model, double delta) ;
#else
gsl_vector * apop_numerical_gradient_base(apop_data * data, apop_model* model, double delta) ;
apop_varad_declare(gsl_vector *, apop_numerical_gradient, apop_data * data; apop_model* model; double delta);
#define apop_numerical_gradient(...) apop_varad_link(apop_numerical_gradient, __VA_ARGS__)
#endif

#ifdef APOP_NO_VARIADIC
apop_data * apop_model_hessian(apop_data * data, apop_model *model, double delta) ;
#else
apop_data * apop_model_hessian_base(apop_data * data, apop_model *model, double delta) ;
apop_varad_declare(apop_data *, apop_model_hessian, apop_data * data; apop_model *model; double delta);
#define apop_model_hessian(...) apop_varad_link(apop_model_hessian, __VA_ARGS__)
#endif

#ifdef APOP_NO_VARIADIC
apop_data * apop_model_numerical_covariance(apop_data * data, apop_model *model, double delta) ;
#else
apop_data * apop_model_numerical_covariance_base(apop_data * data, apop_model *model, double delta) ;
apop_varad_declare(apop_data *, apop_model_numerical_covariance, apop_data * data; apop_model *model; double delta);
#define apop_model_numerical_covariance(...) apop_varad_link(apop_model_numerical_covariance, __VA_ARGS__)
#endif

void apop_maximum_likelihood(apop_data * data, apop_model *dist);

#ifdef APOP_NO_VARIADIC
apop_model * apop_estimate_restart (apop_model *e, apop_model *copy, char * starting_pt, double boundary) ;
#else
apop_model * apop_estimate_restart_base(apop_model *e, apop_model *copy, char * starting_pt, double boundary) ;
apop_varad_declare(apop_model *, apop_estimate_restart, apop_model *e; apop_model *copy; char * starting_pt; double boundary);
#define apop_estimate_restart(...)
apop_varad_link(apop_estimate_restart, __VA_ARGS__) #endif //in apop_linear_constraint.c #ifdef APOP_NO_VARIADIC long double apop_linear_constraint(gsl_vector *beta, apop_data * constraint, double margin) ; #else long double apop_linear_constraint_base(gsl_vector *beta, apop_data * constraint, double margin) ; apop_varad_declare(long double, apop_linear_constraint, gsl_vector *beta; apop_data * constraint; double margin); #define apop_linear_constraint(...) apop_varad_link(apop_linear_constraint, __VA_ARGS__) #endif //in apop_model_fix_params.c apop_model * apop_model_fix_params(apop_model *model_in); apop_model * apop_model_fix_params_get_base(apop_model *model_in); //////vtables /** \cond doxy_ignore */ /* This declares the vtable macros for each procedure that uses the mechanism. --We want to have type-checking on the functions put into the vtables. Type checking happens only with functions, not macros, so we need a type_check function for every vtable. --Only once in your codebase, you'll need to #define Declare_type_checking_fns to actually define the type checking function. Everywhere else, the function is merely declared. --All other uses point to having a macro, such as using __VA_ARGS__ to allow any sort of inputs to the hash. --We want to have such a macro for every vtable. That means that we need a macro to write macros. We can't do that with C macros, so this file uses m4 macros to generate C macros. --After the m4 definition of make_vtab_fns, each new vtable requires a typedef, a hash definition, and a call to make_vtab_fns to do the rest. */ int apop_vtable_add(char const *tabname, void *fn_in, unsigned long hash); void *apop_vtable_get(char const *tabname, unsigned long hash); int apop_vtable_drop(char const *tabname, unsigned long hash); typedef apop_model *(*apop_update_type)(apop_data *, apop_model* , apop_model*); #define apop_update_hash(m1, m2) ( \ ((m1)->log_likelihood ? (size_t)(m1)->log_likelihood : \ (m1)->p ? 
(size_t)(m1)->p*33 : \ (m1)->draw ? (size_t)(m1)->draw*33*27 \ : 33*27*19) \ +((m2)->log_likelihood ? (size_t)(m2)->log_likelihood : \ (m2)->p ? (size_t)(m2)->p*33 : \ (m2)->draw ? (size_t)(m2)->draw*33*27 \ : 33*27*19 \ ) * 37) #ifdef Declare_type_checking_fns void apop_update_type_check(apop_update_type in){ }; #else void apop_update_type_check(apop_update_type in); #endif #define apop_update_vtable_add(fn, ...) apop_update_type_check(fn), apop_vtable_add("apop_update", fn, apop_update_hash(__VA_ARGS__)) #define apop_update_vtable_get(...) apop_vtable_get("apop_update", apop_update_hash(__VA_ARGS__)) #define apop_update_vtable_drop(...) apop_vtable_drop("apop_update", apop_update_hash(__VA_ARGS__)) typedef long double (*apop_entropy_type)(apop_model *model); #define apop_entropy_hash(m1) ((size_t)(m1)->log_likelihood + 33 * (size_t)((m1)->p) + 27*(size_t)((m1)->draw)) #ifdef Declare_type_checking_fns void apop_entropy_type_check(apop_entropy_type in){ }; #else void apop_entropy_type_check(apop_entropy_type in); #endif #define apop_entropy_vtable_add(fn, ...) apop_entropy_type_check(fn), apop_vtable_add("apop_entropy", fn, apop_entropy_hash(__VA_ARGS__)) #define apop_entropy_vtable_get(...) apop_vtable_get("apop_entropy", apop_entropy_hash(__VA_ARGS__)) #define apop_entropy_vtable_drop(...) apop_vtable_drop("apop_entropy", apop_entropy_hash(__VA_ARGS__)) typedef void (*apop_score_type)(apop_data *d, gsl_vector *gradient, apop_model *params); #define apop_score_hash(m1) ((size_t)((m1)->log_likelihood ? (m1)->log_likelihood : (m1)->p)) #ifdef Declare_type_checking_fns void apop_score_type_check(apop_score_type in){ }; #else void apop_score_type_check(apop_score_type in); #endif #define apop_score_vtable_add(fn, ...) apop_score_type_check(fn), apop_vtable_add("apop_score", fn, apop_score_hash(__VA_ARGS__)) #define apop_score_vtable_get(...) apop_vtable_get("apop_score", apop_score_hash(__VA_ARGS__)) #define apop_score_vtable_drop(...) 
apop_vtable_drop("apop_score", apop_score_hash(__VA_ARGS__))

typedef apop_model* (*apop_parameter_model_type)(apop_data *, apop_model *);
#define apop_parameter_model_hash(m1) ((size_t)((m1)->log_likelihood ? (m1)->log_likelihood : (m1)->p)*33 + ((m1)->estimate ? (size_t)(m1)->estimate : 27))
#ifdef Declare_type_checking_fns
void apop_parameter_model_type_check(apop_parameter_model_type in){ };
#else
void apop_parameter_model_type_check(apop_parameter_model_type in);
#endif
#define apop_parameter_model_vtable_add(fn, ...) apop_parameter_model_type_check(fn), apop_vtable_add("apop_parameter_model", fn, apop_parameter_model_hash(__VA_ARGS__))
#define apop_parameter_model_vtable_get(...) apop_vtable_get("apop_parameter_model", apop_parameter_model_hash(__VA_ARGS__))
#define apop_parameter_model_vtable_drop(...) apop_vtable_drop("apop_parameter_model", apop_parameter_model_hash(__VA_ARGS__))

typedef apop_data * (*apop_predict_type)(apop_data *d, apop_model *params);
#define apop_predict_hash(m1) ((size_t)((m1)->log_likelihood ? (m1)->log_likelihood : (m1)->p)*33 + ((m1)->estimate ? (size_t)(m1)->estimate : 27))
#ifdef Declare_type_checking_fns
void apop_predict_type_check(apop_predict_type in){ };
#else
void apop_predict_type_check(apop_predict_type in);
#endif
#define apop_predict_vtable_add(fn, ...) apop_predict_type_check(fn), apop_vtable_add("apop_predict", fn, apop_predict_hash(__VA_ARGS__))
#define apop_predict_vtable_get(...) apop_vtable_get("apop_predict", apop_predict_hash(__VA_ARGS__))
#define apop_predict_vtable_drop(...) apop_vtable_drop("apop_predict", apop_predict_hash(__VA_ARGS__))

typedef void (*apop_model_print_type)(apop_model *params, FILE *out);
#define apop_model_print_hash(m1) ((m1)->log_likelihood ? (size_t)(m1)->log_likelihood : \
    (m1)->p ? (size_t)(m1)->p*33 : \
    (m1)->estimate ? (size_t)(m1)->estimate*33*33 : \
    (m1)->draw ? (size_t)(m1)->draw*33*27 : \
    (m1)->cdf ?
(size_t)(m1)->cdf*27*27 \ : 27) #ifdef Declare_type_checking_fns void apop_model_print_type_check(apop_model_print_type in){ }; #else void apop_model_print_type_check(apop_model_print_type in); #endif #define apop_model_print_vtable_add(fn, ...) apop_model_print_type_check(fn), apop_vtable_add("apop_model_print", fn, apop_model_print_hash(__VA_ARGS__)) #define apop_model_print_vtable_get(...) apop_vtable_get("apop_model_print", apop_model_print_hash(__VA_ARGS__)) #define apop_model_print_vtable_drop(...) apop_vtable_drop("apop_model_print", apop_model_print_hash(__VA_ARGS__)) /** \endcond */ //End of Doxygen ignore. //////Asst long double apop_generalized_harmonic(int N, double s) __attribute__ ((__pure__)); apop_data * apop_test_anova_independence(apop_data *d); #define apop_test_ANOVA_independence(d) apop_test_anova_independence(d) #ifdef APOP_NO_VARIADIC int apop_regex(const char *string, const char* regex, apop_data **substrings, const char use_case) ; #else int apop_regex_base(const char *string, const char* regex, apop_data **substrings, const char use_case) ; apop_varad_declare(int, apop_regex, const char *string; const char* regex; apop_data **substrings; const char use_case); #define apop_regex(...) apop_varad_link(apop_regex, __VA_ARGS__) #endif int apop_system(const char *fmt, ...) 
__attribute__ ((format (printf,1,2))); //Histograms and PMFs gsl_vector * apop_vector_moving_average(gsl_vector *, size_t); apop_data * apop_histograms_test_goodness_of_fit(apop_model *h0, apop_model *h1); apop_data * apop_test_kolmogorov(apop_model *m1, apop_model *m2); apop_data *apop_data_pmf_compress(apop_data *in); #ifdef APOP_NO_VARIADIC apop_data * apop_data_to_bins(apop_data const *indata, apop_data const *binspec, int bin_count, char close_top_bin) ; #else apop_data * apop_data_to_bins_base(apop_data const *indata, apop_data const *binspec, int bin_count, char close_top_bin) ; apop_varad_declare(apop_data *, apop_data_to_bins, apop_data const *indata; apop_data const *binspec; int bin_count; char close_top_bin); #define apop_data_to_bins(...) apop_varad_link(apop_data_to_bins, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC apop_model * apop_model_to_pmf(apop_model *model, apop_data *binspec, long int draws, int bin_count) ; #else apop_model * apop_model_to_pmf_base(apop_model *model, apop_data *binspec, long int draws, int bin_count) ; apop_varad_declare(apop_model *, apop_model_to_pmf, apop_model *model; apop_data *binspec; long int draws; int bin_count); #define apop_model_to_pmf(...) apop_varad_link(apop_model_to_pmf, __VA_ARGS__) #endif //text conveniences #ifdef APOP_NO_VARIADIC char* apop_text_paste(apop_data const*strings, char *between, char *before, char *after, char *between_cols, int (*prune)(apop_data* , int , int , void*), void* prune_parameter) ; #else char* apop_text_paste_base(apop_data const*strings, char *between, char *before, char *after, char *between_cols, int (*prune)(apop_data* , int , int , void*), void* prune_parameter) ; apop_varad_declare(char*, apop_text_paste, apop_data const*strings; char *between; char *before; char *after; char *between_cols; int (*prune)(apop_data* , int , int , void*); void* prune_parameter); #define apop_text_paste(...) 
apop_varad_link(apop_text_paste, __VA_ARGS__)
#endif

/** Notify the user of errors, warnings, or debug info.

 Writes to \ref apop_opts.log_file, which is a \c FILE handle. The default is \c stderr, but use \c fopen to attach to a file.

 \param verbosity At what verbosity level should the user be warned? E.g., if verbosity==2, then print iff apop_opts.verbose >= 2.
 \param ... The message to write to the log (presuming the verbosity level is high enough). This can be a printf-style format with following arguments, e.g., Apop_notify(0, "Beta is currently %g", beta).
*/
#define Apop_notify(verbosity, ...) {\
    if (apop_opts.verbose != -1 && apop_opts.verbose >= (verbosity)) { \
        if (!apop_opts.log_file) apop_opts.log_file = stderr; \
        fprintf(apop_opts.log_file, "%s: ", __func__); fprintf(apop_opts.log_file, __VA_ARGS__); fprintf(apop_opts.log_file, "\n"); \
        fflush(apop_opts.log_file); \
} }

/** \cond doxy_ignore */
#define Apop_maybe_abort(level) \
    {if ((apop_opts.verbose >= (level) && apop_opts.stop_on_warning == 'v') \
         || (apop_opts.stop_on_warning=='w') ) \
        raise(SIGTRAP);}
/** \endcond */

/** Execute an action and print a message to the current \c FILE handle held by apop_opts.log_file (default: \c stderr).

 \param test The expression that, if true, triggers the action.
 \param onfail If \c test is true, do this. E.g., out->error='x'; return GSL_NAN. Notice that it is OK to include several lines of semicolon-separated code here, but if you have a lot to do, the most readable option may be goto outro, plus an appropriately-labeled section at the end of your function.
 \param level Print the warning message only if \ref apop_opts_type "apop_opts.verbose" is greater than or equal to this. Zero usually works, but for minor infractions use one, or for more verbose debugging output use 2.
 \param ... The error message in printf form, plus any arguments to be inserted into the printf string. I'll provide the function name and a trailing newline.
 Some examples:

 \code
 //the typical case, stopping function execution:
 Apop_stopif(isnan(x), return NAN, 0, "x is NAN; failing");

 //Mark a flag, go to a cleanup step
 Apop_stopif(x < 0, needs_cleanup=1; goto cleanup, 0, "x is %g; cleaning up and exiting.", x);

 //Print a diagnostic iff apop_opts.verbose>=1 and continue
 Apop_stopif(x < 0, , 1, "warning: x is %g.", x);
 \endcode

 \li If \c apop_opts.stop_on_warning is \c 'w', then a failed test halts via \c raise(SIGTRAP), even if the apop_opts.verbose level is set so that the warning message doesn't print to screen. Use this when running via debugger.
 \li If \c apop_opts.stop_on_warning is \c 'v', then a failed test halts via \c raise(SIGTRAP) iff the verbosity level is high enough to print the error.
*/
#define Apop_stopif(test, onfail, level, ...) do {\
    if (test) { \
        Apop_notify(level, __VA_ARGS__); \
        Apop_maybe_abort(level) \
        onfail; \
    } } while(0)

#define apop_errorlevel -5

/** \cond doxy_ignore */
//For use in stopif, to return a blank apop_data set with an error attached.
#define apop_return_data_error(E) {apop_data *out=apop_data_alloc(); out->error='E'; return out;}

/* The Apop_stopif macro is currently favored, but there's a long history of prior
   error-handling setups. Consider all of the Assert... macros below to be deprecated.
*/
#define Apop_assert_c(test, returnval, level, ...) \
    Apop_stopif(!(test), return returnval, level, __VA_ARGS__)

#define Apop_assert(test, ...) Apop_assert_c((test), 0, apop_errorlevel, __VA_ARGS__)

//For things that return void. Transitional and deprecated at birth.
#define Apop_assert_n(test, ...) Apop_assert_c((test), , apop_errorlevel, __VA_ARGS__)
#define Apop_assert_negone(test, ...) Apop_assert_c((test), -1, apop_errorlevel, __VA_ARGS__)
/** \endcond */ //End of Doxygen ignore.
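/* A minimal, self-contained sketch of the Apop_stopif pattern documented above.
   It uses nothing from Apophenia: Sketch_stopif, sketch_verbose,
   sketch_reciprocal, and sketch_sum_nonneg are hypothetical names made up for
   this illustration, and the sketch leaves out the log-file and
   stop_on_warning handling that the real macros do. It only shows why the
   do/while(0) wrapper and a caller-supplied onfail action are convenient.

```c
#include <math.h>
#include <stdio.h>

// Hypothetical stand-in for apop_opts.verbose; not part of Apophenia.
static int sketch_verbose = 0;

// Same shape as Apop_stopif: maybe report, then run the caller's action.
// The do { } while (0) wrapper makes the macro act as a single statement,
// so it is safe inside an unbraced if/else.
#define Sketch_stopif(test, onfail, level, ...) do { \
        if (test) { \
            if (sketch_verbose >= (level)) { \
                fprintf(stderr, "%s: ", __func__); \
                fprintf(stderr, __VA_ARGS__); \
                fprintf(stderr, "\n"); \
            } \
            onfail; \
        } } while (0)

// The typical case: report and leave the function with an error value.
double sketch_reciprocal(double x){
    Sketch_stopif(isnan(x), return NAN, 0, "x is NAN; failing");
    Sketch_stopif(x == 0, return NAN, 0, "x is zero; no reciprocal");
    return 1/x;
}

// Mark a flag and jump to a cleanup step on the first negative entry.
// The message has level 1, so it stays silent while sketch_verbose is 0.
int sketch_sum_nonneg(double const *in, int n, double *out){
    int failed = 0;
    double total = 0;
    for (int i = 0; i < n; i++){
        Sketch_stopif(in[i] < 0, failed = 1; goto cleanup, 1,
                      "element %i is %g; stopping", i, in[i]);
        total += in[i];
    }
cleanup:
    *out = failed ? NAN : total;
    return failed;
}
```

   Under these assumptions, sketch_reciprocal(0.0) writes one line to stderr
   and returns NAN, while sketch_sum_nonneg reports a negative entry only
   through its return value and the NAN it stores in *out.
*/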
//Missing data #ifdef APOP_NO_VARIADIC apop_data * apop_data_listwise_delete(apop_data *d, char inplace) ; #else apop_data * apop_data_listwise_delete_base(apop_data *d, char inplace) ; apop_varad_declare(apop_data *, apop_data_listwise_delete, apop_data *d; char inplace); #define apop_data_listwise_delete(...) apop_varad_link(apop_data_listwise_delete, __VA_ARGS__) #endif apop_model * apop_ml_impute(apop_data *d, apop_model* meanvar); #ifdef APOP_NO_VARIADIC apop_model *apop_model_metropolis(apop_data *d, gsl_rng* rng, apop_model *m); #else apop_model * apop_model_metropolis_base(apop_data *d, gsl_rng* rng, apop_model *m); apop_varad_declare(apop_model *, apop_model_metropolis, apop_data *d; gsl_rng* rng; apop_model *m); #define apop_model_metropolis(...) apop_varad_link(apop_model_metropolis, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC apop_model * apop_update(apop_data *data, apop_model *prior, apop_model *likelihood, gsl_rng *rng) ; #else apop_model * apop_update_base(apop_data *data, apop_model *prior, apop_model *likelihood, gsl_rng *rng) ; apop_varad_declare(apop_model *, apop_update, apop_data *data; apop_model *prior; apop_model *likelihood; gsl_rng *rng); #define apop_update(...) apop_varad_link(apop_update, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC double apop_test(double statistic, char *distribution, double p1, double p2, char tail) ; #else double apop_test_base(double statistic, char *distribution, double p1, double p2, char tail) ; apop_varad_declare(double, apop_test, double statistic; char *distribution; double p1; double p2; char tail); #define apop_test(...) 
apop_varad_link(apop_test, __VA_ARGS__) #endif //apop_sort.c #ifdef APOP_NO_VARIADIC apop_data *apop_data_sort(apop_data *data, apop_data *sort_order, char asc, char inplace, double *col_order); #else apop_data * apop_data_sort_base(apop_data *data, apop_data *sort_order, char asc, char inplace, double *col_order); apop_varad_declare(apop_data *, apop_data_sort, apop_data *data; apop_data *sort_order; char asc; char inplace; double *col_order); #define apop_data_sort(...) apop_varad_link(apop_data_sort, __VA_ARGS__) #endif //raking #ifdef APOP_NO_VARIADIC apop_data * apop_rake(char const *margin_table, char * const*var_list, int var_ct, char * const *contrasts, int contrast_ct, char const *structural_zeros, int max_iterations, double tolerance, char const *count_col, char const *init_table, char const *init_count_col, double nudge) ; #else apop_data * apop_rake_base(char const *margin_table, char * const*var_list, int var_ct, char * const *contrasts, int contrast_ct, char const *structural_zeros, int max_iterations, double tolerance, char const *count_col, char const *init_table, char const *init_count_col, double nudge) ; apop_varad_declare(apop_data *, apop_rake, char const *margin_table; char * const*var_list; int var_ct; char * const *contrasts; int contrast_ct; char const *structural_zeros; int max_iterations; double tolerance; char const *count_col; char const *init_table; char const *init_count_col; double nudge); #define apop_rake(...) 
apop_varad_link(apop_rake, __VA_ARGS__) #endif #include #include #include #include #include #include #include #include #include #include //Some linear algebra utilities double apop_det_and_inv(const gsl_matrix *in, gsl_matrix **out, int calc_det, int calc_inv); #ifdef APOP_NO_VARIADIC apop_data * apop_dot(const apop_data *d1, const apop_data *d2, char form1, char form2) ; #else apop_data * apop_dot_base(const apop_data *d1, const apop_data *d2, char form1, char form2) ; apop_varad_declare(apop_data *, apop_dot, const apop_data *d1; const apop_data *d2; char form1; char form2); #define apop_dot(...) apop_varad_link(apop_dot, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC int apop_vector_bounded(const gsl_vector *in, long double max) ; #else int apop_vector_bounded_base(const gsl_vector *in, long double max) ; apop_varad_declare(int, apop_vector_bounded, const gsl_vector *in; long double max); #define apop_vector_bounded(...) apop_varad_link(apop_vector_bounded, __VA_ARGS__) #endif gsl_matrix * apop_matrix_inverse(const gsl_matrix *in) ; double apop_matrix_determinant(const gsl_matrix *in) ; //apop_data* apop_sv_decomposition(gsl_matrix *data, int dimensions_we_want); #ifdef APOP_NO_VARIADIC apop_data * apop_matrix_pca(gsl_matrix *data, int const dimensions_we_want) ; #else apop_data * apop_matrix_pca_base(gsl_matrix *data, int const dimensions_we_want) ; apop_varad_declare(apop_data *, apop_matrix_pca, gsl_matrix *data; int const dimensions_we_want); #define apop_matrix_pca(...) apop_varad_link(apop_matrix_pca, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC gsl_vector * apop_vector_stack(gsl_vector *v1, gsl_vector const * v2, char inplace) ; #else gsl_vector * apop_vector_stack_base(gsl_vector *v1, gsl_vector const * v2, char inplace) ; apop_varad_declare(gsl_vector *, apop_vector_stack, gsl_vector *v1; gsl_vector const * v2; char inplace); #define apop_vector_stack(...) 
apop_varad_link(apop_vector_stack, __VA_ARGS__) #endif #ifdef APOP_NO_VARIADIC gsl_matrix * apop_matrix_stack(gsl_matrix *m1, gsl_matrix const * m2, char posn, char inplace) ; #else gsl_matrix * apop_matrix_stack_base(gsl_matrix *m1, gsl_matrix const * m2, char posn, char inplace) ; apop_varad_declare(gsl_matrix *, apop_matrix_stack, gsl_matrix *m1; gsl_matrix const * m2; char posn; char inplace); #define apop_matrix_stack(...) apop_varad_link(apop_matrix_stack, __VA_ARGS__) #endif void apop_vector_log(gsl_vector *v); void apop_vector_log10(gsl_vector *v); void apop_vector_exp(gsl_vector *v); ////Subsetting macros /** \cond doxy_ignore */ /** These are all deprecated.*/ #define APOP_SUBMATRIX(m, srow, scol, nrows, ncols, o) gsl_matrix apop_mm_##o = gsl_matrix_submatrix((m), (srow), (scol), (nrows),(ncols)).matrix;\ gsl_matrix * o = &( apop_mm_##o ); // Use \ref Apop_subm. #define Apop_submatrix APOP_SUBMATRIX #define Apop_col_v(m, col, v) gsl_vector apop_vv_##v = ((col) == -1) ? (gsl_vector){} : gsl_matrix_column((m)->matrix, (col)).vector;\ gsl_vector * v = ((col)==-1) ? (m)->vector : &( apop_vv_##v ); // Use \ref Apop_cv. #define Apop_row_v(m, row, v) Apop_matrix_row((m)->matrix, row, v) // Use \ref Apop_rv. #define Apop_rows(d, rownum, len, outd) apop_data *outd = Apop_rs(d, rownum, len) // Use \ref Apop_rs. #define Apop_row(d, row, outd) Apop_rows(d, row, 1, outd) // Use \ref Apop_r. #define Apop_cols(d, colnum, len, outd) apop_data *outd = Apop_cs(d, colnum, len); // Use \ref Apop_cs. /** \endcond */ //End of Doxygen ignore. /** \def Apop_row_tv(m, row_name, v) After this call, \c v will hold a \c gsl_vector view of an \ref apop_data set \c m. The view will consist only of the row with name \c row_name. Unlike \ref Apop_rv, the second argument is a row name, that I'll look up using \ref apop_name_find, and the third is the name of the view to be generated. 
\see Apop_rs, Apop_r, Apop_rv, Apop_row_t, Apop_mrv
*/
#define Apop_row_tv(m, row, v) gsl_vector apop_vv_##v = gsl_matrix_row((m)->matrix, apop_name_find((m)->names, row, 'r')).vector;\
    gsl_vector * v = &( apop_vv_##v );

/** \def Apop_col_tv(m, col_name, v)
 After this call, \c v will hold a \c gsl_vector view of the \ref apop_data set \c m.
 The view will consist only of the column with name \c col_name.
 Unlike \ref Apop_cv, the second argument is a column name, which I'll look up using \ref apop_name_find, and the third is the name of the view to be generated.
\see Apop_cs, Apop_c, Apop_cv, Apop_col_t, Apop_mcv
*/
#define Apop_col_tv(m, col, v) gsl_vector apop_vv_##v = gsl_matrix_column((m)->matrix, apop_name_find((m)->names, col, 'c')).vector;\
    gsl_vector * v = &( apop_vv_##v );

/** \def Apop_row_t(m, row_name, v)
 After this call, \c v will hold an \ref apop_data view of an \ref apop_data set \c m.
 The view will consist only of the row with name \c row_name.
 Unlike \ref Apop_r, the second argument is a row name, which I'll look up using \ref apop_name_find, and the third is the name of the view to be generated.
\see Apop_rs, Apop_r, Apop_rv, Apop_row_tv, Apop_mrv
*/
#define Apop_row_t(d, rowname, outd) int apop_row_##outd = apop_name_find((d)->names, rowname, 'r'); Apop_rows(d, apop_row_##outd, 1, outd)

/** \def Apop_col_t(m, col_name, v)
 After this call, \c v will hold an \ref apop_data view of the \ref apop_data set \c m.
 The view will consist only of the column with name \c col_name.
 Unlike \ref Apop_c, the second argument is a column name, which I'll look up using \ref apop_name_find, and the third is the name of the view to be generated.
\see Apop_cs, Apop_c, Apop_cv, Apop_col_tv, Apop_mcv
*/
#define Apop_col_t(d, colname, outd) int apop_col_##outd = apop_name_find((d)->names, colname, 'c'); Apop_cols(d, apop_col_##outd, 1, outd)

// The above versions relied on gsl_views, which stick to C as of 1989 CE.
// Better to just create the views via designated initializers.

/** \def Apop_subm(matrix_to_view, srow, scol, nrows, ncols)
Generate a view of a submatrix within a \c gsl_matrix. Like \ref Apop_r, et al., the view is an automatically-allocated variable that is lost once the program flow leaves the scope in which it is declared.

\param matrix_to_view The root matrix
\param srow the first row (in the root matrix) of the top of the submatrix
\param scol the first column (in the root matrix) of the left edge of the submatrix
\param nrows number of rows in the submatrix
\param ncols number of columns in the submatrix
\return A pointer to an automatically-allocated \c gsl_matrix view, or \c NULL if the root matrix is \c NULL or the requested block does not fit inside it.
*/
#define Apop_subm(matrix_to_view, srow, scol, nrows, ncols)( \
    (!(matrix_to_view) \
     || (matrix_to_view)->size1 < (srow)+(nrows) || (srow) < 0 \
     || (matrix_to_view)->size2 < (scol)+(ncols) || (scol) < 0) ? NULL \
    : &(gsl_matrix){.size1=(nrows), .size2=(ncols), \
          .tda=(matrix_to_view)->tda, \
          .data=gsl_matrix_ptr((matrix_to_view), (srow), (scol))} \
    )

/** Get a vector view of a single row of a \ref gsl_matrix.

\param matrix_to_view A \ref gsl_matrix.
\param row An integer giving the row to be viewed.
\return A \c gsl_vector view of the given row. The view is automatically allocated, and disappears as soon as the program leaves the scope in which it is declared.

See \ref apop_vector_correlation for an example of use.
\see Apop_r, Apop_rv
*/
#define Apop_mrv(matrix_to_view, row) Apop_rv(&(apop_data){.matrix=(matrix_to_view)}, row)

/** Get a vector view of a single column of a \ref gsl_matrix.

\param matrix_to_view A \ref gsl_matrix.
\param col An integer giving the column to be viewed.
\return A \c gsl_vector view of the given column. The view is automatically allocated, and disappears as soon as the program leaves the scope in which it is declared.
\code
gsl_matrix *m = apop_query_to_data("select col1, col2, col3 from data")->matrix;
printf("The correlation coefficient between columns two "
       "and three is %g.\n", apop_vector_correlation(Apop_mcv(m, 1), Apop_mcv(m, 2)));
\endcode
\see Apop_c, Apop_cv
*/
#define Apop_mcv(matrix_to_view, col) Apop_cv(&(apop_data){.matrix=(matrix_to_view)}, col)

/** \def Apop_rv(d, row)
A macro to generate a temporary one-row view of the matrix in an \ref apop_data set \c d, pulling out only row \c row. The view is a \c gsl_vector.

\code
//pull a single row
gsl_vector *v = Apop_rv(your_data, 0);

//or loop through every row
for (int i=0; i< your_data->matrix->size1; i++)
    printf("Σ_%i = %Lg\n", i, apop_vector_sum(Apop_rv(your_data, i)));
\endcode

The view is automatically allocated, and disappears as soon as the program leaves the scope in which it is declared.
\see Apop_rs, Apop_r, Apop_row_tv, Apop_row_t, Apop_mrv
*/
#define Apop_rv(data_to_view, row) ( \
    ((data_to_view) == NULL || (data_to_view)->matrix == NULL \
     || (data_to_view)->matrix->size1 <= (row) || (row) < 0) ? NULL \
    : &(gsl_vector){.size=(data_to_view)->matrix->size2, \
          .stride=1, .data=gsl_matrix_ptr((data_to_view)->matrix, (row), 0)} \
    )

/** \def Apop_cv(d, col)
A macro to generate a temporary one-column view of the matrix in an \ref apop_data set \c d, pulling out only column \c col. The view is a \c gsl_vector.

As usual, column -1 is the vector element of the \ref apop_data set.

\code
//pull a single column
gsl_vector *v = Apop_cv(your_data, 0);

//or loop through every column
for (int i=0; i< your_data->matrix->size2; i++)
    printf("Σ_%i = %Lg\n", i, apop_vector_sum(Apop_cv(your_data, i)));
\endcode

The view is automatically allocated, and disappears as soon as the program leaves the scope in which it is declared.
\see Apop_cs, Apop_c, Apop_col_tv, Apop_col_t, Apop_mcv
*/
#define Apop_cv(data_to_view, col) ( \
    !(data_to_view) ? NULL \
    : (col)==-1 ? (data_to_view)->vector \
    : (!(data_to_view)->matrix \
       || (data_to_view)->matrix->size2 <= (col) || ((int)(col)) < -1) ?
      NULL \
    : &(gsl_vector){.size=(data_to_view)->matrix->size1, \
          .stride=(data_to_view)->matrix->tda, .data=gsl_matrix_ptr((data_to_view)->matrix, 0, (col))} \
    )

/** \cond doxy_ignore */
/* Not (yet) for public use. */
#define Apop_subvector(v, start, len) ( \
    ((v) == NULL || (v)->size < ((start)+(len)) || (start) < 0) ? NULL \
    : &(gsl_vector){.size=(len), .stride=(v)->stride, .data=(v)->data+((start)*(v)->stride)})
/** \endcond */

/** \def Apop_rs(d, row, len)
A macro to generate a temporary view of \ref apop_data set \c d pulling only certain rows, beginning at row \c row and having height \c len.

The view is automatically allocated, and disappears as soon as the program leaves the scope in which it is declared.
\see Apop_r, Apop_rv, Apop_row_tv, Apop_row_t, Apop_mrv
*/
#define Apop_rs(d, rownum, len)( \
    (!(d) || (rownum) < 0) ? NULL \
    : &(apop_data){ \
        .names= ( !((d)->names) ? NULL : \
            &(apop_name){ \
                .title = (d)->names->title, \
                .vector = (d)->names->vector, \
                .col = (d)->names->col, \
                .row = ((d)->names->row && (d)->names->rowct > (rownum)) ? &((d)->names->row[(rownum)]) : NULL, \
                .text = (d)->names->text, \
                .colct = (d)->names->colct, \
                .rowct = (d)->names->row ? (GSL_MIN((int)(len), GSL_MAX((d)->names->rowct - (int)(rownum), 0))) \
                                         : 0, \
                .textct = (d)->names->textct }), \
        .vector= Apop_subvector(((d)->vector), (rownum), (len)), \
        .matrix = Apop_subm(((d)->matrix), (rownum), 0, (len), (d)->matrix?(d)->matrix->size2:0), \
        .weights = Apop_subvector(((d)->weights), (rownum), (len)), \
        .textsize[0]=(d)->textsize[0]> (rownum)+(len)-1 ? (len) : 0, \
        .textsize[1]=(d)->textsize[1], \
        .text = (d)->text ? &((d)->text[(rownum)]) : NULL, \
    })

/** \def Apop_cs(d, col, len)
A macro to generate a temporary view of \ref apop_data set \c d including only certain columns, beginning at column \c col and having length \c len.

The view is automatically allocated, and disappears as soon as the program leaves the scope in which it is declared.
\see Apop_c, Apop_cv, Apop_col_tv, Apop_col_t, Apop_mcv */ #define Apop_cs(d, colnum, len) ( \ (!(d)||!(d)->matrix || (d)->matrix->size2 <= (colnum)+(len)-1) \ ? NULL \ : &(apop_data){ \ .vector= NULL, \ .weights= (d)->weights, \ .matrix = Apop_subm((d)->matrix, 0, colnum, (d)->matrix->size1, (len)),\ .textsize[0] = 0, \ .textsize[1] = 0, \ .text = NULL, \ .names= (d)->names ? &(apop_name){ \ .title = (d)->names->title, \ .vector = NULL, \ .row = (d)->names->row, \ .col = ((d)->names->col && (d)->names->colct > colnum) ? &((d)->names->col[colnum]) : NULL, \ .text = NULL, \ .rowct = (d)->names->rowct, \ .colct = (d)->names->col ? (GSL_MIN(len, GSL_MAX((d)->names->colct - colnum, 0))) \ : 0, \ .textct = (d)->names->textct } : NULL \ }) /** \def Apop_r(d, row) A macro to generate a temporary one-row view of \ref apop_data set \c d, pulling out only row \c row. The view is also an \ref apop_data set, with names and other decorations. \code //pull a single row apop_data *v = Apop_r(your_data, 7); //or loop through a sequence of one-row data sets. apop_model *std = apop_model_set_parameters(apop_normal, 0, 1); for (int i=0; i< your_data->matrix->size1; i++) printf("Std Normal CDF up to observation %i is %g\n", i, apop_cdf(Apop_r(your_data, i), std)); \endcode The view is automatically allocated, and disappears as soon as the program leaves the scope in which it is declared. \see Apop_rs, Apop_row_v, Apop_row_tv, Apop_row_t, Apop_mrv */ #define Apop_r(d, rownum) Apop_rs(d, rownum, 1) /** \def Apop_c(d, col) A macro to generate a temporary one-column view of \ref apop_data set \c d, pulling out only column \c col. After this call, \c outd will be a pointer to this temporary view, that you can use as you would any \ref apop_data set. 
\see Apop_cs, Apop_cv, Apop_col_tv, Apop_col_t, Apop_mcv */ #define Apop_c(d, col) Apop_cs(d, col, 1) /** \cond doxy_ignore */ #define APOP_COL Apop_col #define apop_col Apop_col #define APOP_COL_T Apop_col_t #define apop_col_t Apop_col_t #define APOP_COL_TV Apop_col_tv #define apop_col_tv Apop_col_tv #define APOP_ROW Apop_row #define apop_row Apop_row #define APOP_COLS Apop_cols #define apop_cols Apop_cols #define APOP_COL_V Apop_col_v #define apop_col_v Apop_col_v #define APOP_ROW_V Apop_row_v #define apop_row_v Apop_row_v #define APOP_ROWS Apop_rows #define apop_rows Apop_rows #define Apop_data_row Apop_row #deprecated #define APOP_ROW_T Apop_row_t #define apop_row_t Apop_row_t #define APOP_ROW_TV Apop_row_tv #define apop_row_tv Apop_row_tv /** Deprecated. Use Apop_mrv */ #define Apop_matrix_row(m, row, v) gsl_vector apop_vv_##v = gsl_matrix_row((m), (row)).vector;\ gsl_vector * v = &( apop_vv_##v ); /* Deprecated. Use Apop_mcv */ #define Apop_matrix_col(m, col, v) gsl_vector apop_vv_##v = gsl_matrix_column((m), (col)).vector;\ gsl_vector * v = &( apop_vv_##v ); #define APOP_MATRIX_ROW Apop_matrix_row #define apop_matrix_row Apop_matrix_row #define APOP_MATRIX_COL Apop_matrix_col #define apop_matrix_col Apop_matrix_col /** \endcond */ long double apop_vector_sum(const gsl_vector *in); double apop_vector_var_m(const gsl_vector *in, const double mean); #ifdef APOP_NO_VARIADIC double apop_vector_correlation(const gsl_vector *ina, const gsl_vector *inb, const gsl_vector *weights) ; #else double apop_vector_correlation_base(const gsl_vector *ina, const gsl_vector *inb, const gsl_vector *weights) ; apop_varad_declare(double, apop_vector_correlation, const gsl_vector *ina; const gsl_vector *inb; const gsl_vector *weights); #define apop_vector_correlation(...) 
apop_varad_link(apop_vector_correlation, __VA_ARGS__) #endif double apop_vector_kurtosis(const gsl_vector *in); double apop_vector_skew(const gsl_vector *in); #define apop_sum apop_vector_sum #define apop_var apop_vector_var #define apop_mean apop_vector_mean //////database utilities #ifdef APOP_NO_VARIADIC int apop_table_exists(char const *name, char remove) ; #else int apop_table_exists_base(char const *name, char remove) ; apop_varad_declare(int, apop_table_exists, char const *name; char remove); #define apop_table_exists(...) apop_varad_link(apop_table_exists, __VA_ARGS__) #endif int apop_db_open(char const *filename); #ifdef APOP_NO_VARIADIC int apop_db_close(char vacuum) ; #else int apop_db_close_base(char vacuum) ; apop_varad_declare(int, apop_db_close, char vacuum); #define apop_db_close(...) apop_varad_link(apop_db_close, __VA_ARGS__) #endif int apop_query(const char *q, ...) __attribute__ ((format (printf,1,2))); apop_data * apop_query_to_text(const char * fmt, ...) __attribute__ ((format (printf,1,2))); apop_data * apop_query_to_data(const char * fmt, ...) __attribute__ ((format (printf,1,2))); apop_data * apop_query_to_mixed_data(const char *typelist, const char * fmt, ...) __attribute__ ((format (printf,2,3))); gsl_vector * apop_query_to_vector(const char * fmt, ...) __attribute__ ((format (printf,1,2))); double apop_query_to_float(const char * fmt, ...) 
__attribute__ ((format (printf,1,2))); int apop_data_to_db(const apop_data *set, const char *tabname, char); //////Settings groups //Part I: macros and fns for getting/setting settings groups and elements /** \cond doxy_ignore */ void * apop_settings_get_grp(apop_model *m, char *type, char fail); void apop_settings_remove_group(apop_model *m, char *delme); void apop_settings_copy_group(apop_model *outm, apop_model *inm, char *copyme); void *apop_settings_group_alloc(apop_model *model, char *type, void *free_fn, void *copy_fn, void *the_group); apop_model *apop_settings_group_alloc_wm(apop_model *model, char *type, void *free_fn, void *copy_fn, void *the_group); /** \endcond */ //End of Doxygen ignore. /** Retrieves a settings group from a model. See \ref Apop_settings_get to just pull a single item from within the settings group. This macro returns NULL if a group of type \c type_settings isn't found attached to model \c m, so you can easily put it in a conditional like \code if (!apop_settings_get_group(m, "apop_ols")) ... \endcode \param m An \ref apop_model \param type A string giving the type of the settings group you are retrieving. E.g., for an \ref apop_mle_settings group, use only \c apop_mle. \return A void pointer to the desired struct (or \c NULL if not found). */ #define Apop_settings_get_group(m, type) apop_settings_get_grp(m, #type, 'c') /** Removes a settings group from a model's list. \li If the so-named group is not found, do nothing. */ #define Apop_settings_rm_group(m, type) apop_settings_remove_group(m, #type) /** Add a settings group. The first two arguments (the model you are attaching to and the settings group name) are mandatory, and then you can use the \ref designated syntax to specify default values (if any). \return A pointer to the newly-prepped group. See \ref modelsettings or \ref maxipage for examples. \li If a settings group of the given type is already attached to the model, the previous version is removed. 
Use \ref Apop_settings_get to check whether a group of the given type is already attached to a model, and \ref Apop_settings_set to modify an existing group. */ #define Apop_settings_add_group(model, type, ...) \ apop_settings_group_alloc(model, #type, type ## _settings_free, type ## _settings_copy, type ##_settings_init ((type ## _settings) {__VA_ARGS__})) /** Copy a model and add a settings group. Useful for models that require a settings group to function. See \ref Apop_settings_add_group. \return A pointer to the newly-prepped model. */ #define apop_model_copy_set(model, type, ...) \ apop_settings_group_alloc_wm(apop_model_copy(model), #type, type ## _settings_free, type ## _settings_copy, type ##_settings_init ((type ## _settings) {__VA_ARGS__})) /** This is the complement to \ref apop_model_set_parameters, for those models that are set up by adding settings group, rather than filling in a list of parameters. For example, the \ref apop_kernel_density model is built by adding a \ref apop_kernel_density_settings group. From the example on the \ref apop_kernel_density page: \code apop_model *k2 = apop_model_set_settings(apop_kernel_density, .base_data=d, .set_fn = set_uniform_edges, .kernel = apop_uniform); \endcode The name of the model and the settings group to be built must match, which is the case for many model transformations, including \ref apop_dconstrain and \ref apop_cross. If the names do not match, use \ref apop_model_copy_set. */ #define Apop_model_set_settings(model, ...) \ apop_settings_group_alloc_wm(apop_model_copy(model), #model, model ## _settings_free, model ## _settings_copy, model ##_settings_init ((model ## _settings) {__VA_ARGS__})) #define apop_model_set_settings Apop_model_set_settings /** Retrieves a setting from a model. See \ref Apop_settings_get_group to pull the entire group. \param model An \ref apop_model. \param type A string giving the type of the settings group you are retrieving, without the \c _settings ending. 
E.g., for an \ref apop_mle_settings group, use \c apop_mle. \param setting The struct element you want to retrieve. */ #define Apop_settings_get(model, type, setting) \ (((type ## _settings *) apop_settings_get_grp(model, #type, 'f'))->setting) /** Modifies a single element of a settings group to the given value. \li If model==NULL, fails silently. \li If model!=NULL but the given settings group is not found attached to the model, set model->error='s'. */ #define Apop_settings_set(model, type, setting, data) \ do { \ if (!(model)) continue; /* silent fail. */ \ type ## _settings *apop_tmp_settings = apop_settings_get_grp(model, #type, 'c'); \ Apop_stopif(!apop_tmp_settings, (model)->error='s', 0, "You're trying to modify a setting in " \ #model "'s setting group of type " #type " but that model doesn't have such a group."); \ apop_tmp_settings->setting = (data); \ } while (0); /** \cond doxy_ignore */ #define Apop_settings_add Apop_settings_set #define APOP_SETTINGS_ADD Apop_settings_set #define apop_settings_set Apop_settings_set #define APOP_SETTINGS_GET Apop_settings_get #define apop_settings_get Apop_settings_get #define APOP_SETTINGS_ADD_GROUP Apop_settings_add_group #define apop_settings_add_group Apop_settings_add_group #define APOP_SETTINGS_GET_GROUP Apop_settings_get_group #define apop_settings_get_group Apop_settings_get_group #define APOP_SETTINGS_RM_GROUP Apop_settings_rm_group #define apop_settings_rm_group Apop_settings_rm_group #define Apop_model_copy_set apop_model_copy_set //deprecated: #define Apop_model_add_group Apop_settings_add_group /** \endcond */ //End of Doxygen ignore. /** Put this in your header file to declare the init, copy, and free functions for ysg_settings. Of course, these functions will also have to be defined in a .c file using \ref Apop_settings_init, \ref Apop_settings_copy, and \ref Apop_settings_free. 
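A rough sketch of what the three declared functions amount to, written out by hand for a hypothetical group \c ysg_settings with one scalar option and one heap-allocated buffer. The struct, its fields, and the default of 10 are assumptions made for illustration; in practice the bodies would be generated with \ref Apop_settings_init, \ref Apop_settings_copy, and \ref Apop_settings_free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical settings group: one scalar option and one owned buffer.
typedef struct { int draws; double *scratch; } ysg_settings;

// Init: copy the caller's struct to the heap, then fill defaults for unset (zero) fields.
ysg_settings *ysg_settings_init(ysg_settings in){
    ysg_settings *out = malloc(sizeof(ysg_settings));
    assert(out);
    *out = in;
    out->draws = in.draws ? in.draws : 10;   // assumed default, for illustration
    out->scratch = calloc(out->draws, sizeof(double));
    return out;
}

// Copy: shallow-copy the struct, then duplicate the owned buffer.
void *ysg_settings_copy(ysg_settings *in){
    ysg_settings *out = malloc(sizeof(ysg_settings));
    assert(out);
    *out = *in;
    out->scratch = malloc(in->draws * sizeof(double));
    memcpy(out->scratch, in->scratch, in->draws * sizeof(double));
    return out;
}

// Free: release owned members first, then the struct itself.
void ysg_settings_free(ysg_settings *in){
    free(in->scratch);
    free(in);
}
```

The signatures match the prototypes this macro declares: init takes the struct by value, while copy and free take a pointer.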
*/ #define Apop_settings_declarations(ysg) \ ysg##_settings * ysg##_settings_init(ysg##_settings); \ void * ysg##_settings_copy(ysg##_settings *); \ void ysg##_settings_free(ysg##_settings *); /** A convenience macro for declaring the initialization function for a new settings group. See \ref settingswriting for details and an example. */ #define Apop_settings_init(name, ...) \ name##_settings *name##_settings_init(name##_settings in) { \ name##_settings *out = malloc(sizeof(name##_settings)); \ *out = in; \ __VA_ARGS__; \ return out; \ } /** \cond doxy_ignore */ #define Apop_varad_set(var, value) (out)->var = (in).var ? (in).var : (value); /** \endcond */ /** A convenience macro for declaring the copy function for a new settings group. See \ref settingswriting for details and an example. */ #define Apop_settings_copy(name, ...) \ void * name##_settings_copy(name##_settings *in) {\ name##_settings *out = malloc(sizeof(name##_settings)); \ *out = *in; \ __VA_ARGS__; \ return out; \ } /** A convenience macro for declaring the delete function for a new settings group. See \ref settingswriting for details and an example. */ #define Apop_settings_free(name, ...) \ void name##_settings_free(name##_settings *in) {\ __VA_ARGS__; \ free(in); \ } //Part II: the details of extant settings groups. /** The settings for maximum likelihood estimation (including simulated annealing). */ typedef struct{ double *starting_pt; /**< An array of doubles (e.g., (double*){2,4,6,8}) suggesting a starting point. If NULL, use an all-ones vector. If \c startv is a \c gsl_vector and is not a view of a matrix, use .starting_pt=startv->data.*/ char *method; /**< The method to be used for the optimization. All strings are case-insensitive. Name Notes If Newton proposes stepping outside of a certain interval, use an alternate method. See the GSL manual for discussion.
String
\li "NM simplex": Nelder-Mead simplex. Does not use gradients at all. Can sometimes get stuck.
\li "FR cg": Conjugate gradient (Fletcher-Reeves) (default). CG methods use derivatives. They converge to the optimum of a quadratic function in one step; performance degrades as the objective diverges from quadratic.
\li "BFGS cg": Broyden-Fletcher-Goldfarb-Shanno conjugate gradient.
\li "PR cg": Polak-Ribiere conjugate gradient.
\li "Annealing": \ref simanneal "simulated annealing". Slow but works for objectives of arbitrary complexity, including stochastic objectives.
\li "Newton": Newton's method. Search by finding a root of the derivative. Expects that the gradient is reasonably well-behaved.
\li "Newton hybrid": Newton's method/gradient descent hybrid. Find a root of the derivative via the Hybrid method.
\li "Newton hybrid no scale": Newton's method/gradient descent hybrid with spherical scale. As above, but use a simplified trust region.
*/ double step_size, /**< The initial step size. */ tolerance, /**< The precision the minimizer uses in its stopping rule. Only vaguely related to the precision of the actual MLE.*/ delta; int max_iterations; /**< Ignored by simulated annealing. Other methods halt (and set the \c "status" element of the output estimate's info page) if they do this many iterations without finding an optimum. */ int verbose; /**< Give status updates as we go. This is orthogonal to the apop_opts.verbose setting. */ double dim_cycle_tolerance; /**< If zero (the default), the usual procedure. If \f$>0\f$, cycle across dimensions: fix all but the first dimension at the starting point, optimize only the first dim. Then fix the all but the second dim, and optimize the second dim. Continue through all dims, until the log likelihood at the outset of one cycle through the dimensions is within this amount of the previous cycle's log likelihood. There will be at least two cycles. */ //simulated annealing (also uses step_size); int n_tries, iters_fixed_T; double k, t_initial, mu_t, t_min ; gsl_rng *rng; apop_data **path; /**< If not \c NULL, record each vector tried by the optimizer as one row of this \ref apop_data set. Each row of the \c matrix element holds the vector tried; the corresponding element in the \c vector is the evaluated value at that vector (after out-of-constraints penalties have been subtracted). A new \ref apop_data set is allocated at the pointer you send in. This data set has no names; add them as desired. For a sample use, see \ref maxipage. */ } apop_mle_settings; /** Settings for least-squares type models such as \ref apop_ols or \ref apop_iv */ typedef struct { int destroy_data; /**< If \c 'y', then the input data set may be normalized or otherwise mangled. */ apop_data *instruments; /**< Use for the \ref apop_iv regression, qv. */ char want_cov; /**< Deprecated. Please use \ref apop_parts_wanted_settings. */ char want_expected_value; /**< Deprecated. 
Please use \ref apop_parts_wanted_settings. */ apop_model *input_distribution; /**< The distribution of \f$P(Y|X)\f$ is specified by the model holding this struct, but the distribution of \f$X\f$ needs to be specified as well for any calculation of \f$P(Y)\f$. See the notes in the RNG section of the \ref apop_ols documentation. */ } apop_lm_settings; /** The default is for the estimation routine to give some auxiliary information, such as a covariance matrix, predicted values, and common hypothesis tests. Some uses of a model depend on these items, but if they are a waste of time for your purposes, this settings group gives a quick way to bypass them all. Adding this settings group to your model without changing any default values--- \code Apop_model_add_group(your_model, apop_parts_wanted); \endcode ---will turn off all of the auxiliary calculations covered, because the default value for all the switches is 'n', indicating that all elements are not wanted. From there, you can change some of the default 'n's to 'y's to retain some but not all auxiliary elements. If you just want the parameters themselves and the covariance matrix: \code Apop_model_add_group(your_model, apop_parts_wanted, .covariance='y'); \endcode \li Not all models support this, although the models with especially compute-intensive auxiliary info do (e.g., the maximum likelihood estimation system). Check the model's documentation. \li Tests may depend on covariance, so .covariance='n', .tests='y' may be treated as .covariance='y', .tests='y'. */ typedef struct { //init/copy/free are in apop_mle.c char covariance; /*< If 'y', calculate the covariance matrix. Default 'n'. */ char predicted;/*< If 'y', calculate the predicted values. This is typically as many items as rows in your data set. Default 'n'. */ char tests;/*< If 'y', run any hypothesis tests offered by the model's estimation routine. Default 'n'. */ char info;/*< If 'y', add an info table with elements such as log likelihood or AIC. 
Default 'n'. */ } apop_parts_wanted_settings; /** For use by \ref apop_cdf when the CDF is generated via Monte Carlo methods. */ typedef struct { int draws; /**< For random draw methods, how many draws? Default: 10,000.*/ gsl_rng *rng; /**< For random draw methods. See \ref apop_rng_get_thread for the default. */ gsl_matrix *draws_made; /**< A store of random draws used to calculate the CDF. These need only be generated once, and so are stored here. */ int *draws_refcount; /**< For internal use.*/ } apop_cdf_settings; /** Settings for getting parameter models (i.e. the distribution of parameter estimates) */ typedef struct { apop_model *base; int index; gsl_rng *rng; int draws; } apop_pm_settings; /** Settings to accompany the \ref apop_pmf. */ typedef struct { gsl_vector *cmf; /**< A cumulative mass function, for the purposes of making random draws.*/ char draw_index; /**< If \c 'y', then draws from the PMF return the integer index of the row drawn. If \c 'n' (the default), then return the data in the vector/matrix elements of the data set. */ long double total_weight; /**< Keep the total weight, in case the input weights aren't normalized to sum to one. */ int *cmf_refct; /**< For internal use, so I can garbage-collect the CMF when needed. */ } apop_pmf_settings; /** Settings for the \ref apop_kernel_density model. */ typedef struct{ apop_data *base_data; /**< The data that will be smoothed by the KDE. */ apop_model *base_pmf; /**< I actually need the data in a \ref apop_pmf. You can give that to me explicitly, or I can wrap the .base_data in a PMF. */ apop_model *kernel; /**< The distribution to be centered over each data point. Default, \ref apop_normal with std dev 1. */ void (*set_fn)(apop_data*, apop_model*); /**< The function I will use for each data point to center the kernel over each point. Default: set the upper-left element of the parameter set to the upper-left scalar in the data: apop_data_set(m->parameters, .val= apop_data_get(in));.
*/ int own_pmf, own_kernel; /**< For internal use only. */ }apop_kernel_density_settings; struct apop_mcmc_settings; /** A proposal distribution for \ref apop_mcmc_settings and its accompanying functions and information. By default, these will be \ref apop_multivariate_normal models. The \c step_fn and \c adapt_fn have to be written around the model and your preferences. For the defaults, the step function recenters the mean of the distribution around the last accepted proposal, and the adapt function widens \f$\Sigma\f$ for the Normal if the accept rate is too low; narrows it if the accept rate is too large. You may provide an array of proposals. The length of the list of proposals must match the number of chunks, as per the \c gibbs_chunks setting in the \ref apop_mcmc_settings group that the array of proposals is a part of. Each proposal must be initialized to include all elements, and the step and adapt functions probably have to be written anew for each type of model. */ typedef struct apop_mcmc_proposal_s { apop_model *proposal; /**< The distribution from which test parameters will be drawn. After getting the draw using the \c draw method of the proposal, the base model's \c parameters element is filled using \ref apop_data_fill. If \c NULL, \ref apop_model_metropolis will use a Multivariate Normal with the appropriate dimension, mean zero, and covariance matrix I. If not \c NULL, be sure to parameterize your model with an initial position. */ void (*step_fn)(double const *, struct apop_mcmc_proposal_s*, struct apop_mcmc_settings *); /**< Modifies the parameters of the proposal distribution given a successful draw. Typically, this function writes the drawn data point to the parameter set. If the draw is a scalar, the default function sets the 0th element of the model's \c parameter set with the draw (works for the \ref apop_normal and other models). If the draw has multiple dimensions, they are all copied to the parameter set, which must have the same size. 
*/ int (*adapt_fn)(struct apop_mcmc_proposal_s *ps, struct apop_mcmc_settings *ms); /**< Called every step, to adapt the proposal distribution using information to this point in the chain. */ int accept_count, reject_count; /**< If there are multiple \ref apop_mcmc_proposal_s structs for multiple chunks, These count accepts/rejects for this chunk. The \ref apop_mcmc_settings group has a total for the aggregate across all chunks. */ } apop_mcmc_proposal_s; /** Method settings for a model to be put through Bayesian updating. */ typedef struct apop_mcmc_settings { apop_data *data; long int periods; /**< For how many steps should the MCMC chain run? */ double burnin; /**< What percentage of the periods should be ignored as initialization. That is, this is a number between zero and one. */ int histosegments; /**< If outputting a binned PMF, how many segments should it have? */ double last_ll; /**< If you have already run MCMC, the last log likelihood in the chain.*/ apop_model *pmf; /**< If you have already run MCMC, I keep a pointer to the model so far here. Use \ref apop_model_metropolis_draw to get one more draw.*/ apop_model *base_model; /**< The model you provided with a \c log_likelihood or \c p element (which need not sum to one). You do not have to set this: if it is \c NULL on input to \ref apop_model_metropolis, I will fill it in.*/ apop_mcmc_proposal_s *proposals; /**< The list of proposals. You can probably use the default of adaptive multivariate normals. See the \ref apop_mcmc_proposal_s struct for details. */ int proposal_count; /**< The number of proposal sets; see \c gibbs_chunks below. */ double target_accept_rate; /**< The desired acceptance rate, for use by adaptive proposals. 
Default: .35 */ int accept_count; /**< After calling \ref apop_model_metropolis, this will have the number of accepted proposals.*/ int reject_count; /**< After calling \ref apop_model_metropolis, this will have the number of rejected proposals.*/ char gibbs_chunks; /**< See the \ref apop_model_metropolis documentation for discussion. \c 'a': One step draws and accepts/rejects all parameters as a unit
\c 'b': draw in blocks: the vector is a block, the matrix is a separate block, the weights are a separate block, and so on through every page of the model parameters. Each block of parameters is drawn and accepted/rejected as a unit.
\c '1': draw each parameter and accept/reject separately. One MCMC step consists of a set of draws for every parameter.
*/ size_t *block_starts; /**< For internal use */ int block_count, proposal_is_cp; /**< For internal use. */ char start_at; /**< If \c '1' (the default), start with a first proposal of all 1s. Even when this is a far-from-useful starting point, MCMC typically does a good job of crawling to better spots early in the chain.
The default when this is unset is to start at the \c parameters of the \ref apop_model sent in to \ref apop_model_metropolis.*/ void (*base_step_fn)(double const *, struct apop_mcmc_proposal_s*, struct apop_mcmc_settings *); /**< If an \ref apop_mcmc_proposal_s struct has \c NULL \c step_fn, use this. If you don't want a step function, set this to a do-nothing function. */ int (*base_adapt_fn)(struct apop_mcmc_proposal_s *ps, struct apop_mcmc_settings *ms); /**< If a \ref apop_mcmc_proposal_s has \c NULL \c adapt_fn, use this. If you don't want an adapt function, set this to a do-nothing function.*/ } apop_mcmc_settings; /** \cond doxy_ignore */ //Loess, including the old FORTRAN-to-C. struct loess_struct { struct { long n, p; double *y, *x; double *weights; } in; struct { double span; long degree; long normalize; long parametric[8]; long drop_square[8]; char *family; } model; struct { char *surface; char *statistics; double cell; char *trace_hat; long iterations; } control; struct { long *parameter, *a; double *xi, *vert, *vval; } kd_tree; struct { double *fitted_values; double *fitted_residuals; double enp, s; double one_delta, two_delta; double *pseudovalues; double trace_hat; double *diagonal; double *robust; double *divisor; } out; }; /** \endcond */ //End of Doxygen ignore. /** The code for the loess system is based on FORTRAN code from 1988, overhauled in 1992, linked in to Apophenia in 2009. The structure that does all the work, then, is a \c loess_struct that you should basically take as opaque. The useful settings from that struct re-appear in the \ref apop_loess_settings struct so you can set them directly, and then the settings init function will copy your preferences into the working struct. The documentation for the elements is cut/pasted/modified from Cleveland, Grosse, and Shyu. */ typedef struct { apop_data *data; struct loess_struct lo_s; /**< .data: Mandatory. Your input data set. .lo_s.model.span: smoothing parameter. Default is 0.75. 
.lo_s.model.degree: overall degree of locally-fitted polynomial. 1 is locally-linear fitting and 2 is locally-quadratic fitting. Default is 2. .lo_s.normalize: Should numeric predictors be normalized? If \c 'y' - the default - the standard normalization is used. If \c 'n', no normalization is carried out. \c .lo_s.model.parametric: for two or more numeric predictors, this argument specifies those variables that should be conditionally-parametric. The argument should be a logical vector of length \c p, specified in the order of the predictor group ordered in \c x. Default is a vector of 0's of length \c p. \c .lo_s.model.drop_square: for cases with degree = 2, and with two or more numeric predictors, this argument specifies those numeric predictors whose squares should be dropped from the set of fitting variables. The method of specification is the same as for parametric. Default is a vector of 0's of length p. \c .lo_s.model.family: the assumed distribution of the errors. The values may be "gaussian" or "symmetric". The first value is the default. If the second value is specified, a robust fitting procedure is used. \c lo_s.control.surface: determines whether the fitted surface is computed "directly" at all points or whether an "interpolation" method is used. The default, interpolation, is what most users should use unless special circumstances warrant. \c lo_s.control.statistics: determines whether the statistical quantities are computed "exactly" or approximately, where "approximate" is the default. The former should only be used for testing the approximation in statistical development and is not meant for routine usage because computation time can be horrendous. \c lo_s.control.cell: if interpolation is used to compute the surface, this argument specifies the maximum cell size of the k-d tree. Suppose k = floor(n*cell*span) where n is the number of observations. Then a cell is further divided if the number of observations within it is greater than or equal to k. 
default=0.2 \c lo_s.control.trace_hat: Options are "approximate", "exact", and "wait.to.decide". When lo_s.control.surface is "approximate", determines the computational method used to compute the trace of the hat matrix, which is used in the computation of the statistical quantities. If "exact", an exact computation is done; normally this goes quite fast on the fastest machines until n, the number of observations is 1000 or more, but for very slow machines, things can slow down at n = 300. If "wait.to.decide" is selected, then a default is chosen in loess(); the default is "exact" for n < 500 and "approximate" otherwise. If surface is "exact", an exact computation is always done for the trace. Set trace_hat to "approximate" for large dataset will substantially reduce the computation time. \c lo_s.model.iterations: if family is "symmetric", the number of iterations of the robust fitting method. Default is 0 for lo_s.model.family = gaussian; 4 for family=symmetric. That's all you can set. Here are some output parameters: \c fitted_values: fitted values of the local regression model \c fitted_residuals: residuals of the local regression fit \c enp: equivalent number of parameters. \c s: estimate of the scale of the residuals. \c one_delta: a statistical parameter used in the computation of standard errors. \c two_delta: a statistical parameter used in the computation of standard errors. \c pseudovalues: adjusted values of the response when robust estimation is used. \c trace_hat: trace of the operator hat matrix. \c diagonal: diagonal of the operator hat matrix. \c robust: robustness weights for robust fitting. \c divisor: normalization divisor for numeric predictors. */ int want_predict_ci; /**< If \c 'y' (the default), calculate the confidence bands for predicted values */ double ci_level; /**< If running a prediction, the level at which to calculate the confidence interval. 
default: 0.95 */ } apop_loess_settings; /** \cond doxy_ignore */ typedef struct point { /* a point in the x,y plane */ double x,y; /* x and y coordinates */ double ey; /* exp(y-ymax+YCEIL) */ double cum; /* integral up to x of rejection envelope */ int f; /* is y an evaluated point of log-density */ struct point *pl,*pr; /* envelope points to left and right of x */ } POINT; /* This includes the envelope info and the metropolis steps. */ typedef struct { /* attributes of the entire rejection envelope */ int cpoint; /* number of POINTs in current envelope */ int npoint; /* max number of POINTs allowed in envelope */ double ymax; /* the maximum y-value in the current envelope */ POINT *p; /* start of storage of envelope POINTs */ double *convex; /* adjustment for convexity */ double metro_xprev; /* previous Markov chain iterate */ double metro_yprev; /* current log density at xprev */ } arms_state; /** \endcond */ /** For use with \ref apop_arms_draw, to perform derivative-free adaptive rejection sampling with metropolis step. That function generates default values for this struct if you do not attach one to the model beforehand, via a form like apop_model_add_group(your_model, apop_arms, .model=your_model, .xl=8, .xr =14);. If you initialize it manually via \ref apop_settings_add_group, the \c model element is mandatory; you'll get a run-time complaint if you forget it. */ typedef struct { double *xinit; /**< A double* giving starting values for x in ascending order, e.g., (double *){1, 10, 100}. . Default: -1, 0, 1. If this isn't \c NULL, I need at least three items, and the length in \c ninit. */ double xl; /**< Left bound. If you don't give me one, I'll use min[min(xinit)/10, min(xinit)*10].*/ double xr; /**< Right bound. If you don't give me one, I'll use max[max(xinit)/10, max(xinit)*10]. */ double convex; /**< Adjustment for convexity */ int ninit; /**< The length of \c xinit.*/ int npoint; /**< Maximum number of envelope points. 
I \c malloc space for this many doubles at the outset. Default = 1e5. */ char do_metro; /**< Set to \c 'y' if the metropolis step is required (i.e., if you're not sure if the function is log-concave).*/ double xprev; /**< For internal use; please ignore. Previous value from Markov chain. */ int neval; /**< On exit, the number of function evaluations performed */ arms_state *state; apop_model *model; /**< The model from which to draw. Mandatory. Must have either a \c log_likelihood or \c p method.*/ } apop_arms_settings; /** The settings to accompany the \ref apop_cross model, representing the cross product of two models (or, via recursion, a list of models of arbitrary length).*/ typedef struct { char *splitpage; /**< The name of the page at which to split the data. If \c NULL, I send the entire data set to both models as needed. */ apop_model *model1; /**< The first model in the stack.*/ apop_model *model2; /**< The second model.*/ } apop_cross_settings; typedef struct { apop_data *(*base_to_transformed)(apop_data*); /**< The function to transform the model from pre-transform space to post-transform space. */ apop_data *(*transformed_to_base)(apop_data*); /**< The function to transform from post-transform space back to pre-transform space. If this function does not exist, using a Jacobian-based transformation is probably not mathematically correct. */ double (*jacobian_to_base)(apop_data*); /**< The derivative of the \c transformed_to_base function. */ apop_model *base_model; /**< The pre-transformation model. */ } apop_coordinate_transform_settings;/**< Settings for an \ref apop_coordinate_transform model; see its documentation for notes and an example. */ /** For use with the \ref apop_dconstrain model. See its documentation for an example. */ typedef struct { apop_model *base_model; /**< The model, before constraint. */ double (*constraint)(apop_data *, apop_model *); /**< The constraint. Return 1 if the data is in the constraint; zero if out. 
*/ double (*scaling)(apop_model *); /**< Optional. Return the percent of the model density inside the constraint. */ gsl_rng *rng; /**< If you don't provide a \c scaling function, I calculate the in-constraint model density via random draws. If no \c rng is provided, I use a default RNG; see \ref apop_rng_get_thread. */ int draw_ct; /**< How many draws to make for calculating the in-constraint model density via random draws. Current default: 1e4. */ } apop_dconstrain_settings; typedef struct { apop_model *generator_m; apop_model *ll_m; int draw_ct; } apop_composition_settings;/**< All of the elements of this struct should be considered private.*/ /** For mixture distributions, typically set up using \ref apop_model_mixture. See \ref apop_mixture for discussion. Please consider all elements but \c model_list and \c weights as private and subject to change. See the examples for use of these elements. */ typedef struct { gsl_vector *weights; /**< The likelihood of a draw from each component. */ apop_model **model_list; /**< A \c NULL-terminated list of component models. */ int model_count; int *param_sizes; /**< The number of parameters for each model. Useful for unpacking the params. */ apop_model *cmf; /**< For internal use by the draw method. */ int *cmf_refct; /**< For internal use, so I can garbage-collect the CMF when needed. */ } apop_mixture_settings; //Models built via call to apop_model_copy_set. #define apop_model_dcompose(...) Apop_model_set_settings(apop_composition, __VA_ARGS__) #define apop_model_dconstrain(...) Apop_model_set_settings(apop_dconstrain, __VA_ARGS__) #define apop_model_coordinate_transform(...) Apop_model_set_settings(apop_coordinate_transform, __VA_ARGS__) //Doxygen drops whatever is after these declarations, so I put them last. 
Apop_settings_declarations(apop_lm) Apop_settings_declarations(apop_pm) Apop_settings_declarations(apop_pmf) Apop_settings_declarations(apop_mle) Apop_settings_declarations(apop_cdf) Apop_settings_declarations(apop_arms) Apop_settings_declarations(apop_mcmc) Apop_settings_declarations(apop_loess) Apop_settings_declarations(apop_cross) Apop_settings_declarations(apop_mixture) Apop_settings_declarations(apop_dconstrain) Apop_settings_declarations(apop_composition) Apop_settings_declarations(apop_parts_wanted) Apop_settings_declarations(apop_kernel_density) Apop_settings_declarations(apop_coordinate_transform) #ifdef __cplusplus } #endif /** @} */ //End doxygen's all_public grouping //Part of the intent of a convenience header like this is that you //don't have to remember what else you're including. So here are //some other common GSL headers: #include #include #include #include #include #include apophenia-1.0+ds/apop_arms.c000066400000000000000000000455301262736346100160660ustar00rootroot00000000000000/** \file adaptive rejection metropolis sampling */ /** (C) Wally Gilks; see documentation below for details. Adaptations for Apophenia (c) 2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #define XEPS 0.00001 /* critical relative x-value difference */ #define YEPS 0.1 /* critical y-value difference */ #define EYEPS 0.001 /* critical relative exp(y) difference */ #define YCEIL 50. /* maximum y avoiding overflow in exp(y) */ /* declarations for functions defined in this file (minus those in arms.h). 
*/ void invert(double prob, arms_state *env, POINT *p); int test(arms_state *state, POINT *p, apop_arms_settings *params, gsl_rng *r); int update(arms_state *state, POINT *p, apop_arms_settings *params); static void cumulate(arms_state *env); int meet (POINT *q, arms_state *state, apop_arms_settings *params); double area(POINT *q); double expshift(double y, double y0); double logshift(double y, double y0); double perfunc(apop_arms_settings*, double x); void display(FILE *f, arms_state *env, apop_arms_settings *); int initial (apop_arms_settings* params, arms_state *state); Apop_settings_copy(apop_arms, out->state = malloc(sizeof(arms_state)); *out->state = *in->state; ) Apop_settings_free(apop_arms, if (in->state){ free(in->state->p); free(in->state); } ) Apop_settings_init(apop_arms, if ((in.xl || in.xr) && !in.xinit) out->xinit = (double []) {in.xl+GSL_DBL_EPSILON, (in.xl+in.xr)/2., in.xr-GSL_DBL_EPSILON}; else{ //Apop_varad_set(xinit, ((double []) {0, 0.5, 1})); Apop_varad_set(xinit, ((double []) {-1, 0, 1})); } Apop_varad_set(ninit, 3); Apop_varad_set(xl, GSL_MIN(out->xinit[0]/10., out->xinit[0]*10)-.1); Apop_varad_set(xr, GSL_MAX(out->xinit[out->ninit-1]/10., out->xinit[out->ninit-1]*10)+.1); Apop_varad_set(convex, 0); Apop_varad_set(npoint, 100); Apop_varad_set(do_metro, 'y'); Apop_varad_set(xprev, (out->xinit[0]+out->xinit[out->ninit-1])/2.); Apop_varad_set(neval, 1000); Apop_assert(out->model, "the model input (e.g.: .model = parent_model) is mandatory."); // allocate the state out->state = malloc(sizeof(arms_state)); Apop_assert(out->state, "Malloc failed. Out of memory?"); *out->state = (arms_state) { }; int err = initial(out, out->state); Apop_assert_c(!err, NULL, 0, "init failed, error %i. Returning NULL", err); /* finish setting up metropolis struct (can only do this after setting up env) */ if(out->do_metro=='y'){ /* I don't understand why this is needed. 
if((params->xprev < params->xl) || (params->xprev > params->xr)) apop_assert(0, 1007, 0, 's', "previous Markov chain iterate out of range")*/ out->state->metro_xprev = out->xprev; out->state->metro_yprev = perfunc(out,out->xprev); assert(isfinite(out->state->metro_xprev)); assert(isfinite(out->state->metro_yprev)); } ) void distract_doxygen_arms(){/*Doxygen gets thrown by the settings macros. This decoy function is a workaround. */} /** Adaptive rejection Metropolis sampling, to make random draws from a univariate distribution. The author, Wally Gilks, explains on http://www.amsta.leeds.ac.uk/~wally.gilks/adaptive.rejection/web_page/Welcome.html , that ``ARS works by constructing an envelope function of the log of the target density, which is then used in rejection sampling (see, for example, Ripley, 1987). Whenever a point is rejected by ARS, the envelope is updated to correspond more closely to the true log density, thereby reducing the chance of rejecting subsequent points. Fewer ARS rejection steps implies fewer point-evaluations of the log density.'' \li It accepts only functions with univariate inputs. I.e., it will put a single value into a 1x1 \ref apop_data set, and then evaluate the log likelihood at that point. For multivariate situations, see \ref apop_model_metropolis. \li It is currently the default for the \ref apop_draw function given a univariate model, so you can just call that if you prefer. \li There are a great number of parameters, in the \c apop_arms_settings structure. The structure also holds a history of the points tested to date. That means that the system will be more accurate as more draws are made. It also means that if the parameters change, or you use \ref apop_model_copy, you should call Apop_settings_rm_group(your_model, apop_arms) to clear the model of points that are not valid for a different situation. 
*/
int apop_arms_draw (double *out, gsl_rng *r, apop_model *m){
    apop_arms_settings *params = Apop_settings_get_group(m, apop_arms);
    if (!params) params = Apop_model_add_group(m, apop_arms, .model=m);
    POINT pwork;   /* a working point, not yet incorporated in envelope */
    int msamp=0;   /* the number of x-values currently sampled */
    arms_state *state = params->state;
    /* now do adaptive rejection */
    do {
        // Sample a new point from piecewise exponential envelope
        double prob = gsl_rng_uniform(r);
        /* get x-value corresponding to a cumulative probability prob */
        assert(isfinite(state->p->x));
        assert(isfinite(state->p->y));
        invert(prob,state,&pwork);
        /* perform rejection (and perhaps metropolis) tests */
        int i = test(state,&pwork, params, r);
        if (i == 1){ // point accepted
            Apop_notify(3, " point accepted.");
            *out = pwork.x;
            assert(isfinite(pwork.x));
            return 0;
        } else Apop_stopif(i!=0, return 1,-5, "envelope error - violation without metropolis");
        msamp ++;
        Apop_notify(3, " point rejected.");
    } while (msamp < 1e3);
    Apop_notify(1, "I just rejected 1,000 samples. 
Something is wrong.");
    return 0;
}

int initial (apop_arms_settings* params, arms_state *env){
    // to set up initial envelope
    POINT *q;
    int mpoint = 2*params->ninit + 1;

    Apop_assert_c(params->ninit>=3, 1001, 0, "too few initial points");
    Apop_assert_c(params->npoint >= mpoint, 1002, 0, "too many initial points");
    Apop_assert_c((params->xinit[0] >= params->xl) && (params->xinit[params->ninit-1] <= params->xr), 1003, 0, "initial points do not satisfy bounds");
    for(int i=1; i<params->ninit; i++)
        Apop_assert_c(params->xinit[i] > params->xinit[i-1], 1004, 0, "data not ordered");
    Apop_assert_c(params->convex >= 0.0, 1008, 0, "negative convexity parameter");

    env->convex = &params->convex; // copy convexity address to env
    params->neval = 0; // initialise current number of function evaluations

    /* set up space for envelope POINTs */
    env->npoint = params->npoint;
    env->p = malloc(params->npoint*sizeof(POINT));
    Apop_assert(env->p, "malloc of env->p failed. Out of space?");

    /* set up envelope POINTs */
    q = env->p;
    q->x = params->xl; // left bound
    q->f = 0;
    q->pl = NULL;
    q->pr = q+1;
    for(int j=1, k=0; j<mpoint-1; j++){
        q++;
        if(j%2){ // point on log density
            q->x = params->xinit[k++];
            q->y = perfunc(params,q->x);
            Apop_assert(isfinite(q->x), "the initial param is %g", q->x);
            Apop_assert(isfinite(q->y), "f(an initial parameter)= %g", q->y);
            q->f = 1;
        } else // intersection point
            q->f = 0;
        q->pl = q-1;
        q->pr = q+1;
    }
    /* right bound */
    q++;
    q->x = params->xr;
    q->f = 0;
    q->pl = q-1;
    q->pr = NULL;
    assert(isfinite(q->x));

    /* calculate intersection points */
    q = env->p;
    for (int j=0; j<mpoint; j=j+2, q=q+2)
        Apop_assert_c(!meet(q,env, params), 2000, 0, "envelope violation without metropolis");
    cumulate(env); // exponentiate and integrate envelope
    env->cpoint = mpoint; // note number of POINTs currently in envelope
    return 0;
}

void invert(double prob, arms_state *env, POINT *p){
/* to obtain a point corresponding to a given cumulative probability
   prob : cumulative probability under envelope
   *env : envelope attributes
   *p   : a working POINT to hold the sampled value */
    double u,xl=0,xr=0,yl,yr,eyl,eyr,prop;

    /* find rightmost point in envelope */
    POINT *q = env->p;
    while(q->pr != NULL)q = q->pr;

    /* find exponential piece containing 
point implied by prob */ u = prob * q->cum; while(q->pl->cum > u)q = q->pl; /* piece found: set left and right POINTs of p, etc. */ p->pl = q->pl; p->pr = q; p->f = 0; p->cum = u; /* calculate proportion of way through integral within this piece */ prop = (u - q->pl->cum) / (q->cum - q->pl->cum); /* get the required x-value */ if (q->pl->x == q->x){ /* interval is of zero length */ p->x = q->x; p->y = q->y; p->ey = q->ey; } else { xl = q->pl->x; xr = q->x; yl = q->pl->y; yr = q->y; eyl = q->pl->ey; eyr = q->ey; if(fabs(yr - yl) < YEPS){ /* linear approximation was used in integration in function cumulate */ if(fabs(eyr - eyl) > EYEPS*fabs(eyr + eyl)) p->x = xl + ((xr - xl)/(eyr - eyl)) * (-eyl + sqrt((1. - prop)*eyl*eyl + prop*eyr*eyr)); else p->x = xl + (xr - xl)*prop; p->ey = ((p->x - xl)/(xr - xl)) * (eyr - eyl) + eyl; p->y = logshift(p->ey, env->ymax); } else { /* piece was integrated exactly in function cumulate */ p->x = xl + ((xr - xl)/(yr - yl)) * (-yl + logshift(((1.-prop)*eyl + prop*eyr), env->ymax)); p->y = ((p->x - xl)/(xr - xl)) * (yr - yl) + yl; p->ey = expshift(p->y, env->ymax); } } assert(isfinite(p->x)); assert(isfinite(p->y)); assert(isfinite(q->x)); assert(isfinite(q->y)); /* guard against imprecision yielding point outside interval */ Apop_stopif( ((p->x < xl) || (p->x > xr)), return,-5, "imprecision yields point outside interval"); } int test(arms_state *env, POINT *p, apop_arms_settings *params, gsl_rng *r){ /* to perform rejection, squeezing, and metropolis tests *env : state data *p : point to be tested */ assert(p->pl && p->pr); double u,y,ysqueez,ynew,yold,znew,zold,w; POINT *ql,*qr; /* for rejection test */ u = gsl_rng_uniform(r) * p->ey; y = logshift(u,env->ymax); if(params->do_metro !='y' && (p->pl->pl != NULL) && (p->pr->pr != NULL)){ /* perform squeezing test */ ql = p->pl->f ? p->pl : p->pl->pl; qr = p->pr->f ? 
p->pr : p->pr->pr; ysqueez = (qr->y * (p->x - ql->x) + ql->y * (qr->x - p->x)) /(qr->x - ql->x); if(y <= ysqueez) // accept point at squeezing step return 1; } /* evaluate log density at point to be tested */ ynew = perfunc(params,p->x); assert(isfinite(p->x)); assert(p->pl && p->pr); Apop_notify(3, "tested (%g, %g); ", p->x, ynew); /* perform rejection test */ if(params->do_metro != 'y' || (params->do_metro == 'y' && (y >= ynew))){ /* update envelope */ p->y = ynew; p->ey = expshift(p->y,env->ymax); p->f = 1; if(update(env,p, params)) Apop_assert_c(0, -1, 0, "envelope violation without metropolis"); /* perform rejection test: accept iff y < ynew */ return (y < ynew); } /* continue with metropolis step */ yold = env->metro_yprev; /* find envelope piece containing metrop->xprev */ ql = env->p; while(ql->pl != NULL) ql = ql->pl; while(ql->pr->x < env->metro_xprev) ql = ql->pr; qr = ql->pr; /* calculate height of envelope at metrop->xprev */ w = (env->metro_xprev - ql->x)/(qr->x - ql->x); zold = ql->y + w*(qr->y - ql->y); znew = p->y; if(yold < zold)zold = yold; if(ynew < znew)znew = ynew; w = ynew-znew-yold+zold; w = GSL_MIN(w, 0.0); w = (w > -YCEIL) ? 
exp(w) : 0.0; u = gsl_rng_uniform(r); if(u > w){ /* metropolis says don't move, so replace current point with previous */ /* markov chain iterate */ p->x = env->metro_xprev; p->y = env->metro_yprev; Apop_notify(3, "metro step (%g) rejected with w=%g, " "ynew=%g, yold=%g, znew = %g, zold=%g; ", p->x, w, ynew, yold, znew, zold); p->ey = expshift(p->y,env->ymax); assert(isfinite(p->x)); assert(isfinite(p->y)); assert(isfinite(p->ey)); p->f = 1; p->pl = ql; p->pr = qr; } else { /* trial point accepted by metropolis, so update previous markov */ /* chain iterate */ env->metro_xprev = p->x; env->metro_yprev = ynew; } return 1; } int update(arms_state *env, POINT *p, apop_arms_settings *params){ /* to update envelope to incorporate new point on log density *env : state information *p : point to be incorporated */ POINT *m,*ql,*qr,*q; if(!(p->f) || (env->cpoint > env->npoint - 2)) /* y-value has not been evaluated or no room for further points */ return 0; // ignore this point /* copy working POINT p to a new POINT q */ q = env->p + env->cpoint++; q->x = p->x; q->y = p->y; q->f = 1; /* allocate an unused POINT for a new intersection */ m = env->p + env->cpoint++; m->f = 0; if((p->pl->f) && !(p->pr->f)){ /* left end of piece is on log density; right end is not */ /* set up new intersection in interval between p->pl and p */ m->pl = p->pl; m->pr = q; q->pl = m; q->pr = p->pr; m->pl->pr = m; q->pr->pl = q; } else if (!(p->pl->f) && (p->pr->f)){ /* left end of interval is not on log density; right end is */ /* set up new intersection in interval between p and p->pr */ m->pr = p->pr; m->pl = q; q->pr = m; q->pl = p->pl; m->pr->pl = m; q->pl->pr = q; } else Apop_stopif(1, return 1,-5, "unexpected event"); // this should be impossible /* now adjust position of q within interval if too close to an endpoint */ ql = q->pl->pl ? q->pl->pl : q->pl; qr = q->pr->pr ? q->pr->pr : q->pr; if (q->x < (1. - XEPS) * ql->x + XEPS * qr->x){ /* q too close to left end of interval */ q->x = (1. 
- XEPS) * ql->x + XEPS * qr->x; q->y = perfunc(params,q->x); } else if (q->x > XEPS * ql->x + (1. - XEPS) * qr->x){ /* q too close to right end of interval */ q->x = XEPS * ql->x + (1. - XEPS) * qr->x; q->y = perfunc(params,q->x); } /* revise intersection points */ if(meet(q->pl,env, params) /* envelope violations without metropolis */ || meet(q->pr,env, params) || (q->pl->pl != NULL && meet(q->pl->pl->pl,env, params)) || (q->pr->pr != NULL && meet(q->pr->pr->pr,env, params))) return 1; /* exponentiate and integrate new envelope */ cumulate(env); return 0; } static void cumulate(arms_state *env){ /* to exponentiate and integrate envelope */ /* *env : envelope attributes */ POINT *q,*qlmost; qlmost = env->p; /* find left end of envelope */ while(qlmost->pl) qlmost = qlmost->pl; /* find maximum y-value: search envelope */ env->ymax = qlmost->y; for(q = qlmost->pr; q != NULL; q = q->pr) if(q->y > env->ymax) env->ymax = q->y; /* exponentiate envelope */ for(q = qlmost; q != NULL; q = q->pr) q->ey = expshift(q->y,env->ymax); /* integrate exponentiated envelope */ qlmost->cum = 0.; for(q = qlmost->pr; q != NULL; q = q->pr) q->cum = q->pl->cum + area(q); } int meet (POINT *q, arms_state *env, apop_arms_settings *params){ /* To find where two chords intersect q : to store point of intersection *env : state attributes */ double gl=0,gr=0,grl=0,dl=0,dr=0; int il=0,ir=0,irl=0; Apop_assert(!(q->f), "error 30: this is not an intersection point."); /* calculate coordinates of point of intersection */ if ((q->pl != NULL) && (q->pl->pl->pl != NULL)){ /* chord gradient can be calculated at left end of interval */ gl = (q->pl->y - q->pl->pl->pl->y)/(q->pl->x - q->pl->pl->pl->x); il = 1; } else // no chord gradient on left il = 0; if ((q->pr != NULL) && (q->pr->pr->pr != NULL)){ /* chord gradient can be calculated at right end of interval */ gr = (q->pr->y - q->pr->pr->pr->y)/(q->pr->x - q->pr->pr->pr->x); ir = 1; } else // no chord gradient on right ir = 0; if ((q->pl != NULL) && 
(q->pr != NULL)){
        /* chord gradient can be calculated across interval */
        grl = (q->pr->y - q->pl->y)/(q->pr->x - q->pl->x);
        irl = 1;
    } else
        irl = 0;

    if(irl && il && (gl<grl)){
        /* convexity on left exceeds current threshold */
        if(params->do_metro !='y') // envelope violation without metropolis
            return 1;
        gl = gl + (1.0 + *(env->convex)) * (grl - gl); // adjust left gradient
    }

    if(irl && ir && (gr>grl)){
        /* convexity on right exceeds current threshold */
        if(params->do_metro !='y') // envelope violation without metropolis
            return 1;
        gr = gr + (1.0 + *(env->convex)) * (grl - gr); // adjust right gradient
    }

    if(il && irl){
        dr = (gl - grl) * (q->pr->x - q->pl->x);
        if(dr < YEPS) // adjust dr to avoid numerical problems
            dr = YEPS;
    }

    if(ir && irl){
        dl = (grl - gr) * (q->pr->x - q->pl->x);
        if(dl < YEPS) // adjust dl to avoid numerical problems
            dl = YEPS;
    }

    if(il && ir && irl){
        /* gradients on both sides */
        q->x = (dl * q->pr->x + dr * q->pl->x)/(dl + dr);
        q->y = (dl * q->pr->y + dr * q->pl->y + dl * dr)/(dl + dr);
    } else if (il && irl){
        /* gradient only on left side, but not right hand bound */
        q->x = q->pr->x;
        q->y = q->pr->y + dr;
    } else if (ir && irl){
        /* gradient only on right side, but not left hand bound */
        q->x = q->pl->x;
        q->y = q->pl->y + dl;
    } else if (il)
        q->y = q->pl->y + gl * (q->x - q->pl->x); // right hand bound
    else if (ir)
        q->y = q->pr->y - gr * (q->pr->x - q->x); // left hand bound
    else
        Apop_assert(0, "error 31: gradient on neither side - should be impossible.");

    if(((q->pl != NULL) && (q->x < q->pl->x)) || ((q->pr != NULL) && (q->x > q->pr->x))){
        Apop_assert(0, "error 32: intersection point outside interval (through imprecision)");
    }
    return 0; // successful exit : intersection has been calculated
}

double area(POINT *q){
/* To integrate piece of exponentiated envelope to left of POINT q */
    if(q->pl == NULL) // this is leftmost point in envelope
        Apop_stopif(1, return GSL_NAN,-5, "leftmost point in envelope");
    if(q->pl->x == q->x) // interval is zero length
        return 0.;
    if (fabs(q->y - q->pl->y) < YEPS) // integrate straight line 
piece return 0.5*(q->ey + q->pl->ey)*(q->x - q->pl->x); // integrate exponential piece return ((q->ey - q->pl->ey)/(q->y - q->pl->y))*(q->x - q->pl->x); } double expshift(double y, double y0) { /* to exponentiate shifted y without underflow */ if (y - y0 > -2.0 * YCEIL) return exp(y - y0 + YCEIL); else return 0.0; } double logshift(double y, double y0){ /* inverse of function expshift */ return (log(y) + y0 - YCEIL); } double perfunc(apop_arms_settings *params, double x){ // to evaluate log density and increment count of evaluations Staticdef( apop_data *, d , apop_data_alloc(1,1)); d->matrix->data[0] = x; double y = apop_log_likelihood(d, params->model); Apop_assert(isfinite(y), "Evaluating the log likelihood at %g returned %g.", x, y); (params->neval)++; // increment count of function evaluations return y; } apophenia-1.0+ds/apop_asst.c000066400000000000000000000474021262736346100160760ustar00rootroot00000000000000 /** \file apop_asst.c The odds and ends bin. Copyright (c) 2005--2007, 2010 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include #include #include #ifdef _OPENMP #include #define omp_threadnum omp_get_thread_num() #else #define omp_threadnum 0 #endif extern char *apop_nul_string; //more efficient than xprintf, but a little less versatile. static void apop_tack_on(char **in, char *addme){ if (!addme) return; size_t inlen = *in? strlen(*in): 0; size_t total_len= inlen + strlen(addme); *in = realloc(*in, total_len+1); strcpy(*in+inlen, addme); } typedef int (*apop_fn_riip)(apop_data*, int, int, void*); /** Join together the \c text grid of an \ref apop_data set into a single string. For example, say that we have a data set with some text: row 0 has \c "a0", \c "b0", \c "c0"; row 2 has \c "a1", \c "b1", \c "c1"; and so on. We would like to produce \code insert into tab values ('a0', 'b0', 'c0'); insert into tab values ('a1', 'b1', 'c1'); ... 
\endcode

This could be sent to an SQL engine to copy the data to a database (but this is just an example for demonstration---use \ref apop_data_print to write to a database table).

To construct this single string from the text grid, we would need to add:
\li before the text, Insert into tab values ('.
\li between each element on a row: ', '
\li between rows: '); \\ninsert into tab values('
\li at the tail end: ');

Thus, do the conversion via:
\code
char *insert_string = apop_text_paste(indata,
        .before="Insert into tab values ('",
        .between="'); \\ninsert into tab values('",
        .between_cols="', '",
        .after="');" );
\endcode

\param strings An \ref apop_data set with a grid of text to be combined into a single string
\param between The text to put in between the rows of the table, such as ", ". (Default is a single space: " ")
\param before The text to put at the head of the string. For the query example, this would be .before="select ". (Default: NULL)
\param after The text to put at the tail of the string. For the query example, .after=" from data_table". (Default: NULL)
\param between_cols The text to insert between columns of text. See below for an example (Default is set to equal .between)
\param prune If you don't want to use the entire text set, you can provide a function to indicate which elements should be pruned out. Some examples:
\code
//Just use column 3
int is_not_col_3(apop_data *indata, int row, int col, void *ignore){ return col!=3; }

//Jump over blanks as if they don't exist.
int is_blank(apop_data *indata, int row, int col, void *ignore){ return strlen(indata->text[row][col])==0; }
\endcode
\param prune_parameter A void pointer to pass to your \c prune function.
\return A single string with the elements of the \c strings table joined as per your specification. Allocated by the function, to be freed by you if desired. 
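
As a rough, library-free illustration of how the four delimiters combine, the sketch below joins a row-major grid of C strings. The helpers \c tack_on and \c paste_grid are hypothetical stand-ins written for this example: they are not part of Apophenia, they take a plain array rather than an \ref apop_data set, and they omit the \c prune step.

```c
#include <stdlib.h>
#include <string.h>

// Append addme to the heap string *in, growing it as needed. NULL is a no-op.
static void tack_on(char **in, const char *addme){
    if (!addme) return;
    size_t inlen = *in ? strlen(*in) : 0;
    char *grown = realloc(*in, inlen + strlen(addme) + 1);
    if (!grown) return;            // out of memory: leave *in unchanged
    strcpy(grown + inlen, addme);
    *in = grown;
}

// Join a rows-by-cols grid of strings stored row-major in cells.
// between goes between rows; between_cols goes between cells in one row.
// The caller frees the returned string.
char *paste_grid(const char **cells, int rows, int cols, const char *before,
                 const char *between, const char *between_cols, const char *after){
    char *out = NULL;
    tack_on(&out, before);
    for (int i = 0; i < rows; i++){
        if (i > 0) tack_on(&out, between);
        for (int j = 0; j < cols; j++){
            tack_on(&out, cells[i*cols + j]);
            if (j < cols - 1) tack_on(&out, between_cols);
        }
    }
    tack_on(&out, after);
    return out;
}
```

With a 2x3 grid holding a0 through c1, the call paste_grid(cells, 2, 3, "(", "); (", ", ", ")") would produce a parenthesized, comma-separated group per row.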
\li If the table of strings is \c NULL or has no text, the output string will have only the .before and .after parts with nothing in between. \li if apop_opts.verbose >=3, then print the pasted text to stderr. \li It is sometimes useful to use \c Apop_r and \c Apop_rs to get a view of only one or a few rows in conjunction with this function. \li This function uses the \ref designated syntax for inputs. This sample snippet generates the SQL for a query using a list of column names (where the query begins with select , ends with from datatab, and has commas in between each element), re-processes the same list to produce the head of an HTML table, then produces the body of the table with the query result. \include sql_to_html.c */ #ifdef APOP_NO_VARIADIC char * apop_text_paste(apop_data const *strings, char *between, char *before, char *after, char *between_cols, apop_fn_riip prune, void *prune_parameter){ #else apop_varad_head(char *, apop_text_paste){ apop_data const *apop_varad_var(strings, NULL); char *apop_varad_var(between, " "); char *apop_varad_var(before, NULL); char *apop_varad_var(after, NULL); char *apop_varad_var(between_cols, between); apop_fn_riip apop_varad_var(prune, NULL); void *apop_varad_var(prune_parameter, NULL); return apop_text_paste_base(strings, between, before, after, between_cols, prune, prune_parameter); } char * apop_text_paste_base(apop_data const *strings, char *between, char *before, char *after, char *between_cols, apop_fn_riip prune, void *prune_parameter){ #endif char *prior_line=NULL, *oneline=NULL, *out = before ? strdup(before) : NULL; for (int i=0; i< ((!strings || !*strings->textsize)? 
0 : *strings->textsize); i++){
        free(oneline);
        oneline = NULL;
        for (int j=0; j< strings->textsize[1]; j++){
            if (prune && !prune((apop_data*)strings, i, j, prune_parameter)) continue;
            apop_tack_on(&oneline, strings->text[i][j]);
            if (j < strings->textsize[1]-1) apop_tack_on(&oneline, between_cols);
        }
        apop_tack_on(&out, prior_line);
        if (prior_line && oneline) apop_tack_on(&out, between);
        free(prior_line);
        prior_line=oneline ? strdup(oneline): NULL;
        //if (i < strings->textsize[0]-1) apop_tack_on(&out, between);
        //if (oneline) apop_tack_on(&out, oneline);
    }
    apop_tack_on(&out, oneline); //the final one never got a chance to be prior_line
    apop_tack_on(&out, after);
    Apop_notify(3, "%s", out);
    return out;
}

/** Calculate \f$\sum_{n=1}^N {1\over n^s}\f$

\li There are no doubt efficient shortcuts to doing this, but I use brute force. [Though Knuth's Art of Computer Programming v1 doesn't offer anything, which is a strong indication of nonexistence.] To speed things along, I save the results so that they can just be looked up should you request the same calculation.
\li If \c N is zero or negative, return NaN. Notify the user if apop_opts.verbose >=0

For example:
\include test_harmonic.c
*/
long double apop_generalized_harmonic(int N, double s){
/* Each row in the saved-results structure is an \f$s\f$, and each column is \f$1\dots n\f$, up to the largest \f$n\f$ calculated to date. When reading the code, remember that the zeroth element holds the value for N=1, and so on. */
    Apop_stopif(N<=0, return GSL_NAN, 0, "N is %i, but must be greater than 0.", N);
    static double * eses = NULL;
    static int * lengths= NULL;
    static int count = 0;
    static long double ** precalced=NULL;
    int old_len, i;
    OMP_critical(generalized_harmonic)
    { //Due to memoization, this can't parallelize.
        for (i=0; i< count; i++)
            if (eses == NULL || eses[i] == s)
                break;
        if (i == count){ //you need to build the vector from scratch. 
count ++; i = count - 1; precalced = realloc(precalced, sizeof (long double*) * count); lengths = realloc(lengths, sizeof (int*) * count); eses = realloc(eses, sizeof (double) * count); precalced[i] = malloc(sizeof(long double) * N); lengths[i] = N; eses[i] = s; precalced[i][0] = 1; old_len = 1; } else { //then you found it. old_len = lengths[i]; } if (N-1 >= old_len){ //It's there, but you need to extend what you have. precalced[i] = realloc(precalced[i], sizeof(long double) * N); for (int j = old_len; jprintf-style arguments. E.g., \code char filenames[] = "apop_asst.c apop_asst.o" apop_system("ls -l %s", filenames); \endcode \return The return value of the \c system() call. */ int apop_system(const char *fmt, ...){ char *q; va_list argp; va_start(argp, fmt); Apop_stopif(vasprintf(&q, fmt, argp)==-1, return -1, 0, "Trouble writing to a string."); va_end(argp); int out = system(q); free(q); return out; } static int count_parens(const char *string){ int out = 0; int last_was_backslash = 0; for(const char *step =string; *step !='\0'; step++){ if (*step == '\\' && !last_was_backslash){ last_was_backslash = 1; continue; } if (*step == ')' && !last_was_backslash) out++; last_was_backslash = 0; } return out; } /** Extract subsets from a string via regular expressions. This function takes a regular expression and repeatedly applies it to an input string. It returns the count of matches, and optionally returns the matches themselves organized into the \c text grid of an \ref apop_data set. \li There are three common flavors of regular expression: Basic, Extended, and Perl-compatible (BRE, ERE, PCRE). I use EREs, as per the specs of your C library, which should match POSIX's ERE specification. For example, "p.val" will match "P value", "p.value", "p values" (and even "tempeval", so be careful). If you give a non-\c NULL address in which to place a table of paren-delimited substrings, I'll return them as a row in the text element of the returned \ref apop_data set. 
I'll return all the matches, filling the first row with substrings from the first application of your regex, then filling the next row with another set of matches (if any), and so on to the end of the string. Useful when parsing a list of items, for example. \param string The string to search (no default) \param regex The regular expression (no default) \param substrings Parens in the regex indicate that I should return matching substrings. Give me the _address_ of an \ref apop_data* set, and I will allocate and fill the text portion with matches. Default= \c NULL, meaning do not return substrings (even if parens exist in the regex). If no match, return an empty \ref apop_data set, so output->textsize[0]==0. \param use_case Should I be case sensitive, \c 'y' or \c 'n'? (default = \c 'n', which is not the POSIX default.) \return Count of matches found. 0 == no match. \c substrings may be allocated and filled if needed. \li If apop_opts.stop_on_warning='n' returns -1 on error (e.g., regex \c NULL or didn't compile). \li If strings==NULL, I return 0---no match---and if \c substrings is provided, set it to \c NULL. \li Here is the test function. Notice that the substring-pulling function call passes \c &subs, not plain \c subs. \include test_regex.c \li Each set of matches will be one row of the output data. E.g., given the regex ([A-Za-z])([0-9]), the column zero of outdata will hold letters, and column one will hold numbers. Use \ref apop_data_transpose to reverse this so that the letters are in outdata->text[0] and numbers in outdata->text[1]. 
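
The repeated-application idea can be sketched with nothing but the POSIX regex calls. The helper below, \c count_ere_matches, is a hypothetical name for illustration and not part of Apophenia. It only counts successive matches of an extended regular expression and does not collect substrings. Unlike the function documented here, it always scans to the end of the string.

```c
#include <regex.h>
#include <stddef.h>

// Count successive, non-overlapping matches of an extended regular
// expression in string. Returns -1 if the regex does not compile.
int count_ere_matches(const char *string, const char *regex, int case_sensitive){
    regex_t re;
    int flags = REG_EXTENDED | (case_sensitive ? 0 : REG_ICASE);
    if (regcomp(&re, regex, flags)) return -1;

    int found_ct = 0;
    int first = 1;
    regmatch_t whole[1];            // slot zero is the whole match
    while (*string && !regexec(&re, string, 1, whole, first ? 0 : REG_NOTBOL)){
        found_ct++;
        // Resume after this match; advance at least one character so
        // an empty match cannot loop forever.
        string += whole[0].rm_eo > 0 ? whole[0].rm_eo : 1;
        first = 0;
    }
    regfree(&re);
    return found_ct;
}
```

Passing \c REG_NOTBOL on every application after the first keeps a leading \c ^ from matching at each restart point, which is the same precaution the function body below takes.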
*/ #ifdef APOP_NO_VARIADIC int apop_regex(const char *string, const char* regex, apop_data **substrings, const char use_case){ #else apop_varad_head(int, apop_regex){ const char * apop_varad_var(string, NULL); apop_data **apop_varad_var(substrings, NULL); if (!string) { if (substrings) *substrings=NULL; return 0; } const char * apop_varad_var(regex, NULL); Apop_stopif(!regex, return -1, 0, "You gave me a NULL regex."); const char apop_varad_var(use_case, 'n'); return apop_regex_base(string, regex, substrings, use_case); } int apop_regex_base(const char *string, const char* regex, apop_data **substrings, const char use_case){ #endif regex_t re; int matchcount=count_parens(regex); int found, found_ct=0; regmatch_t result[matchcount+1]; int compiled_ok = !regcomp(&re, regex, REG_EXTENDED + (use_case=='y' ? 0 : REG_ICASE) + (substrings ? 0 : REG_NOSUB) ); Apop_stopif(!compiled_ok, return -1, 0, "This regular expression didn't compile: \"%s\"", regex); int matchrow = 0; if (substrings) *substrings = apop_data_alloc(); do { found_ct+= found = !regexec(&re, string, matchcount+1, result, matchrow ? REG_NOTBOL : 0); if (substrings && found){ *substrings = apop_text_alloc(*substrings, matchrow+1, matchcount); //match zero is the whole string; ignore. for (int i=0; i< matchcount; i++){ if (result[i+1].rm_eo > 0){//GNU peculiarity: match-to-empty marked with -1. int length_of_match = result[i+1].rm_eo - result[i+1].rm_so; if ((*substrings)->text[matchrow][i] != apop_nul_string) free((*substrings)->text[matchrow][i]); (*substrings)->text[matchrow][i] = malloc(strlen(string)+1); memcpy((*substrings)->text[matchrow][i], string + result[i+1].rm_so, length_of_match); (*substrings)->text[matchrow][i][length_of_match] = '\0'; } //else matches nothing; apop_text_alloc already made this cell this NULL. 
            }
            string += result[0].rm_eo; //end of whole match;
            matchrow++;
        }
    } while (substrings && found && string[0]!='\0');
    regfree(&re);
    return found_ct;
}

/** RNG from a Generalized Hypergeometric type B3.

 Devroye uses this as the base for many of his distribution-generators, including the Waring.

\li If any of the inputs is <=0, error; return NaN and print a warning.
*/  //Header in stats.h
double apop_rng_GHgB3(gsl_rng * r, double* a){
    Apop_stopif(!((a[0]>0) && (a[1] > 0) && (a[2] > 0)), return NAN, 0, "all inputs must be positive.");
    double aa = gsl_ran_gamma(r, a[0], 1),
           b  = gsl_ran_gamma(r, a[1], 1),
           c  = gsl_ran_gamma(r, a[2], 1);
    int p = gsl_ran_poisson(r, aa*b/c);
    return p;
}

/** The Beta distribution is useful for modeling because it is bounded between zero
  and one, and can be either unimodal (if the variance is low) or bimodal (if the
  variance is high), and can have either a slant toward the bottom or top of the
  range (depending on the mean).

  The distribution has two parameters, typically named \f$\alpha\f$ and \f$\beta\f$, which
  can be difficult to interpret. However, there is a one-to-one mapping between (alpha,
  beta) pairs and (mean, variance) pairs. Since we have good intuition about the
  meaning of means and variances, this function takes in a mean and variance, calculates
  alpha and beta behind the scenes, and returns the appropriate Beta distribution.

\param m The mean the Beta distribution should have. Notice that m is strictly between zero and one, i.e., in (0,1).
\param v The variance which the Beta distribution should have. It is in (0, 1/12), where (1/12) is the variance of a Uniform(0,1) distribution. Funny things happen with variance near 1/12 and mean far from 1/2.
\return Returns an \ref apop_model produced by copying the \c apop_beta model and setting its parameters appropriately.
\exception out->error=='r' Range error: mean is not strictly within (0, 1).
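
\li The (mean, variance) to (alpha, beta) mapping can be checked by hand. What follows is a minimal standalone sketch, not part of Apophenia: the \c beta_params struct and the three functions are hypothetical names introduced only for this illustration. It uses k = m(1-m)/v - 1, alpha = m*k, beta = (1-m)*k, and then recovers the mean and variance from the standard Beta moment formulas. The extra check v < m(1-m) is an assumption added here, because otherwise k is not positive and neither are alpha and beta.

```c
// Hypothetical types and helpers for illustration; not Apophenia's API.
typedef struct { double alpha, beta; int ok; } beta_params;

// Map (mean, variance) to (alpha, beta). ok==0 flags an infeasible request.
beta_params beta_from_mean_var(double m, double v){
    beta_params out = {0, 0, 0};
    if (m <= 0 || m >= 1 || v <= 0 || v >= m*(1-m)) return out;
    double k = m*(1-m)/v - 1;   // k equals alpha + beta
    out.alpha = m*k;
    out.beta  = (1-m)*k;
    out.ok = 1;
    return out;
}

// Mean of a Beta(alpha, beta): alpha/(alpha+beta).
double beta_mean(beta_params p){ return p.alpha/(p.alpha + p.beta); }

// Variance of a Beta(alpha, beta): alpha*beta/((alpha+beta)^2 (alpha+beta+1)).
double beta_var(beta_params p){
    double s = p.alpha + p.beta;
    return p.alpha*p.beta/(s*s*(s+1));
}
```

With m=0.5 and v=0.05, k works out to 4, so alpha and beta are both 2; feeding that pair back through \c beta_mean and \c beta_var returns the mean and variance that were requested.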
*/ apop_model *apop_beta_from_mean_var(double m, double v){ Apop_stopif(m>=1|| m<=0, apop_model *out = apop_model_copy(apop_beta); out->error='r'; return out, 0, "You asked for a beta distribution " "with mean %g, but the mean of the beta will always " "be strictly between zero and one.", m); double k = (m * (1- m)/ v) -1; double alpha = m*k; double beta = k * (1-m); return apop_model_set_parameters(apop_beta, alpha, beta); } /** \def apop_rng_get_thread The \c gsl_rng is not itself thread-safe, in the sense that it can not be used simultaneously by multiple threads. However, if each thread has its own \c gsl_rng, then each will safely operate independently. Thus, Apophenia keeps an internal store of RNGs for use by threaded functions. If the input to this function, \c thread, is greater than any previous input, then the array of gsl_rngs is extended to length \c thread, and each element extended using ++apop_opts.rng_seed (i.e., the seed is incremented before use). This function can be used anywhere a \c gsl_rng would be used. \param thread_in The number of the RNG to retrieve, starting at zero (which is how OpenMP numbers its threads). If -1, I'll look up the current thread (via \c omp_get_thread_num) for you. See \ref threading for additional notes. In most cases, you want to use apop_rng_get_thread(-1). \return The appropriate RNG, initialized if necessary. \hideinitializer */ gsl_rng *apop_rng_get_thread_base(int thread){ static gsl_rng **rngs; static int rng_ct = -1; if (thread==-1){ #ifdef OpenMP thread = omp_get_thread_num(); #else thread = 0; #endif } OMP_critical(rng_get_thread) if (thread > rng_ct) { rngs = realloc(rngs, sizeof(gsl_rng*)*(thread+1)); for (int i=rng_ct+1; i<= thread; i++) rngs[i] = apop_rng_alloc(++apop_opts.rng_seed); rng_ct = thread; } return rngs[thread]; } /** Make a set of random draws from a model and write them to an \ref apop_data set. \param model The model from which draws will be made. Must already be prepared and/or estimated. 
\param count The number of draws to make. If \c draws is not \c NULL, then this is ignored and count=draws->matrix->size1. default=1000.
\param draws If not \c NULL, a pre-allocated data set whose \c matrix element will be filled with draws.

\return An \ref apop_data set with the matrix filled with \c count draws. If draws!=NULL, then return a pointer to it.

\exception out->error=='n' Input model isn't good for making draws: it is \c NULL, or model->dsize==0.
\exception out->error=='m' You gave me a \c draws data set, but its \c matrix element is \c NULL.
\exception out->error=='s' You gave me a \c draws matrix, but its size is less than the size of a single draw from the data, model->dsize.
\exception out->error=='d' Trouble drawing from the distribution for at least one row. That row is set to all \c NAN.

\li Prints a warning if you send in a non-NULL apop_data set, but its \c matrix element is \c NULL, when apop_opts.verbose>=1.
\li See also \ref apop_draw, which makes a single draw.
\li Random numbers are generated using RNGs from \ref apop_rng_get_thread, qv.

Here is a two-line program to draw a different set of ten Standard Normals on every run (provided runs are more than a second apart):

\include draw_some_normals.c

\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_model_draws(apop_model *model, int count, apop_data *draws){
#else
apop_varad_head(apop_data *, apop_model_draws){
    apop_model * apop_varad_var(model, NULL);
    Apop_stopif(!model, apop_return_data_error(n), 0, "Input model is NULL.");
    Apop_stopif(!model->dsize, apop_return_data_error(n), 0, "Input model has dsize==0.");
    apop_data * apop_varad_var(draws, NULL);
    int apop_varad_var(count, 1000);
    if (draws) {
        Apop_stopif(!draws->matrix, draws->error='m'; return draws, 1, "Input data set's matrix is NULL.");
        //Report a size error as 's'; do not overwrite it with 'm'.
        Apop_stopif((int)draws->matrix->size2 < model->dsize, draws->error='s'; return draws, 1, "Input data set's matrix column count is less than model->dsize.");
        count = draws->matrix->size1;
    } else
        Apop_stopif(model->dsize<=0, apop_return_data_error(n), 0, "model->dsize<=0, so I don't know the size of matrix to allocate.");
    return apop_model_draws_base(model, count, draws);
}

 apop_data * apop_model_draws_base(apop_model *model, int count, apop_data *draws){
#endif
    apop_data *out = draws ? draws : apop_data_alloc(count, model->dsize);
    OMP_for (int i=0; i< count; i++){
        apop_data *onerow = Apop_r(out, i);
        Apop_stopif(apop_draw(onerow->matrix->data, apop_rng_get_thread(omp_threadnum), model),
                gsl_matrix_set_all(onerow->matrix, GSL_NAN); out->error='d',
                0, "Trouble drawing for row %i. "
                   "I set it to all NANs and set out->error='d'.", i);
    }
    return out;
}
apophenia-1.0+ds/apop_bootstrap.c000066400000000000000000000230551262736346100171370ustar00rootroot00000000000000

/** \file apop_bootstrap.c

Copyright (c) 2006--2007 by Ben Klemens. Licensed under the GPLv2; see COPYING. */

#include "apop_internal.h"

/** Initialize a \c gsl_rng.

  Uses the Tausworthe routine.

\param seed The seed. No need to get funny with it: 0, 1, and 2 will produce wholly different streams.
\return The RNG ready for your use.
\li If you are confident that your code is debugged and would like a new stream of values every time your program runs (provided your runs are more than a second apart), seed with the time: \include draw_some_normals.c */ gsl_rng *apop_rng_alloc(int seed){ static int first_use = 1; if (first_use){ first_use = 0; OMP_critical(rng_env_setup) //GSL makes vague promises about thread-safety gsl_rng_env_setup(); } gsl_rng *setme = gsl_rng_alloc(gsl_rng_taus2); gsl_rng_set(setme, seed); return setme; } /** Give me a data set and a model, and I'll give you the jackknifed covariance matrix of the model parameters. The basic algorithm for the jackknife (glossing over the details): create a sequence of data sets, each with exactly one observation removed, and then produce a new set of parameter estimates using that slightly shortened data set. Then, find the covariance matrix of the derived parameters. \li Jackknife or bootstrap? As a broad rule of thumb, the jackknife works best on models that are closer to linear. The worse a linear approximation does (at the given data), the worse the jackknife approximates the variance. \param in The data set. An \ref apop_data set where each row is a single data point. \param model An \ref apop_model, that will be used internally by \ref apop_estimate. \exception out->error=='n' \c NULL input data. \return An \c apop_data set whose matrix element is the estimated covariance matrix of the parameters. \see apop_bootstrap_cov For example: \include jack.c */ apop_data * apop_jackknife_cov(apop_data *in, apop_model *model){ Apop_stopif(!in, apop_return_data_error(n), 0, "The data input can't be NULL."); Get_vmsizes(in); //msize1, msize2, vsize apop_model *e = apop_model_copy(model); int i, n = GSL_MAX(msize1, GSL_MAX(vsize, in->textsize[0])); apop_model *overall_est = e->parameters ? 
e : apop_estimate(in, e);//if not estimated, do so gsl_vector *overall_params = apop_data_pack(overall_est->parameters); gsl_vector_scale(overall_params, n); //do it just once. gsl_vector *pseudoval = gsl_vector_alloc(overall_params->size); //Copy the original, minus the first row. apop_data *subset = apop_data_copy(Apop_rs(in, 1, n-1)); apop_name *tmpnames = in->names; in->names = NULL; //save on some copying below. apop_data *array_of_boots = apop_data_alloc(n, overall_params->size); for(i = -1; i< n-1; i++){ //Get a view of row i, and copy it to position i-1 in the short matrix. if (i >= 0) apop_data_memcpy(Apop_r(subset, i), Apop_r(in, i)); apop_model *est = apop_estimate(subset, e); gsl_vector *estp = apop_data_pack(est->parameters); gsl_vector_memcpy(pseudoval, overall_params);// *n above. gsl_vector_scale(estp, n-1); gsl_vector_sub(pseudoval, estp); gsl_matrix_set_row(array_of_boots->matrix, i+1, pseudoval); apop_model_free(est); gsl_vector_free(estp); } in->names = tmpnames; apop_data *out = apop_data_covariance(array_of_boots); gsl_matrix_scale(out->matrix, 1./(n-1.)); apop_data_free(subset); gsl_vector_free(pseudoval); apop_data_free(array_of_boots); if (e!=overall_est) apop_model_free(overall_est); apop_model_free(e); gsl_vector_free(overall_params); return out; } /** Give me a data set and a model, and I'll give you the bootstrapped covariance matrix of the parameter estimates. \param data The data set. An \c apop_data set where each row is a single data point. (No default) \param model An \ref apop_model, whose \c estimate method will be used here. (No default) \param iterations How many bootstrap draws should I make? (default: 1,000) \param rng An RNG that you have initialized, probably with \c apop_rng_alloc. (Default: an RNG from \ref apop_rng_get_thread) \param keep_boots Deprecated; use \c boot_store. \param boot_store If not \c NULL, put the list of drawn parameter values here, with one parameter set per row. 
Sample use: apop_data *boots; apop_bootstrap_cov(data, model, .boot_store=&boots); apop_data_print(boots); They are packed via \ref apop_data_pack, so use \ref apop_data_unpack if needed. (Default: 'n') \code apop_data *boot_output = apop_bootstrap_cov(your_data, your_model, .keep_boots='y'); apop_data *boot_stats = apop_data_get_page(boot_output, ""); printf("The statistics calculated on the 28th iteration:\n"); gsl_vector *row_27 = Apop_rv(boot_stats, 27); apop_data_print(apop_data_unpack(row_27)); \endcode \param ignore_nans If \c 'y' and any of the elements in the estimation return \c NaN, then I will throw out that draw and try again. If \c 'n', then I will write that set of statistics to the list, \c NaN and all. I keep count of throw-aways; if there are more than \c iterations elements thrown out, then I throw an error and return with estimates using data I have so far. That is, I assume that \c NaNs are rare edge cases; if they are as common as good data, you might want to rethink how you are using the bootstrap mechanism. (Default: 'n') \return An \c apop_data set whose matrix element is the estimated covariance matrix of the parameters. \exception out->error=='n' \c NULL input data. \exception out->error=='N' \c too many NaNs. \li This function uses the \ref designated syntax for inputs. This example is a sort of demonstration of the Central Limit Theorem. The model is a simulation, where each call to the estimation routine produces the mean/std dev of a set of draws from a Uniform Distribution. Because the simulation takes no inputs, \ref apop_bootstrap_cov simply re-runs the simulation and calculates a sequence of mean/std dev pairs, and reports the covariance of that generated data set. 
\include boot_clt.c

\see apop_jackknife_cov
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_bootstrap_cov(apop_data * data, apop_model *model, gsl_rng *rng, int iterations, char keep_boots, char ignore_nans, apop_data **boot_store){
#else
apop_varad_head(apop_data *, apop_bootstrap_cov) {
    apop_data * apop_varad_var(data, NULL);
    apop_model *model = varad_in.model;
    int apop_varad_var(iterations, 1000);
    gsl_rng * apop_varad_var(rng, apop_rng_get_thread());
    char apop_varad_var(keep_boots, 'n');
    apop_data** apop_varad_var(boot_store, NULL);
    char apop_varad_var(ignore_nans, 'n');
    return apop_bootstrap_cov_base(data, model, rng, iterations, keep_boots, ignore_nans, boot_store);
}

 apop_data * apop_bootstrap_cov_base(apop_data * data, apop_model *model, gsl_rng *rng, int iterations, char keep_boots, char ignore_nans, apop_data **boot_store){
#endif
    Get_vmsizes(data); //vsize, msize1, msize2
    apop_model *e = apop_model_copy(model);
    apop_data *subset = apop_data_copy(data);
    apop_data *array_of_boots = NULL,
              *summary;
    //prevent an infinite regress of covariance calculations.
    Apop_model_add_group(e, apop_parts_wanted); //default wants for nothing.
    size_t i, nan_draws=0;
    apop_name *tmpnames = (data && data->names) ? data->names : NULL; //save on some copying below.
if (data && data->names) data->names = NULL; int height = GSL_MAX(msize1, GSL_MAX(vsize, (data?(*data->textsize):0))); for (i=0; iparameters); if (!gsl_isnan(apop_sum(estp))){ if (i==0){ array_of_boots = apop_data_alloc(iterations, estp->size); apop_name_stack(array_of_boots->names, est->parameters->names, 'c', 'v'); apop_name_stack(array_of_boots->names, est->parameters->names, 'c', 'c'); apop_name_stack(array_of_boots->names, est->parameters->names, 'c', 'r'); } gsl_matrix_set_row(array_of_boots->matrix, i, estp); } else if (ignore_nans=='y'){ i--; nan_draws++; } apop_model_free(est); gsl_vector_free(estp); } if(data) data->names = tmpnames; apop_data_free(subset); apop_model_free(e); int set_error=0; Apop_stopif(i == 0 && nan_draws == iterations, apop_return_data_error(N), 1, "I ran into %i NaNs and no not-NaN estimations, and so stopped. " , iterations); Apop_stopif(nan_draws == iterations, set_error++; apop_matrix_realloc(array_of_boots->matrix, i, array_of_boots->matrix->size2), 1, "I ran into %i NaNs, and so stopped. Returning results based " "on %zu bootstrap iterations.", iterations, i); summary = apop_data_covariance(array_of_boots); if (!boot_store && (keep_boots == 'n' || keep_boots == 'N')) apop_data_free(array_of_boots); if (keep_boots != 'n' && keep_boots != 'N') //deprecated version apop_data_add_page(summary, array_of_boots, ""); if (boot_store) *boot_store = array_of_boots; if (set_error) summary->error = 'N'; return summary; } apophenia-1.0+ds/apop_conversions.c000066400000000000000000001626661262736346100175060ustar00rootroot00000000000000 /** \file apop_conversions.c The various functions to convert from one format to another. */ /* Copyright (c) 2006--2010, 2012 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include //GSL_NAN #include #include /*extend a string. 
this prevents a minor leak you'd get if you did asprintf(&q, "%s is a teapot.", q); q may be NULL, which prints the string "null", so use the little XN macro below when using this function. This is internal to apop. right now. */ void xprintf(char **q, char *format, ...){ va_list ap; char *r = *q; va_start(ap, format); Apop_stopif(vasprintf(q, format, ap)==-1, , 0, "Trouble writing to a string."); va_end(ap); free(r); } /** Copies a one-dimensional array to a gsl_vector. The input array is undisturbed. \param in An array of doubles. (No default. Must not be \c NULL); \param size How long \c line is. If this is zero or omitted, I'll guess using the sizeof(line)/sizeof(line[0]) trick, which will work for most arrays allocated using double [] and won't work for those allocated using double *. (default = auto-guess) \return A gsl_vector, allocated and filled with a copy of (not a pointer to) the input data. \li If you send in a \c NULL vector, you get a \c NULL pointer in return. I warn you of this if apop_opts.verbosity >=1 . \li This function uses the \ref designated syntax for inputs. \see \ref apop_data_falloc */ #ifdef APOP_NO_VARIADIC gsl_vector * apop_array_to_vector(double *in, int size){ #else apop_varad_head(gsl_vector *, apop_array_to_vector){ double * apop_varad_var(in, NULL); Apop_assert_c(in, NULL, 1, "You sent me NULL data; returning NULL."); int apop_varad_var(size, sizeof(in)/sizeof(in[0])); return apop_array_to_vector_base(in, size); } gsl_vector * apop_array_to_vector_base(double *in, int size){ #endif gsl_vector *out = gsl_vector_alloc(size); gsl_vector_view v = gsl_vector_view_array((double*)in, size); gsl_vector_memcpy(out,&(v.vector)); return out; } /** This function copies the data in a vector to a new one-column (or one-row) matrix and returns the newly-allocated and filled matrix. For the reverse, try \ref apop_data_pack. \param in a \c gsl_vector (No default. 
If \c NULL, I return \c NULL, with a warning if apop_opts.verbose >=1 ) \param row_col If \c 'r', then this will be a row (1 x N) instead of the default, a column (N x 1). (default: \c 'c') \return a newly-allocated gsl_matrix with one column (or row). \li If you send in a \c NULL vector, you get a \c NULL pointer in return. I warn you of this if apop_opts.verbosity >=2 . \li If \c gsl_matrix_alloc fails you get a \c NULL pointer in return. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC gsl_matrix * apop_vector_to_matrix(const gsl_vector *in, char row_col){ #else apop_varad_head(gsl_matrix *, apop_vector_to_matrix){ const gsl_vector * apop_varad_var(in, NULL); Apop_assert_c(in, NULL, 2, "Converting NULL vector to NULL matrix."); char apop_varad_var(row_col, 'c'); return apop_vector_to_matrix_base(in, row_col); } gsl_matrix * apop_vector_to_matrix_base(const gsl_vector *in, char row_col){ #endif bool isrow = (row_col == 'r' || row_col == 'R'); gsl_matrix *out = isrow ? gsl_matrix_alloc(1, in->size) : gsl_matrix_alloc(in->size, 1); Apop_assert(out, "gsl_matrix_alloc failed; probably out of memory."); (isrow ? gsl_matrix_set_row : gsl_matrix_set_col)(out, 0, in); return out; } static int find_cat_index(char **d, char * r, int start_from, int size){ //used for apop_db_to_crosstab. int i = start_from % size; //i is probably the same or i+1. do { if(!strcmp(d[i], r)) return i; i++; i %= size; //loop around as necessary. } while (i!=start_from); Apop_assert_c(0, -2, 0, "Something went wrong in the crosstabbing; couldn't find %s.", r); } /**Give the name of a table in the database, and optional names of three of its columns: the x-dimension, the y-dimension, and the data. The output is a 2D matrix with rows indexed by 'row' and cols by 'col' and the cells filled with the entry in the 'data' column. \param tabname The database table I'm querying. Anything that will work inside a \c from clause is OK, such as a subquery in parens. 
(no default; must not be \c NULL)
\param row The column of the data set that will indicate the rows of the output crosstab (default: the constant "1", which puts everything in a single row)
\param col The column of the data set that will indicate the columns of the output crosstab (default: the constant "1", which puts everything in a single column)
\param data The column of the data set holding the data for the cells of the crosstab (default: count(*))
\param is_aggregate Set to \c 'y' if the \c data is a function like count(*) or sum(col). That is, set to \c 'y' if querying this would require a group by clause. (default: if I find an end-paren in \c data, \c 'y'; else \c 'n'.)

\li If the query to get data to fill the table (select row, col, data from tabname) returns an empty data set, then I will return a \c NULL data set and if apop_opts.verbosity >= 1 print a warning.

\exception out->error='n' Name not found error.
\exception out->error='q' Query returned an empty table (which might mean that it just failed).

\li The simplest use is to get a tally of how often (r1, r2) appears in the data via apop_db_to_crosstab("datatab", "r1", "r2").
\li If you want a 1-D crosstab, omit the other dimension. Or omit both to get a grand tally of your statistic for the entire table.
\li There is a command-line tool, apop_db_to_crosstab, that calls this function.
\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_db_to_crosstab(char const*tabname, char const*row, char const* col, char const*data, char is_aggregate){
#else
apop_varad_head(apop_data *, apop_db_to_crosstab){
    char const* apop_varad_var(tabname, NULL);
    Apop_stopif(!tabname, return NULL, 1, "Missing tabname. Returning NULL.");
    char const* apop_varad_var(row, "1");
    char const* apop_varad_var(col, "1");
    char const* apop_varad_var(data, "count(*)");
    //This '(' balances the end-paren below, keeping m4 from losing the thread.
    //Note the transitional check for "group by", which we should one day remove.
char apop_varad_var(is_aggregate, (strchr(data, ')') && !strstr(data, "group by"))?'y':'n'); return apop_db_to_crosstab_base(tabname, row, col, data, is_aggregate); } apop_data * apop_db_to_crosstab_base(char const*tabname, char const*row, char const* col, char const*data, char is_aggregate){ #endif gsl_matrix *out=NULL; int i, j=0; apop_data *pre_d1=NULL, *pre_d2=NULL, *datachars=NULL; apop_data *outdata = apop_data_alloc(); char* p = apop_opts.db_name_column; apop_opts.db_name_column = NULL;//we put this back at the end. char *Q; Asprintf(&Q, "select %s, %s, %s from %s %s %s %s %s", row, col, data, tabname, is_aggregate!='n' ? "group by" : "", is_aggregate!='n' ? row : "", is_aggregate!='n' ? "," : "", is_aggregate!='n' ? col : ""); datachars = apop_query_to_text("%s", Q); Apop_stopif(!datachars, free(Q); return NULL, 2, "[%s] returned an empty table.", Q); Apop_stopif(datachars->error, free(Q); goto bailout, 0, "error from [%s].", Q); //A bit inefficient, but well-encapsulated. //Pull the distinct (sorted) list of headers, copy into outdata->names. 
    pre_d1 = apop_query_to_text("select distinct %s, 1 from %s order by %s", row, tabname, row);
    Apop_stopif(!pre_d1||pre_d1->error, outdata->error='q'; goto bailout, 0, "Error querying %s from %s.", row, tabname);
    for (i=0; i < pre_d1->textsize[0]; i++)
        apop_name_add(outdata->names, pre_d1->text[i][0], 'r');

    pre_d2 = apop_query_to_text("select distinct %s from %s order by %s", col, tabname, col);
    Apop_stopif(!pre_d2||pre_d2->error, outdata->error='q'; goto bailout, 0, "Error querying %s from %s.", col, tabname);
    for (i=0; i < pre_d2->textsize[0]; i++)
        apop_name_add(outdata->names, pre_d2->text[i][0], 'c');

    out = gsl_matrix_calloc(pre_d1->textsize[0], pre_d2->textsize[0]);
    i = 0; //start the row search at the top, so find_cat_index always gets an in-range starting point; j is already zero.
    for (size_t k =0; k< datachars->textsize[0]; k++){
        i = find_cat_index(outdata->names->row, datachars->text[k][0], i, pre_d1->textsize[0]);
        j = find_cat_index(outdata->names->col, datachars->text[k][1], j, pre_d2->textsize[0]);
        Apop_stopif(i==-2 || j == -2, outdata->error='n'; goto bailout, 0, "Something went wrong in the crosstabbing; "
                                     "couldn't find %s or %s.", datachars->text[k][0], datachars->text[k][1]);
        gsl_matrix_set(out, i, j, atof(datachars->text[k][2]));
    }
    bailout:
    apop_data_free(pre_d1);
    apop_data_free(pre_d2);
    apop_data_free(datachars);
    outdata->matrix = out;
    apop_opts.db_name_column = p;
    return outdata;
}

/** See \ref apop_db_to_crosstab for the storyline; this is the complement, which takes a crosstab and writes its values to the database.

For example, I would take
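<table frame=box>
<tr><td> </td><td>c0</td><td>c1</td></tr>
<tr><td>r0</td><td>2</td><td>3</td></tr>
<tr><td>r1</td><td>0</td><td>4</td></tr>
</table>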
and do the following writes to the database:

\code
insert into your_table values ('r0', 'c0', 2);
insert into your_table values ('r0', 'c1', 3);
insert into your_table values ('r1', 'c0', 0);
insert into your_table values ('r1', 'c1', 4);
\endcode

\li If your data set does not have names (or not enough names), I will use the scheme above, filling in names of the form r0, r1, ... c0, c1, .... Text columns get their own names, t0, t1.
\li This function handles only the matrix and text.
*/
void apop_crosstab_to_db(apop_data *in, char *tabname, char *row_col_name,
                                char *col_col_name, char *data_col_name){
    apop_name *n = in->names;
    char *colname, *rowname;
    Get_vmsizes(in); //msize1, msize2
    int maxrow= GSL_MAX(msize1, in->textsize[0]);
    int maxcol= GSL_MAX(msize2, in->textsize[1]);
    //Room for a one-letter prefix, every digit of the largest index, and the closing '\0'.
    char sparerow[(maxrow > 0 ? (int)log10(maxrow)+1 : 1) + 2];
    char sparecol[(maxcol > 0 ? (int)log10(maxcol)+1 : 1) + 2];
#define DbType apop_opts.db_engine=='m' ? "text" : "character"
#define DbType2 apop_opts.db_engine=='m' ? "double" : "numeric"
    apop_query("CREATE TABLE %s (%s %s, %s %s, %s %s)", tabname,
                        row_col_name, DbType, col_col_name, DbType, data_col_name, DbType2);
    apop_query("begin");
    for (int i=0; i< msize1; i++){
        rowname = (n->rowct > i) ? n->row[i] : (sprintf(sparerow, "r%i", i), sparerow);
        for (int j=0; j< msize2; j++){
            colname = (n->colct > j) ? n->col[j] : (sprintf(sparecol, "c%i", j), sparecol);
            double x = gsl_matrix_get(in->matrix, i, j);
            if (!isnan(x)) apop_query("INSERT INTO %s VALUES ('%s', '%s', %g)", tabname, rowname, colname, x);
            else apop_query("INSERT INTO %s VALUES ('%s', '%s', 0/0)", tabname, rowname, colname);
        }
    }
    for (int i=0; i< in->textsize[0]; i++){
        rowname = (n->rowct > i) ? n->row[i] : (sprintf(sparerow, "r%i", i), sparerow);
        for (int j=0; j< in->textsize[1]; j++){
            colname = (n->textct > j) ?
n->text[j] : (sprintf(sparecol, "t%i", j), sparecol); apop_query("INSERT INTO %s VALUES ('%s', '%s', '%s')", tabname, rowname, colname, in->text[i][j]); } } apop_query("commit"); } /** One often finds data where the column indicates the value of the data point. There may be two columns, and a mark in the first indicates a miss while a mark in the second is a hit. Or say that we have the following list of observations: \code 2 3 3 2 1 1 2 1 1 2 1 1 \endcode Then we could write this as: \code 0 1 2 3 ---------- 0 6 4 2 \endcode because there are six 1s observed, four 2s observed, and two 3s observed. We call this rank format, because 1 (or zero) is typically the most common, 2 is second most common, et cetera. This function takes in a list of observations, and aggregates them into a single row in rank format. \li For the complement, see \ref apop_data_rank_expand. \li See also \ref apop_data_to_factors to convert real numbers or text into a matrix of categories. \param in The input \ref apop_data set. If \c NULL, return \c NULL. \param min_bins If this is omitted, the number of bins is simply the largest number found. So if there are bins {0, 1, 2} and your data set happens to consist of 0 0 1 1 0, then I won't know to generate results with three bins where the last bin has a count of zero. Set .min_bins=2 to ensure that bin is included. \include test_ranks.c \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC apop_data * apop_data_rank_compress(apop_data *in, int min_bins){ #else apop_varad_head(apop_data *, apop_data_rank_compress){ apop_data * apop_varad_var(in, NULL); if (!in) return NULL; int apop_varad_var(min_bins, 0); return apop_data_rank_compress_base(in, min_bins); } apop_data * apop_data_rank_compress_base(apop_data *in, int min_bins){ #endif Get_vmsizes(in); int upper_bound = GSL_MAX(in->matrix ? gsl_matrix_max(in->matrix) : min_bins, in->vector ? 
gsl_vector_max(in->vector) : min_bins); apop_data *out = apop_data_calloc(1, upper_bound+1); for (int i=0; i< msize1; i++) for (int j=0; j< msize2; j++) (*gsl_matrix_ptr(out->matrix, 0, apop_data_get(in, i, j)))++; for (int i=0; i< vsize; i++) (*gsl_matrix_ptr(out->matrix, 0, apop_data_get(in, i, -1)))++; return out; } /** The complement to this is \ref apop_data_rank_compress; see that function's documentation for the story and an example. This function takes in a data set where the zeroth column includes the count(s) of times that zero was observed, the first gives the count(s) of times that one was observed, et cetera. It outputs a data set whose vector element includes a list that has exactly the given frequency of zeros, ones, et cetera. */ apop_data *apop_data_rank_expand (apop_data *in){ int total_ct = (in->matrix ? apop_matrix_sum(in->matrix) : 0) + (in->vector ? apop_vector_sum(in->vector) : 0); if (total_ct == 0) return NULL; apop_data *out = apop_data_alloc(total_ct); int posn = 0; for (int i=0; i< in->matrix->size1; i++) for (int k=0; k< in->matrix->size2; k++) for (int j=0; j< gsl_matrix_get(in->matrix, i, k); j++) gsl_vector_set(out->vector, posn++, k); return out; } /** Copy one gsl_vector to another. That is, all data is duplicated. Unlike gsl_vector_memcpy, this function allocates and returns the destination, so you can use it like this: \code gsl_vector *a_copy = apop_vector_copy(original); \endcode \param in The input vector \return A structure that this function will allocate and fill. If \c gsl_vector_alloc fails, returns \c NULL and print a warning. */ gsl_vector *apop_vector_copy(const gsl_vector *in){ if (!in) return NULL; gsl_vector *out = gsl_vector_alloc(in->size); Apop_stopif(!out, return NULL, 0, "failed to allocate a gsl_vector of size %zu. Out of memory?", in->size); gsl_vector_memcpy(out, in); return out; } /** Copy one gsl_matrix to another. That is, all data are duplicated. 
Unlike gsl_matrix_memcpy, this function allocates and returns the destination, so you can use it like this: \code gsl_matrix *a_copy = apop_matrix_copy(original); \endcode \param in the input data \return A structure that this function will allocate and fill. If \c gsl_matrix_alloc fails, returns \c NULL. */ gsl_matrix *apop_matrix_copy(const gsl_matrix *in){ if (!in) return NULL; gsl_matrix *out = gsl_matrix_alloc(in->size1, in->size2); Apop_stopif(!out, return NULL, 0, "failed to allocate a gsl_matrix of size %zu x %zu. Out of memory?", in->size1, in->size2); gsl_matrix_memcpy(out, in); return out; } ///////////////The text processing section /** \page text_format Input text file formatting This reference section describes the assumptions made by \ref apop_text_to_db and \ref apop_text_to_data. Each row of the file will be converted to one record in the database or one row in the matrix. Values on one row are separated by delimiters. Fixed-width input is also OK; see below. By default, the delimiters are set to "|,\t", meaning that a pipe, comma, or tab will delimit separate entries. To change the default, use an argument to \ref apop_text_to_db or \ref apop_text_to_data like .delimiters=" \t" or .delimiters="|". The input text file must be UTF-8 or traditional ASCII encoding. Delimiters must be ASCII characters. If your data is in another encoding, try the POSIX-standard \c iconv program to filter the data to UTF-8. \li The character after a backslash is read as a normal character, even if it is a delimiter, \c #, or \c ". \li If a field contains several such special characters, surround it by \c "s. The surrounding marks are stripped and the text read verbatim. \li Text does not need to be delimited by quotes (unless there are special characters). If a text field is quote-delimited, I'll strip them. E.g., "Males, 30-40", is an OK column name, as is "Males named \"Joe\\"". \li Everything after an unprotected \c # is taken to be comments and ignored. 
\li Blank lines (empty or consisting only of white space) are also ignored. \li If you are reading into the gsl_matrix element of an \ref apop_data set, any field that cannot be read as a number is written as NaN. You will be warned of such substitutions unless you set apop_opts.verbose=0 beforehand. For mixed text/numeric data, try using \ref apop_text_to_db and then \ref apop_query_to_mixed_data. \li There are often two delimiters in a row, e.g., "23, 32,, 12". When it's two commas like this, the user typically means that there is a missing value and the system should insert a NaN; when it is two tabs in a row, this is typically just a formatting glitch. Thus, if there are multiple delimiters in a row, I check whether the second (and subsequent) is a space or a tab; if it is, then it is ignored, and if it is any other delimiter (including the end of the line) then a NaN is inserted. If this rule doesn't work for your situation, you can explicitly insert a note that there is a missing data point. E.g., try: \code perl -pi.bak -e 's/,,/,NaN,/g' data_file \endcode If your data uses a marker for missing values, you will need to set \ref apop_opts_type "apop_opts.nan_string" to text that matches the given format. E.g., \code //Apophenia's default NaN string, matching NaN, nan, or NAN, but not Nancy: apop_opts.nan_string = "NaN"; //Popular alternatives: apop_opts.nan_string = "Missing"; apop_opts.nan_string = "."; //Or, turn off nan-string checking entirely with: apop_opts.nan_string = NULL; \endcode SQLite stores these NaN-type values internally as \c NULL; that means that functions like \ref apop_query_to_data will convert both your \c nan_string string and \c NULL to \c NaN. \li The system uses the standards for C's \c atof() function for floating-point numbers: INFINITY, -INFINITY, and NaN work as expected. \li If there are row names and column names, then the input will not be perfectly square: the top row of column names should not have a leading entry sitting over the row names.
That is, for a 100x100 data set with row and column names, there are 100 names in the top row, and 101 entries in each subsequent row (name plus 100 data points). \li White space before or after a field is ignored. So 1, 2,3, 4 , 5, " six ",7 is equivalent to 1,2,3,4,5," six ",7. \li NUL characters ('\0') are treated as white space, so if your fields have NULs as padding, you should have no problem. A NUL inside of a string terminates the string, as it always does in C. \li Fixed-width formats are supported (for plain ASCII encoding only), but you have to provide a list of field ending positions. For example, given \code NUMLEOL 123AABB 456CCDD \endcode and .field_ends=(int[]){3, 5, 7}, we have three columns, named NUM, LE, and OL. The names can be read from the first row by setting .has_col_names='y'. */ static int prep_text_reading(char const *text_file, FILE **infile){ *infile = !strcmp(text_file, "-") ? stdin : fopen(text_file, "r"); Apop_assert_c(*infile, 1, 0, "Trouble opening %s. Returning NULL.", text_file); return 0; } /////New text file reading /** \cond doxy_ignore */ extern char *apop_nul_string; #define Textrealloc(str, len) (str) = \ (str) != apop_nul_string \ ? realloc((str), (len)) \ : (((len) > 0) ? malloc(len) : apop_nul_string); typedef struct {int ct; int eof;} line_parse_t; /** \endcond */ static line_parse_t parse_a_fixed_line(FILE *infile, apop_data *fn, int const *field_ends){ int c = fgetc(infile); int ct = 0, posn=0, thisflen=0, needfield=1; while(c!='\n' && c !=EOF){ posn++; if (needfield){//start a new field if (++ct > fn->textsize[0]) apop_text_alloc(fn, ct, 1);//realloc text portion. thisflen = needfield = 0; } //extend field: thisflen++; Textrealloc(*fn->text[ct-1], thisflen); fn->text[ct-1][0][thisflen-1] = c; if (posn==*field_ends){ //close off this field.
Textrealloc(*fn->text[ct-1], thisflen+1); fn->text[ct-1][0][thisflen] = '\0'; thisflen = 0; field_ends++; needfield=1; } c = fgetc(infile); } if (needfield==0){//user didn't give last field end. Textrealloc(*fn->text[ct-1], thisflen+1); fn->text[ct-1][0][thisflen] = '\0'; } return (line_parse_t) {.ct=ct, .eof= (c == EOF)}; } /** \cond doxy_ignore */ typedef struct{ char c, type; } apop_char_info; /** \endcond */ static const size_t bs=1e5; static int get_next(char *buffer, size_t *ptr, FILE *infile){ int r; if (*ptr>=bs){ size_t len=fread(buffer, 1, bs, infile); if (len < bs) buffer[len]=(char)-1; *ptr=0; } r = buffer[(*ptr)++]; return r == (char)-1 ? EOF : r; } static apop_char_info parse_next_char(char *buffer, size_t *ptr, FILE *f, char const *delimiters){ int c = get_next(buffer, ptr, f); int is_delimiter = !!strchr(delimiters, c); return (apop_char_info){.c=c, .type = (c==' '||c=='\r' ||c=='\t' || c==0)? (is_delimiter ? 'W' : 'w') :is_delimiter ? 'd' :(c == '\n') ? 'n' :(c == '"') ? '"' :(c == '\\') ? '\\' :(c == EOF) ? 'E' :(c == '#') ? '#' : 'r' }; } //fills fn with a list of strings. //returns the count of elements. Negate the count if we're at EOF. //fn must already be allocated via apop_data_alloc() [no args]. static line_parse_t parse_a_line(FILE *infile, char *buffer, size_t *ptr, apop_data *fn, int const *field_ends, char const *delimiters){ int ct=0, thisflen=0, inqq=0, infield=0, mlen=5, lastwhite=0, lastnonwhite=0; if (field_ends) return parse_a_fixed_line(infile, fn, field_ends); apop_char_info ci; do { ci = parse_next_char(buffer, ptr, infile, delimiters); //comments are to end of line, so they're basically a newline. if (ci.type=='#' && !inqq){ for(int c='x'; (c!='\n' && c!=EOF); ) c = get_next(buffer, ptr, infile); ci.type='n'; } //The escape-type cases: \\ and "". 
//If one applies, set the type to regular if (ci.type=='\\'){ ci=parse_next_char(buffer, ptr, infile, delimiters); if (ci.type!='E') ci.type='r'; } if ((inqq && ci.type !='"') && ci.type !='E') ci.type='r'; else if (ci.type=='"') inqq = !inqq; if (ci.type=='W' && lastwhite==1) continue; //compress these. lastwhite=(ci.type=='W'); if (!infield){ if (ci.type=='w') continue; //eat leading spaces. if (ci.type=='r' || ci.type=='d' //new field; if 'dnE', blank field. || (strchr("nE", ci.type) && ct>0)){ //Blank fields only at end of lines that already have data; else all-blank line to ignore. if (++ct > fn->textsize[0]) apop_text_alloc(fn, ct, 1);//realloc text portion. Textrealloc(*fn->text[ct-1], 5); thisflen = 0; mlen=5; infield=1; } } if (infield){ if (ci.type=='d'||ci.type=='n' || ci.type=='E' || ci.type=='W'){ //delimiter; close off this field. fn->text[ct-1][0][lastnonwhite] = '\0'; infield = thisflen = lastnonwhite = 0; } else if (ci.type=='w' || ci.type=='r'){ //extend field thisflen++; //length of string if (thisflen+2 > mlen){ mlen *=2; //length of allocated memory Textrealloc(*fn->text[ct-1], mlen); } fn->text[ct-1][0][thisflen-1] = ci.c; if (ci.type!='w') lastnonwhite = thisflen; } } } while (ci.type != 'n' && ci.type != 'E'); return (line_parse_t) {.ct=ct, .eof= (ci.type == 'E')}; } //On return, fn has copies of the field names, and add_this_line has the first data line. 
static void get_field_names(int has_col_names, char **field_names, FILE *infile, char *buffer, size_t *ptr, apop_data *add_this_line, apop_data *fn, int const *field_ends, char const *delimiters){ if (has_col_names && field_names == NULL){ while (fn->textsize[0] ==0) parse_a_line(infile, buffer, ptr, fn, field_ends, delimiters); while (add_this_line->textsize[0] ==0) parse_a_line(infile, buffer, ptr, add_this_line, field_ends, delimiters); } else{ while (add_this_line->textsize[0] ==0) parse_a_line(infile, buffer, ptr, add_this_line, field_ends, delimiters); fn = apop_text_alloc(fn, add_this_line->textsize[0], 1); for (int i=0; i< fn->textsize[0]; i++) if (field_names) apop_text_set(fn, i, 0, field_names[i]); else apop_text_set(fn, i, 0, "col_%i", i); } } /** Read a delimited or fixed-width text file into the matrix element of an \ref apop_data set. See \ref text_format. See also \ref apop_text_to_db, which handles text data, and may otherwise be a preferable approach to data management. \param text_file The name of the text file to be read in. If "-" (the default), use stdin. \param has_row_names Do the lines of data have row names? \c 'y' =yes; \c 'n' =no (default: 'n') \param has_col_names Is the top line a list of column names? See \ref text_format for notes on dimension (default: 'y') \param field_ends If fields have a fixed size, give the end of each field, e.g. .field_ends=(int[]){3, 8, 11}. (default: \c NULL, indicating not fixed width) \param delimiters A string listing the characters that delimit fields. (default: "|,\t") \return Returns an apop_data set. \exception out->error=='a' allocation error \exception out->error=='t' text-reading error Example: see \ref apop_ols. \li This function uses the \ref designated syntax for inputs.
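How a single text field becomes a matrix entry can be sketched with the standard library alone. This is only an illustration: \c field_to_double is a hypothetical name, not part of Apophenia, and it leaves out the warning that the real function prints when a non-blank field cannot be converted.

```c
#include <math.h>
#include <stdlib.h>

// Hypothetical helper, for illustration only: convert one text field to a
// double using strtod, as the body of the reader does.
// Blank fields and fields where nothing could be parsed become NaN.
double field_to_double(char const *field){
    if (!field || field[0]=='\0') return NAN;   // blank field: missing value
    char *end;
    double val = strtod(field, &end);
    return (end == field) ? NAN : val;          // nothing parsed: not a number
}
```

Under these assumptions, a field such as "12abc" is read as 12, because strtod stops at the first character it cannot parse and the only check is whether it parsed anything at all.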
*/ #ifdef APOP_NO_VARIADIC apop_data * apop_text_to_data(char const*text_file, int has_row_names, int has_col_names, int const *field_ends, char const *delimiters){ #else apop_varad_head(apop_data *, apop_text_to_data){ char const *apop_varad_var(text_file, "-") int apop_varad_var(has_row_names, 'n') int apop_varad_var(has_col_names, 'y') if (has_row_names==1||has_row_names=='Y') has_row_names ='y'; if (has_col_names==1||has_col_names=='Y') has_col_names ='y'; int const * apop_varad_var(field_ends, NULL); const char * apop_varad_var(delimiters, apop_opts.input_delimiters); return apop_text_to_data_base(text_file, has_row_names, has_col_names, field_ends, delimiters); } apop_data * apop_text_to_data_base(char const*text_file, int has_row_names, int has_col_names, int const *field_ends, char const *delimiters){ #endif apop_data *set = NULL; FILE *infile = NULL; char *str; char buffer[bs]; size_t ptr=bs; apop_data *add_this_line= apop_data_alloc(); int row = 0, hasrows = (has_row_names == 'y'); Apop_stopif(prep_text_reading(text_file, &infile), apop_return_data_error(t), 0, "trouble opening %s", text_file); line_parse_t L={ }; //First, handle the top line, if we're told that it has column names. if (has_col_names=='y'){ apop_data *field_names = apop_data_alloc(); get_field_names(1, NULL, infile, buffer, &ptr, add_this_line, field_names, field_ends, delimiters); L.ct = *add_this_line->textsize; set = apop_data_alloc(0,1, L.ct - hasrows); set->names->colct = 0; set->names->col = malloc(sizeof(char*)); for (int j=0; j< L.ct - hasrows; j++) apop_name_add(set->names, *field_names->text[j], 'c'); apop_data_free(field_names); } //Now do the body. while(!set || !L.eof || L.ct){ if (!L.ct) { //skip blank lines L=parse_a_line(infile,buffer, &ptr, add_this_line, field_ends, delimiters); continue; } if (!set) set = apop_data_alloc(0, 1, L.ct-hasrows); //for .has_col_names=='n'. row++; int cols = set->matrix ? 
set->matrix->size2 : L.ct - hasrows; set->matrix = apop_matrix_realloc(set->matrix, row, cols); Apop_stopif(!set->matrix, set->error='a'; return set, 0, "allocation error."); if (hasrows) { apop_name_add(set->names, *add_this_line->text[0], 'r'); Apop_stopif(L.ct-1 > set->matrix->size2, set->error='t'; return set, 1, "row %i (not counting rownames) has %i elements (not counting the rowname), " "but I thought this was a data set with %zu elements per row. " "Stopping the file read; returning what I have so far.", row, L.ct-1, set->matrix->size2); } else Apop_stopif(L.ct > set->matrix->size2, set->error='t'; return set, 1, "row %i has %i elements, " "but I thought this was a data set with %zu elements per row. " "Stopping the file read; returning what I have so far. Set has_row_names?", row, L.ct, set->matrix->size2); for (int col=hasrows; col < L.ct; col++){ char *thisstr = *add_this_line->text[col]; if (strlen(thisstr)){ double val = strtod(thisstr, &str); if (thisstr != str) gsl_matrix_set(set->matrix, row-1, col-hasrows, val); else { gsl_matrix_set(set->matrix, row-1, col-hasrows, GSL_NAN); Apop_notify(1, "trouble converting data item %i on data line %i [%s]; writing NaN.", col, row, thisstr); } } else gsl_matrix_set(set->matrix, row-1, col-hasrows, GSL_NAN); } if (L.eof) break;//hit when the last line has elements and is terminated by EOF. L=parse_a_line(infile, buffer, &ptr, add_this_line, field_ends, delimiters); } apop_data_free(add_this_line); if (strcmp(text_file,"-")) fclose(infile); return set; } /** This is the complement to \ref apop_data_pack, qv. It writes the \c gsl_vector produced by that function back to the \ref apop_data set you provide. It overwrites the data in the vector and matrix elements and, if present, the \c weights (and that's it, so names or text are as before). \param in A \c gsl_vector of the form produced by \ref apop_data_pack. No default; must not be \c NULL. \param d That data set to be filled. 
Must be allocated to the correct size. No default; must not be \c NULL. \param use_info_pages Pages whose titles are in XML-style angle brackets (i.e., titles matching the regular expression ^<.*>$) will be ignored unless you set .use_info_pages='y'. Be sure that this is set to the same thing when you both pack and unpack. (Default: \c 'n'). \li If I get to the end of the first page of the \c apop_data set and have more entries in the vector to unpack, and the data to fill has a \c more element, then I will continue into subsequent pages. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC void apop_data_unpack(const gsl_vector *in, apop_data *d, char use_info_pages){ #else apop_varad_head(void, apop_data_unpack){ const gsl_vector * apop_varad_var(in, NULL); apop_data* apop_varad_var(d, NULL); Apop_stopif(!d, return, 0, "the data set to be filled, d, must not be NULL"); char apop_varad_var(use_info_pages, 'n'); apop_data_unpack_base(in, d, use_info_pages); } void apop_data_unpack_base(const gsl_vector *in, apop_data *d, char use_info_pages){ #endif int offset = 0; gsl_vector vin, vout; if(d->vector){ vin = gsl_vector_subvector((gsl_vector *)in, 0, d->vector->size).vector; gsl_vector_memcpy(d->vector, &vin); offset += d->vector->size; } if(d->matrix) for (size_t i=0; i< d->matrix->size1; i++){ vin = gsl_vector_subvector((gsl_vector *)in, offset, d->matrix->size2).vector; vout = gsl_matrix_row(d->matrix, i).vector; gsl_vector_memcpy(&vout, &vin); offset += d->matrix->size2; } if(d->weights){ vin = gsl_vector_subvector((gsl_vector *)in, offset, d->weights->size).vector; gsl_vector_memcpy(d->weights, &vin); offset += d->weights->size; } if (offset != in->size && d->more){ vin = gsl_vector_subvector((gsl_vector *)in, offset, in->size - offset).vector; d = d->more; if (use_info_pages=='n') while (d && apop_regex(d->names->title, "^<.*>$")) d = d->more; Apop_stopif(!d, return, 0, "The data set (without info pages, because you didn't ask" " me to use them) is too short for the input vector.");
apop_data_unpack(&vin, d); } } static size_t sizecount(const apop_data *in, bool all_pp, bool use_info_pp){ if (!in) return 0; if (!use_info_pp && in->names && apop_regex(in->names->title, "^<.*>$")) return (all_pp ? sizecount(in->more, all_pp, use_info_pp) : 0); return (in->vector ? in->vector->size : 0) + (in->matrix ? in->matrix->size1 * in->matrix->size2 : 0) + (in->weights ? in->weights->size : 0) + (all_pp ? sizecount(in->more, all_pp, use_info_pp) : 0); } /** This function takes in an \ref apop_data set and writes it as a single column of numbers, outputting a \c gsl_vector. If this function allocates the output vector, it is valid to use the \c out_vector->data element as an array of \c doubles of size \c out_vector->size (i.e., its stride is 1). The complement is \c apop_data_unpack. I.e., \code apop_data_unpack(apop_data_pack(in_data), data_copy) \endcode will fill \c data_copy with the numbers of the original data set (text and names are not copied). \param in an \c apop_data set. No default; if \c NULL, return \c NULL. \param out If this is not \c NULL, then put the output here. The dimensions must match exactly. If \c NULL, then allocate a new vector. Default = \c NULL. \param more_pages If \c 'y', then follow the ->more pointer to fill subsequent pages; else fill only the first page. Informational pages will still be ignored, unless you set .use_info_pages='y' as well. Default = \c 'y'. \param use_info_pages Pages whose titles are in XML-style angle brackets (i.e., titles matching the regular expression ^<.*>$) will be ignored unless you set .use_info_pages='y'. Be sure that this is set to the same thing when you both pack and unpack. Default: 'n'. \return A \c gsl_vector with the vector data (if any), then each row of data (if any), then the weights (if any), then the same for subsequent pages (if any && .more_pages=='y'). If \c out is not \c NULL, then this is \c out. \exception NULL If you provide \c out and its size is not correct, returns \c NULL. \li This function uses the \ref designated syntax for inputs.
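The ordering given in the return note (vector, then each matrix row, then weights) can be sketched without GSL. This is a simplified illustration under stated assumptions: \c toy_page and \c toy_pack are hypothetical names, not part of Apophenia; the matrix is assumed to be stored row-major in a plain array; and only a single page is handled.

```c
#include <stddef.h>

// Hypothetical stand-in for one apop_data page, using plain arrays rather
// than GSL types. Every name here is for illustration only.
typedef struct {
    double const *vector;  size_t vsize;        // may be NULL
    double const *matrix;  size_t rows, cols;   // row-major; may be NULL
    double const *weights; size_t wsize;        // may be NULL
} toy_page;

// Write the page into out in packing order: vector, then each matrix row,
// then weights. Returns the number of elements written.
// out must have room for vsize + rows*cols + wsize doubles.
size_t toy_pack(toy_page const *in, double *out){
    size_t offset = 0;
    if (in->vector)
        for (size_t i=0; i < in->vsize; i++) out[offset++] = in->vector[i];
    if (in->matrix)
        for (size_t i=0; i < in->rows; i++)
            for (size_t j=0; j < in->cols; j++)
                out[offset++] = in->matrix[i*in->cols + j];
    if (in->weights)
        for (size_t i=0; i < in->wsize; i++) out[offset++] = in->weights[i];
    return offset;
}
```

Unpacking walks the same order in reverse direction of copy, which is why the pack and unpack calls need matching page settings.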
*/ #ifdef APOP_NO_VARIADIC gsl_vector * apop_data_pack(const apop_data *in, gsl_vector *out, char more_pages, char use_info_pages){ #else apop_varad_head(gsl_vector *, apop_data_pack){ const apop_data * apop_varad_var(in, NULL); if (!in) return NULL; gsl_vector * apop_varad_var(out, NULL); char apop_varad_var(more_pages, 'y'); char apop_varad_var(use_info_pages, 'n'); if (out) { size_t total_size = sizecount(in, (more_pages == 'y' || more_pages == 'Y'), (use_info_pages =='y' || use_info_pages =='Y')); Apop_stopif(out->size != total_size, return NULL, 0, "The input data set has %zu elements, " "but the output vector you want to fill has size %zu. Please make " "these sizes equal.", total_size, out->size); } return apop_data_pack_base(in, out, more_pages, use_info_pages); } gsl_vector * apop_data_pack_base(const apop_data *in, gsl_vector *out, char more_pages, char use_info_pages){ #endif size_t total_size = sizecount(in, (more_pages == 'y' || more_pages == 'Y'), (use_info_pages =='y' || use_info_pages =='Y')); if (!total_size) return NULL; int offset = 0; if (!out) out = gsl_vector_alloc(total_size); gsl_vector vout, vin; if (in->vector){ vout = gsl_vector_subvector((gsl_vector *)out, 0, in->vector->size).vector; gsl_vector_memcpy(&vout, in->vector); offset += in->vector->size; } if (in->matrix) for (size_t i=0; i< in->matrix->size1; i++){ vin = gsl_matrix_row(in->matrix, i).vector; vout= gsl_vector_subvector((gsl_vector *)out, offset, in->matrix->size2).vector; gsl_vector_memcpy(&vout, &vin); offset += in->matrix->size2; } if (in->weights){ vout = gsl_vector_subvector((gsl_vector *)out, offset, in->weights->size).vector; gsl_vector_memcpy(&vout, in->weights); offset += in->weights->size; } if ((more_pages == 'y' ||more_pages =='Y') && in->more){ while (use_info_pages=='n' && in->more && apop_regex(in->more->names->title, "^<.*>$")) in = in->more; if (in->more){ vout = gsl_vector_subvector((gsl_vector *)out, offset, out->size - offset).vector; 
apop_data_pack(in->more, &vout); } } return out; } /** \def apop_data_falloc Allocate a data set and fill it with values. Put the data set dimensions (one, two, or three dimensions as per \ref apop_data_alloc) in parens, then the data (as per \ref apop_data_fill). E.g.: \code apop_data *identity2 = apop_data_falloc((2,2), 1, 0, 0, 1); apop_data *count_vector = apop_data_falloc((5), 0, 1, 2, 3, 4); \endcode If you forget the parens, you will get an obscure error during compilation. \li This is a simple macro wrapping \ref apop_data_fill and \ref apop_data_alloc, because they appear together so often. The second example expands to: \code apop_data *count_vector = apop_data_fill(apop_data_alloc(5), 0, 1, 2, 3, 4); \endcode */ /** \def apop_data_fill Fill a pre-allocated data set with values. \param adfin An \c apop_data set (that you have already allocated). \param ... A series of at least as many floating-point values as there are blanks in the data set. \return A pointer to the same data set that was input. \li I need as many arguments as the size of the data set, and can't count them for you. Too many will be ignored; too few will produce unpredictable results, which may include padding your matrix with garbage or a simple segfault. \li Underlying this function is a base function that takes a single list, as opposed to the set of unassociated numbers sent to \ref apop_data_fill. See the example below for a comparison. \li This function assumes that if the \ref apop_data set has both \c vector and \c matrix, then vector->size==matrix->size1. \li See also \ref apop_data_falloc to allocate and fill on one line. E.g., to generate a unit vector for three dimensions: \code apop_data *unit_vector = apop_data_falloc((3), 1, 1, 1); \endcode An example, using both a loose list of numbers and an array. 
\include data_fill.c \see apop_text_fill, apop_data_falloc, apop_data_unpack */ apop_data *apop_data_fill_base(apop_data *in, double ap[]){ /* In conversions.h, you'll find this header, which turns all but the first input into an array of doubles of indeterminate length: #define apop_data_fill(in, ...) apop_data_fill_base((in), (double []) {__VA_ARGS__}) */ if (!in) return NULL; int k=0, start=0, fin=0, height=0; if (in->vector){ start = -1; height = in->vector->size; } if (in->matrix){ fin = in->matrix->size2; height = in->matrix->size1; } for (int i=0; i< height; i++) for (int j=start; j< fin; j++) apop_data_set(in, i, j, ap[k++]); return in; } /** \def apop_vector_fill Fill a pre-allocated \c gsl_vector with values. See \ref apop_data_alloc for a relevant example. See also \ref apop_matrix_alloc. Warning: I need as many arguments as the size of the vector, and can't count them for you. Too many will be ignored; too few will produce unpredictable results, which may include padding your vector with garbage or a simple segfault. \param avfin A \c gsl_vector (that you have already allocated). \param ... A series of exactly as many values as there are spaces in the vector. \return A pointer to the same vector that was input. */ gsl_vector *apop_vector_fill_base(gsl_vector *in, double ap[]){ if (!in) return NULL; for (int i=0; i< in->size; i++) gsl_vector_set(in, i, ap[i]); return in; } /** \def apop_text_fill(in, ap) Fill the text part of an already-allocated \ref apop_data set with a list of strings. \param dataset A data set that you already prepared with \ref apop_text_alloc. \param ... A list of strings. The first row is filled first, then the second, and so on to the end of the text grid. \li If an element is \c NULL, write apop_opts.nan_string at that point. You may prefer to use "" to express a blank. 
\li If you provide more or fewer strings than are needed to fill the text grid and apop_opts.verbose >=1, I print a warning and continue to the end of the text grid or data set, whichever is shorter. \li If the data set is \c NULL, I return \c NULL. If you provide a \c NULL data set but a non-NULL list of text elements, and apop_opts.verbose >=1, I print a warning and return \c NULL. \li Remember that the C preprocessor concatenates two adjacent strings into one. Here is an attempt to fill a \f$ 2\times 3\f$ grid: \code apop_data *one23 = apop_text_fill(apop_text_alloc(NULL, 2, 3), "one", "two", "three" //missing comma! "two", "four", "six"); \endcode The preprocessor will join "three" "two" to form "threetwo", leaving you with only five strings. \li If you have a \c NULL-terminated array of strings (not just a loose list as above), then use \c apop_text_fill_base. */ apop_data *apop_text_fill_base(apop_data *data, char* text[]){ int textct = 0; for (char **textptr = text; *textptr; textptr++) textct++; Apop_stopif(!data && textct, return NULL, 1, "NULL data set input; returning NULL."); if (!data) return NULL; int gridsize = data ? data->textsize[0]*data->textsize[1] : 0; Apop_stopif(textct != gridsize, /*continue*/, 1, "Data set has a text grid " "of size %i but you gave me %i strings.", gridsize, textct); int ctr=0; for (int i=0; i< data->textsize[0]; i++) for (int j=0; j< data->textsize[1]; j++) apop_text_set(data, i, j, text[ctr++]); return data; } ///////The rest of this file is for apop_text_to_db extern sqlite3 *db; static char *get_field_conditions(char *var, apop_data *field_params){ if (field_params) for (int i=0; i< field_params->textsize[0]; i++) if (apop_regex(var, field_params->text[i][0])) return field_params->text[i][1]; return (apop_opts.db_engine == 'm') ?
"varchar(100)" : "numeric"; } static int tab_create_mysql(char *tabname, int has_row_names, apop_data *field_params, char *table_params, apop_data const *fn){ char *q = NULL; Asprintf(&q, "create table %s", tabname); for (int i=0; i < *fn->textsize; i++){ if (i==0) xprintf(&q, has_row_names ? "%s (row_names varchar(100), " : "%s (", q); else xprintf(&q, "%s %s, ", q, get_field_conditions(*fn->text[i-1], field_params)); xprintf(&q, "%s %s", q, *fn->text[i]); } xprintf(&q, "%s %s%s%s)", q, get_field_conditions(*fn->text[fn->textsize[0]-1], field_params) , table_params? ", ": "", XN(table_params)); apop_query("%s", q); Apop_stopif(!apop_table_exists(tabname), return -1, 0, "query \"%s\" failed.", q); free(q); return 0; } static int tab_create_sqlite(char *tabname, int has_row_names, apop_data *field_params, char *table_params, apop_data const *fn){ char *q = NULL; Asprintf(&q, "create table %s", tabname); for (int i=0; i< fn->textsize[0]; i++){ if (i==0){ if (has_row_names) xprintf(&q, "%s ('row_names', ", q); else xprintf(&q, "%s (", q); } else xprintf(&q, "%s' %s, ", q, get_field_conditions(*fn->text[i-1], field_params)); xprintf(&q, "%s '%s", q, *fn->text[i]); } xprintf(&q, "%s' %s%s%s);", q, get_field_conditions(*fn->text[fn->textsize[0]-1], field_params) , table_params? ", ": "", XN(table_params)); apop_query("%s", q); Apop_stopif(!apop_table_exists(tabname), return -1, 0, "query \"%s\" failed.", q); free(q); return 0; } /** --If the string has zero length, then it's probably a missing value. --If the string isn't a number, it needs quotes */ char *prep_string_for_sqlite(int prepped_statements, char const *astring){ if (!astring || astring[0]=='\0' || (apop_opts.nan_string && !strcasecmp(apop_opts.nan_string, astring))) return NULL; char *out = NULL, *tail = NULL; if(strtod(astring, &tail)) /*do nothing.*/; if (*tail!='\0'){ //then it's not a number.
if (!prepped_statements){ if (strchr(astring, '\'')) Asprintf(&out,"\"%s\"", astring); else Asprintf(&out,"'%s'", astring); } else out = strdup(astring); } else { //number, maybe INF or NAN. Also, sqlite wants 0.1, not .1 assert(*astring!='\0'); if (isinf(atof(astring))==1) out = strdup("9e9999999"); else if (isinf(atof(astring))==-1) out = strdup("-9e9999999"); else if (gsl_isnan(atof(astring))) out = strdup("0.0/0.0"); else if (astring[0]=='.') Asprintf(&out, "0%s",astring); else out = strdup(astring); } return out; } static void line_to_insert(line_parse_t L, apop_data const*addme, char const *tabname, sqlite3_stmt *p_stmt, int row){ if (!L.ct) return; int field = 1; char comma = ' '; char *q = NULL; if (!p_stmt) Asprintf(&q, "INSERT INTO %s VALUES (", tabname); for (int col=0; col < L.ct; col++){ char *prepped = prep_string_for_sqlite(!!p_stmt, *addme->text[col]); if (p_stmt){ if (!prepped || !strlen(prepped)) field++; //leave NULL and cleared else Apop_stopif(sqlite3_bind_text(p_stmt, field++, prepped, -1, SQLITE_TRANSIENT)!=SQLITE_OK, /*keep going */, 0, "Something wrong on line %i, field %i [%s].\n" , row, field-1, *addme->text[col]); } else { xprintf(&q, "%s%c %s", q, comma, (prepped && strlen(prepped) ? prepped : " NULL")); comma = ','; } free(prepped); } if (!p_stmt){ apop_query("%s)",q); free (q); } } int apop_use_sqlite_prepared_statements(size_t col_ct){ #if SQLITE_VERSION_NUMBER < 3003009 return 0; #else return (sqlite3_libversion_number() >=3003009 && !(apop_opts.db_engine == 'm') && col_ct <= 999); //Arbitrary SQLite limit on blanks in prepared statements. 
#endif } int apop_prepare_prepared_statements(char const *tabname, size_t col_ct, sqlite3_stmt **statement){ #if SQLITE_VERSION_NUMBER < 3003009 Apop_stopif(1, return -1, 0, "Attempting to prepare prepared statements, but using a version of SQLite that doesn't support them."); #else char *q=NULL; Asprintf(&q, "INSERT INTO %s VALUES (", tabname); for (size_t i = 0; i < col_ct; i++) xprintf(&q, "%s?%c", q, i==col_ct-1 ? ')' : ','); Apop_stopif(!db, return -1, 0, "The database should be open by now but isn't."); Apop_stopif(sqlite3_prepare_v2(db, q, -1, statement, NULL) != SQLITE_OK, return -1, apop_errorlevel, "Failure preparing prepared statement: %s", sqlite3_errmsg(db)); free(q); return 0; #endif } char *cut_at_dot(char const *infile){ char *out = strdup(basename(infile)); for (char *c = out; *c; c++) if (*c=='.') {*c='\0'; return out;} return out; } /** Read a delimited or fixed-width text file into a database table. See \ref text_format. For purely numeric data, you may be able to bypass the database by using \ref apop_text_to_data. See the \ref apop_ols page for an example that uses this function to read in sample data (also listed on that page). Apophenia ships with an \c apop_text_to_db command-line utility, which is a wrapper for this function. Especially if you are using a pre-2007 version of SQLite, there may be a speedup to putting this function in a begin/commit wrapper: \code apop_query("begin;"); apop_data_print(dataset, .output_name="dbtab", .output_type='d'); apop_query("commit;"); \endcode \param text_file The name of the text file to be read in. If \c "-", then read from \c STDIN. (default: "-") \param tabname The name to give the table in the database (default: \c text_file after the last slash and up to the next dot. E.g., text_file=="../data/pant_lengths.csv" gives tabname=="pant_lengths") \param has_row_names Do the lines of data have row names? \c 'y' =yes; \c 'n' =no (default: 'n') \param has_col_names Is the top line a list of column names?
(default: 'y') \param field_names The list of field names, which will be the columns for the table. If has_col_names=='y' and this is \c NULL, I read the names from the top line of the file. If this is not \c NULL, I use the names given here. If has_col_names=='n' and this is \c NULL, the columns are named col_0, col_1, and so on. (default: NULL) \param field_ends If fields have a fixed size, give the end of each field, e.g. .field_ends=(int[]){3, 8, 11}. (default: \c NULL, indicating not fixed width) \param field_params There is an implicit create table in setting up the database. If you want to add a type, constraint, or key, put that here. The relevant part of the input \ref apop_data set is the \c text grid, which should be \f$N \times 2\f$. The first item in each row (your_params->text[n][0], for each \f$n\f$) is a regular expression to match against the variable names; the second item (your_params->text[n][1]) is the type, constraint, and/or key (i.e., what comes after the name in the \c create query). Not all variables need be mentioned; the default type if nothing matches is numeric (varchar(100) when using MySQL). I go in order until I find a regex that matches the given field, so if you don't like the default, then set the last row to have name .*, which is a regex guaranteed to match anything that wasn't matched by an earlier row, and then set the associated type to your preferred default. See \ref apop_regex on details of matching. (default: NULL) \param table_params There is an implicit create table in setting up the database. If you want to add a table constraint or key, such as not null primary key (age, sex), put that here. \param delimiters A string listing the characters that delimit fields. (default: "|,\t") \param if_table_exists What should I do if the table exists?
\c 'n' Do nothing; exit this function. (default)
\c 'd' Retain the table but delete all data; refill with the new data (i.e., call "delete from your_table").
\c 'o' Overwrite the table from scratch, dropping the previous table entirely.
\c 'a' Append new data to the existing table. \return Returns the number of rows on success, -1 on error. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC int apop_text_to_db(char const *text_file, char *tabname, int has_row_names, int has_col_names, char **field_names, int const *field_ends, apop_data *field_params, char *table_params, char const *delimiters, char if_table_exists){ #else apop_varad_head(int, apop_text_to_db){ char const *apop_varad_var(text_file, "-") char *apop_varad_var(tabname, cut_at_dot(text_file)) int apop_varad_var(has_row_names, 'n') int apop_varad_var(has_col_names, 'y') if (has_row_names==1||has_row_names=='Y') has_row_names ='y'; if (has_col_names==1||has_col_names=='Y') has_col_names ='y'; int const *apop_varad_var(field_ends, NULL) char ** apop_varad_var(field_names, NULL) apop_data * apop_varad_var(field_params, NULL) char * apop_varad_var(table_params, NULL) const char * apop_varad_var(delimiters, apop_opts.input_delimiters); char apop_varad_var(if_table_exists, 'n') return apop_text_to_db_base(text_file, tabname, has_row_names, has_col_names, field_names, field_ends, field_params, table_params, delimiters, if_table_exists); } int apop_text_to_db_base(char const *text_file, char *tabname, int has_row_names, int has_col_names, char **field_names, int const *field_ends, apop_data *field_params, char *table_params, char const *delimiters, char if_table_exists){ #endif int batch_size = 10000, col_ct, ct = 0, rows = 1; FILE *infile; char buffer[bs]; size_t ptr = bs; apop_data *add_this_line = apop_data_alloc(); sqlite3_stmt *statement = NULL; line_parse_t L = {1,0}; bool tab_exists = apop_table_exists(tabname); if (tab_exists){ Apop_stopif(if_table_exists=='n', return -1, 0, "table %s exists; not recreating it.", tabname); if (if_table_exists=='d') apop_query("delete from %s", tabname); else if (if_table_exists=='o') { apop_query("drop table %s", tabname); tab_exists=false; } } //get names and the 
first row. if (prep_text_reading(text_file, &infile)) return -1; apop_data *fn = apop_data_alloc(); get_field_names(has_col_names=='y', field_names, infile, buffer, &ptr, add_this_line, fn, field_ends, delimiters); col_ct = L.ct = *add_this_line->textsize; Apop_stopif(!col_ct, return -1, 0, "counted zero columns in the input file (%s).", tabname); if (!tab_exists) Apop_stopif( ((apop_opts.db_engine=='m') ? tab_create_mysql : tab_create_sqlite)(tabname, has_row_names=='y', field_params, table_params, fn), return -1, 0, "Creating the table in the database failed."); #if SQLITE_VERSION_NUMBER < 3003009 Apop_notify(1, "Apophenia was compiled using a version of SQLite from mid-2007 or earlier. " "The code for reading in text files using such an old version is no longer supported, " "so if errors crop up please see about installing a more recent version of SQLite's library."); #endif int use_sqlite_prepared_statements = apop_use_sqlite_prepared_statements(col_ct); if (use_sqlite_prepared_statements) Apop_stopif(apop_prepare_prepared_statements(tabname, col_ct, &statement), return -1, 0, "Trouble preparing the prepared statement for SQLite."); //done with table & query setup. 
//convert a data line into SQL: insert into TAB values (0.3, 7, "et cetera"); while(L.ct && !L.eof){ line_to_insert(L, add_this_line, tabname, statement, rows); if (apop_opts.verbose > 1 && !(ct++ % batch_size)) {fprintf(stderr, "."); fflush(NULL);} if (use_sqlite_prepared_statements){ int err = sqlite3_step(statement); if (err!=0 && err != 101) //0=ok, 101=done Apop_notify(0, "sqlite insert query gave error code %i.\n", err); Apop_assert_c(!sqlite3_reset(statement), -1, apop_errorlevel, "SQLite error."); #if SQLITE_VERSION_NUMBER >= 3003009 Apop_assert_c(!sqlite3_clear_bindings(statement), -1, apop_errorlevel, "SQLite error."); //needed for NULLs #endif } do { L = parse_a_line(infile, buffer, &ptr, add_this_line, field_ends, delimiters); rows ++; } while (!L.ct && !L.eof); //skip blank lines } apop_data_free(add_this_line); #if SQLITE_VERSION_NUMBER >= 3003009 if (use_sqlite_prepared_statements){ Apop_assert_c(sqlite3_finalize(statement) ==SQLITE_OK, -1, apop_errorlevel, "SQLite error."); } #endif if (strcmp(text_file,"-")) fclose(infile); return rows; } apophenia-1.0+ds/apop_data.c000066400000000000000000002156451262736346100160430ustar00rootroot00000000000000 /** \file The apop_data structure joins together a gsl_matrix, apop_name, and a table of strings. */ /* Copyright (c) 2006--2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" //apop_gsl_error is in apop_linear_algebra.c #define Set_gsl_handler gsl_error_handler_t *prior_handler = gsl_set_error_handler(apop_gsl_error); #define Unset_gsl_handler gsl_set_error_handler(prior_handler); /** Allocate an \ref apop_data structure. \li The typical case is three arguments, like apop_data_alloc(2,3,4): vector size, matrix rows, matrix cols. If the first argument is zero, you get a \c NULL vector. \li Two arguments, apop_data_alloc(2,3), would allocate just a matrix, leaving the vector \c NULL. 
\li One argument, apop_data_alloc(2), would allocate just a vector, leaving the matrix \c NULL. \li Zero arguments, apop_data_alloc(), will produce a basically blank set, with \c out->matrix and \c out->vector set to \c NULL. For allocating the text part, see \ref apop_text_alloc. The \c weights vector is set to \c NULL. If you need it, allocate it via \code d->weights = gsl_vector_alloc(row_ct); \endcode \return The \ref apop_data structure, allocated and ready to be populated with data. \exception out->error=='a' Allocation error. The matrix, vector, or names couldn't be malloced, which probably means that you requested a very large data set. \li An \ref apop_data struct, by itself, is about 72 bytes. If I can't allocate that much memory, I return \c NULL. But if even this much fails, your computer may be on fire and you should go put it out. \li This function uses the \ref designated syntax for inputs. \see apop_data_calloc */ #ifdef APOP_NO_VARIADIC apop_data * apop_data_alloc(const size_t size1, const size_t size2, const int size3){ #else apop_varad_head(apop_data *, apop_data_alloc){ const size_t apop_varad_var(size1, 0); const size_t apop_varad_var(size2, 0); const int apop_varad_var(size3, 0); return apop_data_alloc_base(size1, size2, size3); } apop_data * apop_data_alloc_base(const size_t size1, const size_t size2, const int size3){ #endif size_t vsize=0, msize1=0; int msize2=0; if (size3){ vsize = size1; msize1 = size2; msize2 = size3; } else if (size2) { msize1 = size1; msize2 = size2; } else vsize = size1; apop_data *setme = malloc(sizeof(apop_data)); Apop_stopif(!setme, return NULL, -5, "malloc failed. Probably out of memory."); *setme = (apop_data) { }; //init to zero/NULL. Set_gsl_handler if (msize2 > 0 && msize1 > 0){ setme->matrix = gsl_matrix_alloc(msize1,msize2); Apop_stopif(!setme->matrix, setme->error='a'; return setme, 0, "malloc failed on a %zu x %i matrix. 
Probably out of memory.", msize1, msize2); } if (vsize){ setme->vector = gsl_vector_alloc(vsize); Apop_stopif(!setme->vector, setme->error='a'; return setme, 0, "malloc failed on a vector of size %zu. Probably out of memory.", vsize); } Unset_gsl_handler setme->names = apop_name_alloc(); Apop_stopif(!setme->names, setme->error='a'; return setme, 0, "couldn't allocate names. Probably out of memory."); return setme; } /** Allocate a \ref apop_data structure, to be filled with data; set everything in the allocated portion to zero. See \ref apop_data_alloc for details. \return The \ref apop_data structure, allocated and zeroed out. \exception out->error=='a' allocation error; probably out of memory. \li This function uses the \ref designated syntax for inputs. \see apop_data_alloc */ #ifdef APOP_NO_VARIADIC apop_data * apop_data_calloc(const size_t size1, const size_t size2, const int size3){ #else apop_varad_head(apop_data *, apop_data_calloc){ const size_t apop_varad_var(size1, 0); const size_t apop_varad_var(size2, 0); const int apop_varad_var(size3, 0); return apop_data_calloc_base(size1, size2, size3); } apop_data * apop_data_calloc_base(const size_t size1, const size_t size2, const int size3){ #endif size_t vsize=0, msize1=0; int msize2=0; if (size3){ vsize = size1; msize1 = size2; msize2 = size3; } else if (size2) { msize1 = size1; msize2 = size2; } else vsize = size1; apop_data *setme = malloc(sizeof(apop_data)); Apop_stopif(!setme, apop_return_data_error('a'), 0, "malloc failed. Probably out of memory."); *setme = (apop_data) { }; //init to zero/NULL. if (msize2 >0 && msize1 > 0){ setme->matrix = gsl_matrix_calloc(msize1,msize2); Apop_stopif(!setme->matrix, apop_return_data_error('a'), 0, "malloc failed on a %zu x %i matrix. Probably out of memory.", msize1, msize2); } if (vsize){ setme->vector = gsl_vector_calloc(vsize); Apop_stopif(!setme->vector, apop_return_data_error('a'), 0, "malloc failed on a vector of size %zu. 
Probably out of memory.", vsize); } setme->names = apop_name_alloc(); return setme; } /*For a touch of space saving, blank strings in a text grid all point to the same nul string. */ char *apop_nul_string = ""; static void apop_text_blank(apop_data *in, const size_t row, const size_t col){ if (in->text[row][col] != apop_nul_string) free(in->text[row][col]); in->text[row][col] = apop_nul_string; } /** Free a matrix of chars* (i.e., a char***). This is what \c apop_data_free uses internally to deallocate the \c text element of an \ref apop_data set. You may never need to use it directly. Sample usage: \code apop_text_free(yourdata->text, yourdata->textsize[0], yourdata->textsize[1]); \endcode */ void apop_text_free(char ***freeme, int rows, int cols){ if (rows && cols) for (int i=0; i < rows; i++){ for (int j=0; j < cols; j++) if(freeme[i][j]!=apop_nul_string) free(freeme[i][j]); free(freeme[i]); } free(freeme); } /** Free the elements of the given \ref apop_data set and then the \ref apop_data set itself. Intended to be used by \ref apop_data_free, a macro that calls this to free elements, then sets the value to \c NULL. \li \ref apop_data_free is a macro that calls this function and, on success, sets the input pointer to \c NULL. For typical cases, that's slightly more useful than this function. \exception freeme.error='c' Circular linking is against the rules. If freeme->more == freeme, then I set freeme.error='c' and return. If you send in a structure like A -> B -> B, then both data sets A and B will be marked. \return \c 0 on OK, \c 'c' on error. */ char apop_data_free_base(apop_data *freeme){ if (!freeme) return 0; if (freeme->more){ Apop_stopif(freeme == freeme->more, freeme->error='c'; return 'c', 1, "the ->more element of this data set equals the data set itself. " "This is not healthy. 
Not freeing; marking your data set with error='c'."); if (apop_data_free_base(freeme->more)) Apop_stopif(freeme->more->error == 'c', freeme->error='c'; return 'c', 1, "Propagating error code to parent data set"); } if (freeme->vector) gsl_vector_free(freeme->vector); if (freeme->matrix) gsl_matrix_free(freeme->matrix); if (freeme->weights) gsl_vector_free(freeme->weights); apop_name_free(freeme->names); apop_text_free(freeme->text, freeme->textsize[0] , freeme->textsize[1]); free(freeme); return 0; } /** Copy one \ref apop_data structure to another. This function does not allocate the output structure or the vector, matrix, text, or weights elements---I assume you have already done this and got the dimensions right. I will assert that there is at least enough room in the destination for your data, and fail if the copy would write more elements than there are bins. \li If you want space allocated or are unsure about dimensions, use \ref apop_data_copy. \li If both \c in and \c out have a \c more pointer, also copy subsequent page(s). \li You can use the subsetting macros, \ref Apop_r, \ref Apop_rs, \ref Apop_c, and so on, to copy within a data set: \code //Copy the contents of row i of mydata to row j. apop_data *fromrow = Apop_r(mydata, i); apop_data *torow = Apop_r(mydata, j); apop_data_memcpy(torow, fromrow); // or just apop_data_memcpy(Apop_r(mydata, j), Apop_r(mydata, i)); \endcode \param out A structure that this function will fill. Must be preallocated with the appropriate sizes. \param in The input data. \exception out.error='d' Dimension error. \exception out.error='p' Part missing; e.g., in->matrix exists but out->matrix doesn't. */ void apop_data_memcpy(apop_data *out, const apop_data *in){ Apop_stopif(!out, return, 0, "you are copying to a NULL matrix. Do you mean to use apop_data_copy instead?"); Apop_stopif(out==in, return, 1, "out==in. 
Doing nothing."); if (in->matrix){ Apop_stopif(!out->matrix, out->error='p'; return, 1, "in->matrix exists but out->matrix does not."); Apop_stopif(in->matrix->size1 != out->matrix->size1 || in->matrix->size2 != out->matrix->size2, out->error='d'; return, 1, "you're trying to copy a (%zu X %zu) into a (%zu X %zu) matrix.", in->matrix->size1, in->matrix->size2, out->matrix->size1, out->matrix->size2); gsl_matrix_memcpy(out->matrix, in->matrix); } if (in->vector){ Apop_stopif(!out->vector, out->error='p'; return, 1, "in->vector exists but out->vector does not."); Apop_stopif(in->vector->size != out->vector->size, out->error='d'; return, 1, "You're trying to copy a %zu-elmt " "vector into a %zu-elmt vector.", in->vector->size, out->vector->size); gsl_vector_memcpy(out->vector, in->vector); } if (in->weights){ Apop_stopif(!out->weights, out->error='p'; return, 1, "in->weights exists but out->weights does not."); Apop_stopif(in->weights->size != out->weights->size, out->error='d'; return, 1, "Weight vector sizes don't match: " "you're trying to copy a %zu-elmt vector into a %zu-elmt vector.", in->weights->size, out->weights->size); gsl_vector_memcpy(out->weights, in->weights); } if (in->names){ if (!out->names) out->names = apop_name_alloc(); Asprintf(&out->names->title, "%s", in->names->title); if (out->names->vector && in->names->vector) {Asprintf(&out->names->vector, "%s", in->names->vector);} for (int i=0; i< in->names->rowct; i++) if (i< out->names->rowct) {Asprintf(out->names->row+i, "%s", in->names->row[i]);} else apop_name_add(out->names, in->names->row[i], 'r'); for (int i=0; i< in->names->colct; i++) if (i< out->names->colct) {Asprintf(out->names->col+i, "%s", in->names->col[i]);} else apop_name_add(out->names, in->names->col[i], 'c'); for (int i=0; i< in->names->textct; i++) if (i< out->names->textct) {Asprintf(out->names->text+i, "%s", in->names->text[i]);} else apop_name_add(out->names, in->names->text[i], 't'); } out->textsize[0] = in->textsize[0]; 
out->textsize[1] = in->textsize[1]; if (in->textsize[0] && in->textsize[1]){ Apop_stopif(out->textsize[0] < in->textsize[0] || out->textsize[1] < in->textsize[1], out->error='d'; return, 1, "I am trying to copy a grid of (%zu, %zu) text elements into a grid of (%zu, %zu), " "and that won't work. Please use apop_text_alloc to reallocate the right amount of data, " "or use apop_data_copy for automatic allocation.", in->textsize[0] , in->textsize[1] , out->textsize[0] , out->textsize[1]); for (size_t i=0; i< in->textsize[0]; i++) for(size_t j=0; j < in->textsize[1]; j ++) if (in->text[i][j] == apop_nul_string) apop_text_blank(out, i, j); else apop_text_set(out, i, j, "%s", in->text[i][j]); } if (in->more && out->more) apop_data_memcpy(out->more, in->more); } /** Copy one \ref apop_data structure to another. That is, all data is duplicated. Basically a front-end for \ref apop_data_memcpy for those who prefer this sort of syntax. If the data set has a \c more pointer, that will be followed and subsequent pages copied as well. \param in the input data \return a structure that this function will allocate and fill. If input is NULL, then this will be NULL. \exception out.error='a' Allocation error. \exception out.error='c' Cyclic link: D->more == D (may be later in the chain, e.g., D->more->more = D->more) You'll have only a partial copy. \exception out.error='d' Dimension error; should never happen. \exception out.error='p' Missing part error; should never happen. \li If the input data set has an error, then I will copy it anyway, including the error flag (which might be overwritten). I print a warning if the verbosity level is >=1. */ apop_data *apop_data_copy(const apop_data *in){ if (!in) return NULL; apop_data *out = apop_data_alloc(); Apop_stopif(out->error, return out, 0, "Allocation error."); if (in->error){ Apop_notify(1, "the data set to be copied has an error flag of %c. 
Copying it.", in->error); out->error = in->error; } if (in->more){ Apop_stopif(in == in->more, out->error='c'; return out, 0, "the ->more element of this data set equals the " "data set itself. This is not healthy. Made a partial copy and set out.error='c'."); out->more = apop_data_copy(in->more); Apop_stopif(out->more->error, out->error=out->more->error; return out, 0, "propagating an error in the ->more element to the parent apop_data set. Only a partial copy made."); } if (in->vector){ out->vector = gsl_vector_alloc(in->vector->size); Apop_stopif(!out->vector, out->error='a'; return out, 0, "Allocation error on vector of size %zu.", in->vector->size); } if (in->matrix){ out->matrix = gsl_matrix_alloc(in->matrix->size1, in->matrix->size2); Apop_stopif(!out->matrix, out->error='a'; return out, 0, "Allocation error on matrix " "of size %zu X %zu.", in->matrix->size1, in->matrix->size2); } if (in->weights){ out->weights = gsl_vector_alloc(in->weights->size); Apop_stopif(!out->weights, out->error='a'; return out, 0, "Allocation error on weights vector of size %zu.", in->weights->size); } if (in->textsize[0] && in->textsize[1]){ apop_text_alloc(out, in->textsize[0], in->textsize[1]); Apop_stopif(out->error, return out, 0, "Allocation error on text grid of size %zu X %zu.", in->textsize[0], in->textsize[1]); } apop_data_memcpy(out, in); return out; } /** Put the first data set either on top of or to the left of the second data set. For the opposite operation, see \ref apop_data_split. \param m1 the upper/leftmost data set (default = \c NULL) \param m2 the second data set (default = \c NULL) \param posn If 'r', stack rows of m1 above rows of m2
if 'c', stack columns of m1 to the left of m2's columns
(default = 'r') \param inplace If \c 'y', use \ref apop_matrix_realloc and \ref apop_vector_realloc to modify \c m1 in place. Otherwise, allocate a new \ref apop_data set, leaving \c m1 undisturbed. (default='n') \return The stacked data, either in a new \ref apop_data set or \c m1 \exception out->error=='a' Allocation error. \exception out->error=='d' Dimension error; couldn't make a complete copy. \li The function returns a new data set, meaning that until you apop_data_free() the original data sets, you will be taking up twice as much memory. \li If m1 or m2 are \c NULL, returns a copy of the other element, and if both are \c NULL, returns \c NULL. If \c m2 is \c NULL and \c inplace is \c 'y', returns the original \c m1 pointer unmodified. \li Text is handled as you'd expect: If 'r', one set of text is stacked on top of the other [number of columns must match]; if 'c', one set of text is set next to the other [number of rows must match]. \li \c more is ignored. \li If stacking rows on rows, the output vector is the input vectors stacked accordingly. If stacking columns by columns, the output vector is just a copy of the vector of \c m1 and m2->vector doesn't appear in the output at all. \li The same rules for dealing with the vector(s) hold for the vector(s) of weights. \li Names are a copy of the names for \c m1, with the names for \c m2 appended to the row or column list, as appropriate. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC apop_data * apop_data_stack(apop_data *m1, apop_data * m2, char posn, char inplace){ #else apop_varad_head(apop_data *, apop_data_stack){ apop_data * apop_varad_var(m1, NULL) apop_data * apop_varad_var(m2, NULL) char apop_varad_var(posn, 'r') Apop_stopif(!(posn == 'r' || posn == 'c'), return NULL, 0, "Valid positions are 'r' or 'c'" " you gave me '%c'. Returning NULL.", posn); char apop_varad_var(inplace, 'n') inplace = (inplace == 'y' || inplace == 1 || inplace == 'Y') ? 
1 : 0; return apop_data_stack_base(m1, m2, posn, inplace); } apop_data * apop_data_stack_base(apop_data *m1, apop_data * m2, char posn, char inplace){ #endif if (!m1) return apop_data_copy(m2); if (!m2) return inplace ? m1 : apop_data_copy(m1); apop_data *out = NULL; if (inplace) out = m1; else { apop_data *m = m1->more; //not following the more pointer. m1->more =NULL; out = apop_data_copy(m1); Apop_stopif(out->error, return out, 0, "initial copy failed; leaving."); m1->more = m; } Get_vmsizes(m1); //original sizes of vsize, msize1, msize2. if (m2->names && !out->names) out->names = apop_name_alloc(); if (posn == 'c'){ if (m2->vector && out->vector){ gsl_matrix_view mview = gsl_matrix_view_vector(m2->vector, m2->vector->size, 1); out->matrix = apop_matrix_stack(out->matrix, &mview.matrix, posn, .inplace='y'); apop_name_stack(out->names, m2->names, 'c', 'v'); if (m2->names && !m2->names->vector && m2->names->colct) apop_name_add(out->names, "v", 'c'); } if (m2->vector && !out->vector) { out->vector= apop_vector_copy(m2->vector); if (m2->names->vector) apop_name_add(out->names, m2->names->vector, 'v'); } } out->matrix = apop_matrix_stack(out->matrix, m2->matrix, posn, .inplace='y'); if (posn == 'r'){ out->vector = apop_vector_stack(out->vector, m2->vector, .inplace='y'); out->weights = apop_vector_stack(out->weights, m2->weights, .inplace='y'); } if (m2->text){ //we've already copied m1->text, if any, so if m2->text is NULL, we're done. if (posn=='r'){ Apop_stopif(out->text && m2->textsize[1]!=out->textsize[1], out->error='d'; return out, 0, "The first data set has %zu columns of text and the second has %zu columns. 
" "I can't stack that.", out->textsize[1], m2->textsize[1]); int basetextsize = out->textsize[0]; apop_text_alloc(out, basetextsize+m2->textsize[0], m2->textsize[1]); Apop_stopif(out->error, return out, 0, "Allocation error."); for(int i=0; i< m2->textsize[0]; i++) for(int j=0; j< m2->textsize[1]; j++) if (m2->text[i][j] == apop_nul_string) apop_text_blank(out, i+basetextsize, j); else apop_text_set(out, i+basetextsize, j, "%s", m2->text[i][j]); } else { Apop_stopif(out->text && m2->textsize[0]!=out->textsize[0], out->error='d'; return out, 0, "The first data set has %zu rows of text and the second has %zu rows. " "I can't stack that.", out->textsize[0], m2->textsize[0]); int basetextsize = out->textsize[1]; apop_text_alloc(out, m2->textsize[0], basetextsize+m2->textsize[1]); Apop_stopif(out->error, out->error='a'; return out, 0, "Allocation error."); for(int i=0; i< m2->textsize[0]; i++) for(int j=0; j< m2->textsize[1]; j++) if (m2->text[i][j] == apop_nul_string) apop_text_blank(out, i, j+basetextsize); else apop_text_set(out, i, j+basetextsize, "%s", m2->text[i][j]); apop_name_stack(out->names, m2->names, 't'); } } if ((posn=='r' && m2->names && m2->names->rowct) || (posn=='c' && m2->names && m2->names->colct)){ int min = posn =='r' ? m1->names->rowct : m1->names->colct; int max = posn =='r' ? GSL_MAX(vsize, msize1) : msize2; for (int k = min; k < max; k++) //pad so the name stacking is aligned (if needed) apop_name_add(out->names, "", posn); apop_name_stack(out->names, m2->names, posn); } return out; } /** Split one input \ref apop_data structure into two. For the opposite operation, see \ref apop_data_stack. \param in The \ref apop_data structure to split \param splitpoint The index of what will be the first row/column of the second data set. E.g., if this is -1 and \c r_or_c=='c', then the whole data set will be in the second data set; if this is the length of the matrix then the whole data set will be in the first data set. 
Another way to put it is that for values between zero and the matrix's size, \c splitpoint will equal the number of rows/columns in the first matrix. \param r_or_c If this is 'r' or 'R', then put some rows in the first data set and some in the second; if 'c' or 'C', split columns into first and second data sets. \return An array of two \ref apop_data sets. If one is empty then a \c NULL pointer will be returned in that position. For example, for a data set of 50 rows, apop_data **out = apop_data_split(data, 100, 'r') sets out[0] = apop_data_copy(data) and out[1] = NULL. \li When splitting at a row, the text is also split. \li The \c more pointer is ignored. \li The apop_data->vector is taken to be the -1st element of the matrix. \li Weights will be preserved. If splitting by rows, then the top and bottom parts of the weights vector will be assigned to the top and bottom parts of the main data set. If splitting by columns, identical copies of the weights vector will be assigned to both parts. \li Data is copied, so you may want to call apop_data_free(in) after this. */ apop_data ** apop_data_split(apop_data *in, int splitpoint, char r_or_c){ //A long, dull series of contingencies. Bonus: a reasonable use of goto. 
apop_data **out = malloc(2*sizeof(apop_data *)); out[0] = out[1] = NULL; Apop_stopif(!in, return out, 1, "input was NULL; output will be an array of two NULLs."); gsl_vector v1, v2, w1, w2; gsl_matrix m1, m2; int set_v1 = 1, set_v2 = 1, set_m1 = 1, set_m2 = 1, set_w1 = 1, set_w2 = 1, namev0 = 0, namev1 = 0, namer0 = 0, namer1 = 0, namec0 = 0, namec1 = 0, namersplit = -1, namecsplit = -1; if (r_or_c == 'r' || r_or_c == 'R') { if (splitpoint <=0) out[1] = apop_data_copy(in); else if (in->matrix && splitpoint >= in->matrix->size1) out[0] = apop_data_copy(in); else { namev0 = namev1 = namec0 = namec1 = 1; if (in->vector){ v1 = gsl_vector_subvector(in->vector, 0, splitpoint).vector; v2 = gsl_vector_subvector(in->vector, splitpoint, in->vector->size - splitpoint).vector; } else set_v1 = set_v2 = 0; if (in->weights){ w1 = gsl_vector_subvector(in->weights, 0, splitpoint).vector; w2 = gsl_vector_subvector(in->weights, splitpoint, in->weights->size - splitpoint).vector; } else set_w1 = set_w2 = 0; if (in->matrix){ m1 = gsl_matrix_submatrix (in->matrix, 0, 0, splitpoint, in->matrix->size2).matrix; m2 = gsl_matrix_submatrix (in->matrix, splitpoint, 0, in->matrix->size1 - splitpoint, in->matrix->size2).matrix; } else set_m1 = set_m2 = 0; namersplit=splitpoint; goto allocation; } } else if (r_or_c == 'c' || r_or_c == 'C') { if (in->weights){ w1 = gsl_vector_subvector(in->weights, 0, in->weights->size).vector; w2 = gsl_vector_subvector(in->weights, 0, in->weights->size).vector; } else set_w1 = set_w2 = 0; namer0 = 1; namer1 = 1; if (splitpoint <= -1) out[1] = apop_data_copy(in); else if (in->matrix && splitpoint >= in->matrix->size2) out[0] = apop_data_copy(in); else if (splitpoint == 0){ if (in->vector){ v1 = gsl_vector_subvector(in->vector, 0, in->vector->size).vector; namev0 = 1; } else set_v1 = 0; set_v2 = 0; set_m1 = 0; if (in->matrix){ m2 = gsl_matrix_submatrix (in->matrix, 0, 0, in->matrix->size1, in->matrix->size2).matrix; namec1 = 1; } else set_m2 = 0; goto allocation; } 
else if (splitpoint > 0 && in->matrix && splitpoint < in->matrix->size2){ if (in->vector){ v1 = gsl_vector_subvector(in->vector, 0, in->vector->size).vector; namev0 = 1; } else set_v1 = 0; set_v2 = 0; if (in->matrix){ m1 = gsl_matrix_submatrix (in->matrix, 0, 0, in->matrix->size1, splitpoint).matrix; m2 = gsl_matrix_submatrix (in->matrix, 0, splitpoint, in->matrix->size1, in->matrix->size2-splitpoint).matrix; namecsplit = splitpoint; } else set_m1 = set_m2 = 0; goto allocation; } else { //splitpoint >= in->matrix->size2 if (in->vector){ v1 = gsl_vector_subvector(in->vector, 0, in->vector->size).vector; namev0 = 1; } else set_v1 = 0; set_v2 = 0; if (in->matrix){ m1 = gsl_matrix_submatrix (in->matrix, 0, 0, in->matrix->size1, in->matrix->size2).matrix; namec0 = 1; } else set_m1 = 0; set_m2 = 0; goto allocation; } } else Apop_notify(0, "Please set r_or_c == 'r' or == 'c'. Returning two NULLs."); return out; allocation: out[0] = apop_data_alloc(); out[1] = apop_data_alloc(); if (set_v1) out[0]->vector = apop_vector_copy(&v1); if (set_v2) out[1]->vector = apop_vector_copy(&v2); if (set_m1) out[0]->matrix = apop_matrix_copy(&m1); if (set_m2) out[1]->matrix = apop_matrix_copy(&m2); if (set_w1) out[0]->weights = apop_vector_copy(&w1); if (set_w2) out[1]->weights = apop_vector_copy(&w2); if (namev0 && out[0]) apop_name_stack(out[0]->names, in->names, 'v'); if (namev1 && out[1]) apop_name_stack(out[1]->names, in->names, 'v'); if (namersplit >=0) for (int k=0; k< in->names->rowct; k++){ int which = (k >= namersplit); assert(out[which]); apop_name_add(out[which]->names, in->names->row[k], 'r'); } else { if (namer0 && out[0]) apop_name_stack(out[0]->names, in->names, 'r'); if (namer1 && out[1]) apop_name_stack(out[1]->names, in->names, 'r'); } if (namecsplit >=0) for (int k=0; k< in->names->colct; k++){ int which = (k >= namecsplit); assert(out[which]); apop_name_add(out[which]->names, in->names->col[k], 'c'); } else { if (namec0 && out[0]) apop_name_stack(out[0]->names, 
in->names, 'c'); if (namec1 && out[1]) apop_name_stack(out[1]->names, in->names, 'c'); } //finally, the text [split by rows only] if (r_or_c=='r' && in->textsize[0] && in->textsize[1]){ apop_name_stack(out[1]->names, in->names, 't'); apop_text_alloc(out[0], splitpoint, in->textsize[1]); Apop_stopif(out[0]->error, return out, 0, "Allocation error."); if (in->textsize[0] > splitpoint){ apop_name_stack(out[0]->names, in->names, 't'); apop_text_alloc(out[1], in->textsize[0]-splitpoint, in->textsize[1]); Apop_stopif(out[1]->error, return out, 0, "Allocation error."); } for (int i=0; i< in->textsize[0]; i++) for (int j=0; j< in->textsize[1]; j++){ int whichtext = (i >= splitpoint); int row = whichtext ? i - splitpoint : i; Asprintf(&(out[whichtext]->text[row][j]), "%s", in->text[i][j]); } } return out; } /** Remove the columns set to one in the \c drop vector. \param n the \ref apop_name structure to be pared down \param drop a vector with n->colct elements, mostly zero, with a one marking those columns to be removed. \see \ref apop_data_prune_columns */ static void apop_name_rm_columns(apop_name *n, int *drop){ apop_name *newname = apop_name_alloc(); size_t initial_colct = n->colct; for (size_t i=0; i< initial_colct; i++){ if (drop[i]==0) apop_name_add(newname, n->col[i],'c'); else n->colct--; free(n->col[i]); } free(n->col); n->col = newname->col; //we need to free the newname struct, but leave the column intact. newname->col = NULL; newname->colct = 0; apop_name_free(newname); } static gsl_matrix *apop_matrix_rm_columns(gsl_matrix *in, int *drop){ int ct = 0, //how many columns will not be dropped? 
j = 0; for (size_t i=0; i < in->size2; i++) if (drop[i]==0) ct++; if (ct == in->size2) return apop_matrix_copy(in); if (ct == 0) return NULL; gsl_matrix *out = gsl_matrix_alloc(in->size1, ct); for (size_t i=0; i < in->size2; i++){ if (drop[i]==0){ gsl_vector *v = Apop_cv(&(apop_data){.matrix=in}, i); gsl_matrix_set_col(out, j, v); j ++; } } return out; } /** Remove the columns of the \ref apop_data set corresponding to a nonzero value in the \c drop vector. \li The returned data structure looks like it was modified in place, but the data matrix and the names are duplicated before being pared down, so if your data is taking up more than half of your memory, this may not work. \param d The \ref apop_data structure to be pared down. \param drop An array of ints. If drop[7]==1, then column seven will be cut from the output. A reminder: calloc(in->size2 , sizeof(int)) will fill your array with zeros on allocation, and memset(drop, 1, in->size2 * sizeof(int)) will quickly fill an array of ints with nonzero values. \see apop_data_rm_rows */ void apop_data_rm_columns(apop_data *d, int *drop){ gsl_matrix *freeme = d->matrix; d->matrix = apop_matrix_rm_columns(d->matrix, drop); gsl_matrix_free(freeme); apop_name_rm_columns(d->names, drop); } /** \def apop_data_prune_columns(in, ...) Keep only the columns of a data set that you name. \param in The data set to prune. \param ... A list of names to retain (i.e. the columns that shouldn't be pruned out). For example, if you have run \ref apop_data_summarize, you have columns for several statistics, but may care about only one or two. For example: \include test_pruning.c \li I use a case-insensitive search to find your column. \li If your name matches multiple columns, I'll only give you the first. \li If I can't find a column matching one of your strings, I throw an error to the screen and continue. \li This is a macro calling \ref apop_data_prune_columns_base. 
It packages your list of columns into a list of strings, adds a \c NULL string at the end, and calls that function. \hideinitializer */ /** Keep only the columns of a data set that you name. This is the function called internally by the \ref apop_data_prune_columns macro. In most cases, you'll want to use that macro. An example of the two uses demonstrating the difference: \code apop_data_prune_columns(d, "mean", "median"); char *list[] = {"mean", "median", NULL}; apop_data_prune_columns_base(d, list); \endcode \param d The data set to prune. \param colnames A NULL-terminated list of names to retain. \return A pointer to the input data set, now pruned. \see apop_data_rm_columns */ apop_data* apop_data_prune_columns_base(apop_data *d, char **colnames){ /* In types.h, you'll find an alias that takes the input, wraps it in the cruft that is C's compound literal syntax, and appends a final "" to the list of strings. Here, I find each element of the list, using that "" as a stopper, and then call apop_data_rm_columns.*/ Apop_stopif(!d, return NULL, 1, "You're asking me to prune a NULL data set; returning."); Apop_stopif(!d->matrix, return d, 1, "You're asking me to prune a data set with NULL matrix; returning."); int rm_list[d->names->colct]; int keep_count = 0; char **name_step = colnames; //to throw errors for typos (and slight efficiency gains), I need an array of whether //each input colname has been used. while (*name_step++) keep_count++; int used_field[keep_count]; memset(used_field, 0, keep_count*sizeof(int)); for (int i=0; i< d->names->colct; i++){ int keep = 0; for (int j=0; j<keep_count; j++) if (!strcasecmp(d->names->col[i], colnames[j])){ keep ++; used_field[j]++; break; } rm_list[i] = !keep; } apop_data_rm_columns(d, rm_list); for (int j=0; jrowname==NULL, default is zero. \param col The column number of the desired element. -1 indicates the vector. If colname==NULL, default is zero. \param rowname The row name of the desired element. If NULL, use the row number. 
\param colname The column name of the desired element. If NULL, use the column number. \param page The case-insensitive name of the page on which the element is found. If \c NULL, use first page. \return A pointer to the element. */ #ifdef APOP_NO_VARIADIC double * apop_data_ptr(apop_data *data, int row, int col, const char *rowname, const char *colname, const char *page){ #else apop_varad_head(double *, apop_data_ptr){ apop_data * apop_varad_var(data, NULL); Apop_stopif(!data, return NULL, 0, "You sent me a NULL data set. Returning NULL pointer."); int apop_varad_var(row, 0); int apop_varad_var(col, 0); const char * apop_varad_var(rowname, NULL); const char * apop_varad_var(colname, NULL); const char * apop_varad_var(page, NULL); if (page){ data = apop_data_get_page(data, page); Apop_stopif(!data, return NULL, 1, "I couldn't find a page with label '%s'. Returning NULL.", page); }; if (rowname){ row = apop_name_find(data->names, rowname, 'r'); Apop_stopif(row == -2, return NULL, 1, "Couldn't find '%s' amongst the row names.", rowname); } if (colname){ col = apop_name_find(data->names, colname, 'c'); Apop_stopif(col == -2, return NULL, 1, "Couldn't find '%s' amongst the column names.", colname); } return apop_data_ptr_base(data, row, col, rowname, colname, page); } double * apop_data_ptr_base(apop_data *data, int row, int col, const char *rowname, const char *colname, const char *page){ #endif if (col == -1 || (col == 0 && !data->matrix && data->vector)){ Apop_stopif(!data->vector, return NULL, 1, "You asked for the vector element (col=-1) but it is NULL. Returning NULL."); return gsl_vector_ptr(data->vector, row); } else { Apop_stopif(!data->matrix, return NULL, 1, "You asked for the matrix element (%i, %i) but the matrix is NULL. Returning NULL.", row, col); return gsl_matrix_ptr(data->matrix, row,col); } return NULL;//the main function is blank. } /** Returns the data element at the given point.
In case of error (probably that you asked for a data point out of bounds), returns \c NAN. See \ref data_set_get "the set/get page" for details and examples. \param data The data set. Must not be \c NULL. \param row The row number of the desired element. If rowname==NULL, default is zero. \param col The column number of the desired element. -1 indicates the vector. If colname==NULL, default is zero if the ->matrix element is not \c NULL and -1 if the ->matrix element is \c NULL and the ->vector element is not. \param rowname The row name of the desired element. If NULL, use the row number. \param colname The column name of the desired element. If NULL, use the column number. \param page The case-insensitive name of the page on which the element is found. If \c NULL, use first page. \return The value at the given location. */ #ifdef APOP_NO_VARIADIC double apop_data_get(const apop_data *data, size_t row, int col, const char *rowname, const char *colname, const char *page){ #else apop_varad_head(double, apop_data_get){ const apop_data * apop_varad_var(data, NULL); Apop_stopif(!data, return NAN, 0, "You sent me a NULL data set. Returning NaN."); size_t apop_varad_var(row, 0); int apop_varad_var(col, 0); const char * apop_varad_var(rowname, NULL); const char * apop_varad_var(colname, NULL); const char * apop_varad_var(page, NULL); if (page){ data = apop_data_get_page(data, page); Apop_stopif(!data, return NAN, 1, "I couldn't find a page with label '%s'. Returning NaN.", page); }; if (rowname){ row = apop_name_find(data->names, rowname, 'r'); Apop_stopif(row == -2, return NAN, 1, "Couldn't find '%s' amongst the row names. Returning NaN.", rowname); } if (colname){ col = apop_name_find(data->names, colname, 'c'); Apop_stopif(col == -2, return NAN, 1, "Couldn't find '%s' amongst the column names. 
Returning NaN.", colname); } return apop_data_get_base(data, row, col, rowname, colname, page); } double apop_data_get_base(const apop_data *data, size_t row, int col, const char *rowname, const char *colname, const char *page){ #endif if (col==-1 || (col == 0 && !data->matrix && data->vector)){ Apop_stopif(!data->vector, return NAN, 1, "You asked for the vector element (col=-1) but it is NULL."); return gsl_vector_get(data->vector, row); } else { Apop_stopif(!data->matrix, return NAN, 1, "You asked for the matrix element (%zu, %i) but the matrix is NULL.", row, col); return gsl_matrix_get(data->matrix, row, col); } } /* The only hint the GSL gives that something failed is that the error-handler is called. The error handling function won't let you set an output to the function. So all we can do is use a global variable. */ static threadlocal int error_for_set; //see apop_internal.h void apop_gsl_error_for_set(const char *reason, const char *file, int line, int gsl_errno){ Apop_notify(1, "%s: %s", file, reason); Apop_maybe_abort(1); error_for_set = -1; } /** Set a data element. See \ref data_set_get "the set/get page" for details and examples. \return 0=OK, -1=error: couldn't find row/column name, or you asked for a location outside the vector/matrix bounds. \li The error codes for out-of-bounds errors are thread-safe iff you have a C11-compliant compiler (thanks to the \c _Thread_local keyword) or a version of GCC with the \c __thread extension enabled. \li Set weights via gsl_vector_set(your_data->weights, row, val);. \li Set text elements via \ref apop_text_set. \param data The data set. Must not be \c NULL. \param row The row number of the desired element. If rowname==NULL, default is zero. \param col The column number of the desired element. -1 indicates the vector. If colname==NULL, default is zero. \param rowname The row name of the desired element. If NULL, use the row number. \param colname The column name of the desired element.
If NULL, use the column number. \param page The case-insensitive name of the page on which the element is found. If \c NULL, use first page. \param val The value to give the point. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC int apop_data_set(apop_data *data, size_t row, int col, const double val, const char *colname, const char *rowname, const char *page){ #else apop_varad_head(int, apop_data_set){ apop_data * apop_varad_var(data, NULL); Apop_stopif(!data, return -1, 0, "You sent me a NULL data set."); size_t apop_varad_var(row, 0); int apop_varad_var(col, 0); const double apop_varad_var(val, 0); const char * apop_varad_var(rowname, NULL); const char * apop_varad_var(colname, NULL); const char * apop_varad_var(page, NULL); if (page){ data = apop_data_get_page((apop_data*)data, page); Apop_stopif(!data, return -1, 1, "I couldn't find a page with label '%s'. Making no changes.", page); } if (rowname){ row = apop_name_find(data->names, rowname, 'r'); Apop_stopif(row == -2, return -1, 1, "Couldn't find '%s' amongst the row names. Making no changes.", rowname); } if (colname){ col = apop_name_find(data->names, colname, 'c'); Apop_stopif(col == -2, return -1, 1, "Couldn't find '%s' amongst the column names.
Making no changes.", colname); } return apop_data_set_base(data, row, col, val, colname, rowname, page); } int apop_data_set_base(apop_data *data, size_t row, int col, const double val, const char *colname, const char *rowname, const char *page){ #endif Set_gsl_handler if (col==-1 || (col == 0 && !data->matrix && data->vector)){ Apop_stopif(!data->vector, return -1, 1, "You're trying to set a vector element (col=-1) but the vector is NULL."); gsl_vector_set(data->vector, row, val); } else { Apop_stopif(!data->matrix, return -1, 1, "You're trying to set the matrix element (%zu, %i) but the matrix is NULL.", row, col); gsl_matrix_set(data->matrix, row, col, val); } Unset_gsl_handler return error_for_set; } /** A convenience function to add a named element to a data set. Many of Apophenia's testing procedures use this to easily produce a column of named parameters. It is public as a convenience. \param d The \ref apop_data structure. Must not be \c NULL, but may be blank (as per allocation via \ref apop_data_alloc ( ) ). \param name The name to add \param val the value to add to the set. \li I use the position of the last non-empty row name to know where to put the value. If there are two names in the data set, then I will put the new name in the third name slot and the data in the third slot in the vector. If you use this function from start to finish in building your list, then you'll be fine. \li If the vector is too short (or \c NULL), I will call \ref apop_vector_realloc internally to make space. \li This fits well with the defaults for \ref apop_data_get.
An example: \code apop_data *list = apop_data_alloc(); apop_data_add_named_elmt(list, "height", 165); apop_data_add_named_elmt(list, "weight", 60); double height = apop_data_get(list, .rowname="height"); //or #define Lookup(dataset, key) apop_data_get(dataset, .rowname=#key) height = Lookup(list, height); \endcode */ void apop_data_add_named_elmt(apop_data *d, char *name, double val){ Apop_stopif(!d, return, 0, "You sent me a NULL apop_data set. " "Maybe allocate with apop_data_alloc() to start."); apop_name_add(d->names, name, 'r'); if (!d->vector) d->vector = gsl_vector_alloc(1); if (d->vector->size < d->names->rowct) apop_vector_realloc(d->vector, d->names->rowct); gsl_vector_set(d->vector, d->names->rowct-1, val); } //See apop_data_add_names in types.h. void apop_data_add_names_base(apop_data *d, const char type, char const ** names){ if (!d->names) d->names = apop_name_alloc(); for(char const** name = names; *name !=NULL; name++) apop_name_add(d->names, *name, type); } /** Add a string to the text element of an \ref apop_data set. If you send me a \c NULL string, I will write the value of apop_opts.nan_string in the given slot. If there is already something in that slot, that string is freed, preventing memory leaks. \param in The \ref apop_data set, that already has an allocated \c text element. \param row The row \param col The column \param fmt The text to write. \param ... You can use a printf-style fmt and follow it with the usual variables to fill in. \return 0=OK, -1=error (probably out-of-bounds) \li UTF-8 or ASCII text is correctly handled. \li Apophenia follows a general rule of not reallocating behind your back: if your text matrix is currently of size (3,3) and you try to put an item in slot (4,4), then I display an error rather than reallocating the text matrix. \li The string added is a copy (via asprintf), not a pointer to the input(s). \li If there had been a string at the grid point you are writing to, the old one is freed to prevent leaks. 
Remember this if you had other pointers aliasing that string. \li If an element is \c NULL, write apop_opts.nan_string at that point. You may prefer to use "" to express a blank. \li \ref apop_text_alloc will reallocate to a new size if you need. For example, this code will fill the diagonals of the text array with a message, resizing as it goes: \code apop_data *list = (something already allocated.); for (int n=0; n < 10; n++){ apop_text_alloc(list, n+1, n+1); apop_text_set(list, n, n, "This is cell (%i, %i)", n, n); } \endcode */ int apop_text_set(apop_data *in, const size_t row, const size_t col, const char *fmt, ...){ Apop_stopif(!in, return -1, 0, "You asked me to write text to a NULL data set."); Apop_stopif((in->textsize[0] < (int)row+1) || (in->textsize[1] < (int)col+1), return -1, 0, "You asked me to put the text " " '%s' at position (%zu, %zu), but the text array has size (%zu, %zu)\n", fmt, row, col, in->textsize[0], in->textsize[1]); if (in->text[row][col] != apop_nul_string) free(in->text[row][col]); if (!fmt){ Asprintf(&(in->text[row][col]), "%s", apop_opts.nan_string); return 0; } va_list argp; va_start(argp, fmt); Apop_stopif(vasprintf(&(in->text[row][col]), fmt, argp)==-1, , 0, "Trouble writing to a string."); va_end(argp); return 0; } /** This allocates or resizes the \c text element of an \ref apop_data set. If the \c text element already exists, then this is effectively a \c realloc function, reshaping to the size you specify. \param in An \ref apop_data set. It's OK to send in \c NULL, in which case an apop_data set with \c NULL \c matrix and \c vector elements is returned. \param row the number of rows of text. \param col the number of columns of text. \return A pointer to the relevant \ref apop_data set. If the input was not \c NULL, then this is a repeat of the input pointer. \exception out->error=='a' Allocation error. 
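
\li The following is a minimal stand-alone sketch of the idea behind the text grid, not Apophenia's own code: every blank cell points to one shared sentinel string, so any resizing or freeing step must skip the sentinel rather than free it. The names \c blank_cell (standing in for \c apop_nul_string), \c grid_alloc, \c grid_set, and \c grid_free are hypothetical and exist only for this illustration.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for apop_nul_string: one shared buffer marks every blank cell.
static char blank_cell[] = "";

// Allocate a rows x cols grid in which every cell points at the shared sentinel.
// Returns NULL on allocation failure or if either dimension is zero.
char ***grid_alloc(size_t rows, size_t cols){
    if (!rows || !cols) return NULL;
    char ***grid = malloc(sizeof(char**) * rows);
    if (!grid) return NULL;
    for (size_t i=0; i < rows; i++){
        grid[i] = malloc(sizeof(char*) * cols);
        if (!grid[i]){                  // undo the partial allocation
            while (i--) free(grid[i]);
            free(grid);
            return NULL;
        }
        for (size_t j=0; j < cols; j++) grid[i][j] = blank_cell;
    }
    return grid;
}

// Replace one cell with a private copy of str, freeing any earlier private copy.
// Returns 0 on success, -1 if the copy could not be allocated.
int grid_set(char ***grid, size_t row, size_t col, const char *str){
    char *copy = malloc(strlen(str) + 1);
    if (!copy) return -1;
    strcpy(copy, str);
    if (grid[row][col] != blank_cell) free(grid[row][col]);
    grid[row][col] = copy;
    return 0;
}

// Free the grid, skipping the sentinel, which was never allocated on the heap.
void grid_free(char ***grid, size_t rows, size_t cols){
    if (!grid) return;
    for (size_t i=0; i < rows; i++){
        for (size_t j=0; j < cols; j++)
            if (grid[i][j] != blank_cell) free(grid[i][j]);
        free(grid[i]);
    }
    free(grid);
}
```

Comparing pointers against the sentinel, rather than comparing string contents, is what lets a caller store a genuinely empty string of its own in a cell without it being mistaken for a blank.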
*/ apop_data * apop_text_alloc(apop_data *in, const size_t row, const size_t col){ Apop_stopif((!row && col) || (!col && row), return in, 1, "Not allocating a %zu x %zu text grid. " "Returning the input apop_data set.", row, col); if (!in) in = apop_data_alloc(); if (!in->text){ if (row){ in->text = malloc(sizeof(char**) * row); Apop_stopif(!in->text, in->error='a'; return in, 0, "malloc failed setting up %zu rows. Probably out of memory.", row); } if (row && col) for (size_t i=0; i< row; i++){ in->text[i] = malloc(sizeof(char*) * col); Apop_stopif(!in->text[i], in->error='a'; return in, 0, "malloc failed setting up row %zu (with %zu columns). Probably out of memory.", i, col); for (size_t j=0; j< col; j++) in->text[i][j] = apop_nul_string; } } else { //realloc size_t rows_now = in->textsize[0]; size_t cols_now = in->textsize[1]; if (rows_now > row){ for (int i=row; i < rows_now; i++){ for (int j=0; j < cols_now; j++) if (in->text[i][j] != apop_nul_string) free(in->text[i][j]); free(in->text[i]); } in->text = realloc(in->text, sizeof(char**)*row); Apop_stopif(row && !in->text, in->error='a'; return in, 0, "realloc failed shrinking down to %zu rows from %zu rows. " "There may be actual bugs eating your computer.", row, rows_now); } if (rows_now < row){ in->text = realloc(in->text, sizeof(char**)*row); Apop_stopif(!in->text, in->error='a'; return in, 0, "realloc failed setting up %zu rows. Probably out of memory.", row); for (size_t i=rows_now; i < row; i++){ in->text[i] = malloc(sizeof(char*) * col); Apop_stopif(!in->text[i], in->error='a'; return in, 0, "malloc failed setting up row %zu (with %zu columns). 
Probably out of memory.", i, col); for (int j=0; j < cols_now; j++) in->text[i][j] = apop_nul_string; } } if (cols_now > col) for (int i=0; i < row; i++) for (int j=col; j < cols_now; j++) if (in->text[i][j]!=apop_nul_string) free(in->text[i][j]); if (cols_now != col) for (int i=0; i < row; i++){ in->text[i] = realloc(in->text[i], sizeof(char*)*col); for (int j=cols_now; j < col; j++) //happens iff cols_now < col in->text[i][j] = apop_nul_string; } } in->textsize[0] = row; in->textsize[1] = col; return in; } /** Transpose the matrix and text elements of the input data set, including the row/column names. The vector and weights elements of the input data set are completely ignored (but see also \ref apop_vector_to_matrix, which can convert a vector to a 1 X N matrix.) If copying, these other elements won't be present; if .inplace='y', it is up to you to handle these not-transposed elements correctly. \param in The input \ref apop_data set. If \c NULL, I return \c NULL. (default: \c NULL) \param transpose_text If \c 'y', then also transpose the text element. (default: \c 'y') \param inplace If \c 'y', transpose the input in place; if \c 'n', produce a transposed copy, leaving the original untouched. Due to how gsl_matrix_transpose_memcpy works, a copy will still be made, then copied to the original location. (default: \c 'y') \return If inplace=='n', a newly alloced \ref apop_data set, with the appropriately transposed matrix and/or text. The vector and weights elements will be \c NULL. If transpose_text='n', then the text element of the output set will also be \c NULL.
if inplace=='y', a pointer to the original data set, with matrix and (if transpose_text='y', text) transposed and vector and weights left in place untouched. \li Row names are written to column names of the output matrix, text, or both (whichever is not empty in the input). \li If only the matrix or only the text have names, then the one set of names is written to the row names of the output. \li If both matrix column names and text column names are present, text column names are lost. \li if you have a \c gsl_matrix with no names or text, you may prefer to use \c gsl_matrix_transpose_memcpy. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC apop_data * apop_data_transpose(apop_data *in, char transpose_text, char inplace){ #else apop_varad_head(apop_data *, apop_data_transpose){ apop_data * apop_varad_var(in, NULL); Apop_stopif(!in, return NULL, 1, "Transposing a NULL data set; returning NULL."); char apop_varad_var(transpose_text, 'y'); char apop_varad_var(inplace, 'y'); return apop_data_transpose_base(in, transpose_text, inplace); } apop_data * apop_data_transpose_base(apop_data *in, char transpose_text, char inplace){ #endif Apop_stopif(!in->matrix && !*in->textsize, return apop_data_alloc(), 1, "input data set has neither matrix nor text elements; returning an empty data set."); apop_data *out = (inplace=='y') ? in : apop_data_alloc(0, in->matrix ? in->matrix->size2 : 0 , in->matrix ? 
in->matrix->size1 : 0); if (inplace=='y'){ if (in->matrix) { if (in->matrix->size1 == in->matrix->size2) gsl_matrix_transpose(in->matrix); else { gsl_matrix *outm = gsl_matrix_alloc(in->matrix->size2, in->matrix->size1); gsl_matrix_transpose_memcpy(outm, in->matrix); gsl_matrix_free(in->matrix); in->matrix = outm; } } if (out->names){ char **tmp = out->names->col; out->names->col = out->names->row; out->names->row = tmp; int tmpct = out->names->colct; out->names->colct = out->names->rowct; out->names->rowct = tmpct; } } else if (inplace!='y' && in->matrix){ if (in->matrix) gsl_matrix_transpose_memcpy(out->matrix, in->matrix); apop_name_stack(out->names, in->names, 'r', 'c'); apop_name_stack(out->names, in->names, 'c', 'r'); } if (transpose_text!='y' || in->textsize[0] == 0 || in->textsize[1] == 0) return out; if (inplace=='y'){ size_t orows = in->textsize[0]; size_t ocols = in->textsize[1]; if (orows > ocols){ //extend the first ocols rows to their now-longer length for (size_t i=0; i< ocols; i++){ in->text[i] = realloc(in->text[i], sizeof(char*)*orows); Apop_stopif(!in->text[i], in->error='a'; return in, 0, "malloc failed setting up row %zu (with %zu columns). Probably out of memory.", i, orows); for (int j=ocols; j < orows; j++) in->text[i][j] = in->text[j][i] == apop_nul_string ? apop_nul_string : strdup(in->text[j][i]); } } if (ocols > orows){ //add rows. in->text = realloc(in->text, sizeof(char**)*ocols); Apop_stopif(!in->text, in->error='a'; return in, 0, "realloc failed setting up %zu rows. Probably out of memory.", ocols); for (size_t i=orows; i < ocols; i++){ in->text[i] = malloc(sizeof(char*) * orows); Apop_stopif(!in->text[i], in->error='a'; return in, 0, "malloc failed setting up row %zu (with %zu columns). Probably out of memory.", i, orows); for (int j=0; j < orows; j++) in->text[i][j] = in->text[j][i] == apop_nul_string ? 
apop_nul_string : strdup(in->text[j][i]); } } size_t squaresize = GSL_MIN(orows, ocols); for (int i=0; i< squaresize; i++) //now do the no-need-to-extend square for (int j=i+1; j< squaresize; j++){ char *tmp = in->text[i][j]; in->text[i][j] = in->text[j][i]; in->text[j][i] = tmp; } in->textsize[0] = ocols; in->textsize[1] = orows; } else { apop_text_alloc(out, in->textsize[1], in->textsize[0]); for (int r=0; r< in->textsize[0]; r++) for (int c=0; c< in->textsize[1]; c++) if (in->text[r][c] == apop_nul_string) apop_text_blank(out, c, r); else apop_text_set(out, c, r, in->text[r][c]); } if (in->names && in->names->textct && !in->names->colct) apop_name_stack(out->names, in->names, 't', 'r'); return out; } /** This function will resize a \c gsl_matrix to a new height or width. Data in the matrix will be retained. If the new height or width is smaller than the old, then data in the later rows/columns will be cropped away (in a non--memory-leaking manner). If the new height or width is larger than the old, then new cells will be filled with garbage; it is your responsibility to zero out or otherwise fill new rows/columns before use. \li A large number of reallocs can take a noticeable amount of time. You are encouraged to determine the size of your data beforehand and avoid writing \c for loops that reallocate the matrix at every iteration. \li The gsl_matrix is a versatile struct that can represent submatrices and other cuts from parent data. Resizing a subset of a parent matrix makes no sense, so return \c NULL and print a warning if asked to resize a view of a matrix. \param m The already-allocated matrix to resize. If you give me \c NULL, this becomes equivalent to \c gsl_matrix_alloc \param newheight, newwidth The height and width you'd like the matrix to be. \return m, now resized */ gsl_matrix * apop_matrix_realloc(gsl_matrix *m, size_t newheight, size_t newwidth){ if (!m) return (newheight && newwidth) ? 
gsl_matrix_alloc(newheight, newwidth) : NULL; size_t i, oldoffset=0, newoffset=0, realloced = 0; Apop_stopif(m->block->data!=m->data || !m->owner || m->tda != m->size2, return NULL, 0, "I can't resize submatrices or other subviews."); m->block->size = newheight * newwidth; if (m->size2 > newwidth) for (i=1; i< GSL_MIN(m->size1, newheight); i++){ oldoffset +=m->size2; newoffset +=newwidth; memmove(m->data+newoffset, m->data+oldoffset, sizeof(double)*newwidth); } else if (m->size2 < newwidth){ m->block->data = m->data = realloc(m->data, sizeof(double) * m->block->size); realloced = 1; int height = GSL_MIN(m->size1, newheight); for (i= height-1; i > 0; i--){ newoffset +=newwidth; memmove(m->data+(height * newwidth) - newoffset, m->data+i*m->size2, sizeof(double)*m->size2); } } m->size1 = newheight; m->tda = m->size2 = newwidth; if (!realloced) m->block->data = m->data = realloc(m->data, sizeof(double) * m->block->size); return m; } /** This function will resize a \c gsl_vector to a new length. Data in the vector will be retained. If the new height is smaller than the old, then data at the end of the vector will be cropped away (in a non--memory-leaking manner). If the new height is larger than the old, then new cells will be filled with garbage; it is your responsibility to zero out or otherwise fill them before use. \li A large number of reallocs can take a noticeable amount of time. You are thus encouraged to make an effort to determine the size of your data and do one allocation, rather than writing \c for loops that resize a vector at every increment. \li The gsl_vector is a versatile struct that can represent subvectors, matrix columns and other cuts from parent data. Resizing a portion of a parent matrix makes no sense, so return \c NULL and print an error if asked to resize a view. \param v The already-allocated vector to resize. If you give me \c NULL, this is equivalent to \c gsl_vector_alloc \param newheight The height you'd like the vector to be. 
\return v, now resized */ gsl_vector * apop_vector_realloc(gsl_vector *v, size_t newheight){ if (!v) return newheight ? gsl_vector_alloc(newheight) : NULL; Apop_stopif(v->block->data!=v->data || !v->owner || v->stride != 1, return NULL, 0, "I can't resize subvectors or other views."); v->block->size = newheight; v->size = newheight; v->block->data = v->data = realloc(v->data, sizeof(double) * v->block->size); return v; } /** It's good form to get a page from your data set by name, because you may not know the order for the pages, and the stepping through makes for dull code anyway (apop_data *page = dataset; while (page->more) page= page->more;). \param data The \ref apop_data set to use. No default; if \c NULL, gives a warning if apop_opts.verbose >=1 and returns \c NULL. \param title The name of the page to retrieve. Default=\c "", which is the name of the page of additional estimation information returned by estimation routines (log likelihood, status, AIC, BIC, confidence intervals, ...). \param match If \c 'c', case-insensitive match (via \c strcasecmp); if \c 'e', exact match, if \c 'r' regular expression substring search (via \ref apop_regex). Default=\c 'c'. \return The page whose title matches what you gave me. If I don't find a match, return \c NULL. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC apop_data * apop_data_get_page(const apop_data * data, const char *title, const char match){ #else apop_varad_head(apop_data *, apop_data_get_page){ const apop_data * apop_varad_var(data, NULL); Apop_stopif(!data, return NULL, 1, "You requested a page from a NULL data set. 
Returning NULL"); const char * apop_varad_var(title, ""); const char apop_varad_var(match, 'c'); Apop_stopif(match!='r' && match!='e' && match!='c', return NULL, 0, "match type needs to be 'r', 'e', or 'c'; you supplied %c.", match); return apop_data_get_page_base(data, title, match); } apop_data * apop_data_get_page_base(const apop_data * data, const char *title, const char match){ #endif while (data && (!data->names || !data->names->title || (match=='c' && strcasecmp(data->names->title, title)) || (match=='r' && !apop_regex(data->names->title, title)) || (match=='e' && strcmp(data->names->title, title)) )) data = data->more; return (apop_data *) data; //de-const. } /** Add a page to an \ref apop_data set. It gets a name so you can find it later. \param dataset The input data set, to which a page will be added. \param newpage The page to append. \param title The name of the new page. \return The new page. If apop_opts.verbose >=1, I post a warning when asked to append a \c NULL page or to append to a \c NULL data set. \li See \ref pps for further notes. */ apop_data * apop_data_add_page(apop_data * dataset, apop_data *newpage, const char *title){ Apop_stopif(!newpage, return NULL, 1, "You are adding a NULL page to a data set. Doing nothing; returning NULL."); if (!newpage->names) newpage->names = apop_name_alloc(); if (title && !(newpage->names->title == title)){//has title, but is not pointing to existing title free(newpage->names->title); Asprintf(&newpage->names->title, "%s", title); } Apop_stopif(!dataset, return newpage, 1, "You are adding a page to a NULL data set. Returning the new page as its own data set."); while (dataset->more) dataset = dataset->more; dataset->more = newpage; return newpage; } /** Remove the first page from an \ref apop_data set that matches a given name. \param data The input data set, from which a page will be removed. No default. If \c NULL, maybe print a warning (see below). \param title The case-insensitive name of the page to remove.
Default: \c "" \param free_p If \c 'y', then \ref apop_data_free the page. Default: \c 'y'. \return If not freed, a pointer to the \c apop_data page that I just pulled out. Thus, you can use this to pull a single page from a data set. I set that page's \c more pointer to \c NULL, to minimize any confusion about more-than-linear linked list topologies. If free_p=='y' (the default) or the page is not found, return \c NULL. \li I don't check the first page, so there's no concern that the head of your list of pages will move. Again, the intent of the ->more pointer in the \ref apop_data set is not to fully implement a linked list, but primarily to allow you to staple auxiliary information to a main data set. \li If I don't find the page you want, I return NULL, and maybe print a warning; see below. \li For the two above cases where a warning may be printed, if the page is to be returned and apop_opts.verbose >= 1 , print a warning. If the page is to be freed and apop_opts.verbose >= 2 , print a warning. \li The remaining \c more pointers in the \ref apop_data set are adjusted accordingly. */ #ifdef APOP_NO_VARIADIC apop_data* apop_data_rm_page(apop_data * data, const char *title, const char free_p){ #else apop_varad_head(apop_data*, apop_data_rm_page){ const char *apop_varad_var(title, ""); const char apop_varad_var(free_p, 'y'); apop_data *apop_varad_var(data, NULL); Apop_stopif(!data, return NULL, (free_p=='y'? 2: 1), "You are removing a " "page from a NULL data set.
Doing nothing."); return apop_data_rm_page_base(data, title, free_p); } apop_data* apop_data_rm_page_base(apop_data * data, const char *title, const char free_p){ #endif while (data->more && strcasecmp(data->more->names->title, title)) data = data->more; Apop_stopif(!data->more, return NULL, (free_p=='y'?2:1), "You asked me to " "remove '%s' but I couldn't find a page matching that.", title); if (data->more){ apop_data *tmp = data->more; data->more = data->more->more; tmp->more = NULL; if (free_p=='y'){ apop_data_free(tmp); return NULL; } //else: return tmp; } else return NULL; } typedef int (*apop_fn_ir)(apop_data*, void*); /** Remove the rows set to one in the \c drop vector or for which the \c do_drop function returns one. \param in the \ref apop_data structure to be pared down \param drop a vector with as many elements as the max of the vector, matrix, or text parts of \c in, with a one marking those rows to be removed. \param do_drop A function that returns one for rows to drop and zero for rows to not drop. A sample function: \code int your_drop_function(apop_data *onerow, void *extra_param){ return gsl_isnan(apop_data_get(onerow)) || !strcmp(onerow->text[0][0], "Uninteresting data point"); } \endcode \ref apop_data_rm_rows will use \ref Apop_r to get a subview of the input data set of height one, and send that subview to this function (and since arguments typically default to zero, you don't have to write out things like \ref apop_data_get (onerow, .row=0, .col=0), which can help to keep things readable). \param drop_parameter If your \c do_drop function requires additional input, put it here and it will be passed through. \return Returns a pointer to the input data set, now pruned. \li If all the rows are to be removed, then you will wind up with the same \ref apop_data set, with \c NULL \c vector, \c matrix, \c weights, and text. Therefore, you may wish to check for \c NULL elements after use.
I remove rownames, but leave the other names, in case you want to add new data rows. \li The typical use is to provide only a list or only a function. If both are \c NULL, I return without doing anything, and print a warning if apop_opts.verbose >=2. If you provide both, I will drop the row if either the vector has a nonzero value in that row's position, or if the function returns a nonzero value. \li This function uses the \ref designated syntax for inputs. \see \ref apop_data_listwise_delete, \ref apop_data_rm_columns */ #ifdef APOP_NO_VARIADIC apop_data* apop_data_rm_rows(apop_data *in, int *drop, apop_fn_ir do_drop, void *drop_parameter ){ #else apop_varad_head(apop_data*, apop_data_rm_rows){ apop_data* apop_varad_var(in, NULL); Apop_stopif(!in, return in, 2, "Input data set was NULL; no changes made."); int* apop_varad_var(drop, NULL); apop_fn_ir apop_varad_var(do_drop, NULL); void* apop_varad_var(drop_parameter, NULL); Apop_stopif(!drop && !do_drop, return in, 0, "You gave me neither a list of ints " "indicating which rows to drop, nor a do_drop function I can use to test " "each row. Returning with no changes made."); return apop_data_rm_rows_base(in, drop, do_drop, drop_parameter); } apop_data* apop_data_rm_rows_base(apop_data *in, int *drop, apop_fn_ir do_drop, void *drop_parameter ){ #endif //First, shift each kept row up into the next free slot. int outlength = 0; Get_vmsizes(in); //vsize, msize1, maxsize for (int i=0 ; i < maxsize; i++){ int drop_row=0; if (drop && drop[i]) drop_row = 1; else if (do_drop){ drop_row = do_drop(Apop_r(in, i), drop_parameter); } if (!drop_row){ if (outlength != i) apop_data_memcpy(Apop_r(in, outlength), Apop_r(in, i)); outlength++; } } if (!outlength){ gsl_vector_free(in->vector); in->vector = NULL; gsl_vector_free(in->weights); in->weights = NULL; gsl_matrix_free(in->matrix); in->matrix = NULL; apop_text_alloc(in, 0, 0); //leave colnames intact, remove rownames below.
} //now trim excess memory: if (in->vector) apop_vector_realloc(in->vector, GSL_MIN(in->vector->size, outlength)); if (in->weights) apop_vector_realloc(in->weights, GSL_MIN(in->weights->size, outlength)); if (in->matrix) apop_matrix_realloc(in->matrix, GSL_MIN(in->matrix->size1, outlength), in->matrix->size2); if (in->text) apop_text_alloc(in, GSL_MIN(outlength, in->textsize[0]), in->textsize[1]); if (in->names && in->names->rowct > outlength){ for (int k=outlength; k< in->names->rowct; k++) free(in->names->row[k]); in->names->rowct = outlength; } return in; } apophenia-1.0+ds/apop_db.c000066400000000000000000000703761262736346100155170ustar00rootroot00000000000000 /** \file apop_db.c An easy front end to SQLite. Includes a few nice features like a variance, skew, and kurtosis aggregator for SQL. */ /* Copyright (c) 2006--2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" /** Here are where the options are initially set. See the \ref apop_opts_type documentation for details. \ingroup all_public */ apop_opts_type apop_opts = { .verbose=1, .output_delimiter ="\t", .input_delimiters = "|,\t", .db_name_column = "row_names", .nan_string = "NaN", .db_engine = '\0', .db_user = "\0", .db_pass = "\0", .stop_on_warning = 'n', .log_file = NULL, .rng_seed = 479901, .version = 1.0 }; #define ERRCHECK {Apop_stopif(err, return 1, 0, "%s: %s",query, err); } #define ERRCHECK_NR {Apop_stopif(err, return NULL, 0, "%s: %s",query, err); } #define ERRCHECK_SET_ERROR(outdata) {Apop_stopif(err, if (!(outdata)) (outdata)=apop_data_alloc(); (outdata)->error='q'; sqlite3_free(err); return outdata, 0, "%s: %s",query, err); } #include "apop_db_sqlite.c" // callback_t is defined here, btw. #ifdef HAVE_MYSQL //Let mysql have these. 
#undef VERSION
#undef PACKAGE
#undef PACKAGE_NAME
#undef PACKAGE_STRING
#undef PACKAGE_TARNAME
#undef PACKAGE_VERSION
#undef PACKAGE_BUGREPORT
#include "apop_db_mysql.c"
#endif

//if !apop_opts.db_engine, run this to assign a value.
static void get_db_type(){
    if (getenv("APOP_DB_ENGINE") &&
            (!strcasecmp(getenv("APOP_DB_ENGINE"), "mysql") || !strcasecmp(getenv("APOP_DB_ENGINE"), "mariadb")))
        apop_opts.db_engine = 'm';
    else
        apop_opts.db_engine = 's';
}

//This macro declares the query string and fills it from the printf part of the call.
#define Fillin(query, fmt)        \
    char *query;                  \
    va_list argp;                 \
    va_start(argp, fmt);          \
    Apop_stopif(vasprintf(&query, fmt, argp)==-1, , 0, "Trouble writing to a string."); \
    va_end(argp);                 \
    Apop_notify(2, "%s", query);

/** If you want to use a database on the hard drive instead of memory, then call this once and only once before using any other database utilities.

With SQLite, if you want a disposable database which you won't use after the program ends, don't bother with this function.

The trade-offs between an on-disk database and an in-memory db are as one would expect: memory is faster, but the database is destroyed when the program exits.

MySQL users: either set the environment variable APOP_DB_ENGINE=mysql or set \c apop_opts.db_engine = 'm'.

The Apophenia package assumes you are only using a single database at a time. You can use the SQL attach function to load other databases, or see this blog post for further suggestions and sample code.

When you are done doing your database manipulations, call \ref apop_db_close if writing to disk.

\param filename The name of a file on the hard drive on which to store the database. If NULL, then the database will be kept in memory (in which case, the other database functions will call this function for you and you don't need to bother).

\li See \ref sqlsec for more notes on using databases.

\return 0: everything OK
1: database did not open. */ int apop_db_open(char const *filename){ if (!apop_opts.db_engine) get_db_type(); if (!db) //check the environment. #ifdef HAVE_MYSQL if(!mysql_db) #endif if (apop_opts.db_engine == 'm') #ifdef HAVE_MYSQL return apop_mysql_db_open(filename); #else {Apop_stopif(1, return -1, 0, "Apophenia was compiled without mysql support.");} #endif return apop_sqlite_db_open(filename); } /** \cond doxy_ignore */ typedef struct { char const *name; int isthere; } tab_exists_t; /** \endcond */ static int tab_exists_callback(void *in, int argc, char **argv, char **whatever){ tab_exists_t *te = in; if (!strcmp(argv[argc-1], te->name)) te->isthere=1; return 0; } /** Check for the existence of a table, and maybe delete it. Recreating a table which already exists can cause errors, so it is good practice to check for existence first. Also, this is the stylish way to delete a table, since just calling "drop table" will give you an error if the table doesn't exist. \param name the table name (no default) \param remove 'd' ==>delete table so it can be recreated in main.
'n' ==>no action. Return result so program can continue. (default) \return 0 = table does not exist
1 = table was found, and if remove=='d', has been deleted -1 = processing error \li In the SQLite engine, this function considers table views to be tables. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC int apop_table_exists(char const *name, char remove){ #else apop_varad_head(int, apop_table_exists){ char const *apop_varad_var(name, NULL) Apop_stopif(!name, return -1, 0, "You gave me a NULL table name."); char apop_varad_var(remove, 'n') return apop_table_exists_base(name, remove); } int apop_table_exists_base(char const *name, char remove){ #endif if (!apop_opts.db_engine) get_db_type(); if (apop_opts.db_engine == 'm') #ifdef HAVE_MYSQL return apop_mysql_table_exists(name, remove); #else Apop_stopif(1, return -1, 0, "Apophenia was compiled without mysql support."); #endif char *err=NULL, *q2; tab_exists_t te = { .name = name }; tab_exists_t tev = { .name = name }; if (db==NULL) return 0; sqlite3_exec(db, "select name from sqlite_master where type='table'", tab_exists_callback, &te, &err); sqlite3_exec(db, "select name from sqlite_master where type='view'", tab_exists_callback, &tev, &err); char query[]="Selecting names from sqlite_master";//for ERRCHECK. ERRCHECK if ((remove==1|| remove=='d') && (te.isthere||tev.isthere)){ if (te.isthere) Asprintf(&q2, "drop table %s;", name); else Asprintf(&q2, "drop view %s;", name); sqlite3_exec(db, q2, NULL, NULL, &err); free(q2); ERRCHECK } return (te.isthere||tev.isthere); } /** Closes the database on disk. If you opened the database with \c apop_db_open(NULL), then this is basically optional. \param vacuum 'v': vacuum---do clean-up to minimize the size of the database on disk.
'q': Don't bother; just close the database. (default = 'q') \return 0 on OK, nonzero on error. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC int apop_db_close(char vacuum){ #else apop_varad_head(int, apop_db_close){ char apop_varad_var(vacuum, 'q') return apop_db_close_base(vacuum); } int apop_db_close_base(char vacuum){ #endif if (apop_opts.db_engine == 'm') //assume this is set by now... #ifdef HAVE_MYSQL {apop_mysql_db_close(0); return 0;} #else {Apop_stopif(1, return -1, 0, "Apophenia was compiled without mysql support.");} #endif else { char *err, *query = "db close";//for errcheck. if (vacuum==1 || vacuum=='v') { sqlite3_exec(db, "VACUUM", NULL, NULL, &err); ERRCHECK } sqlite3_close(db); //ERRCHECK db = NULL; } return 0; } /** Send a query to the database that returns no data. \li As with functions like the \c apop_query_to_data, the query can include printf-style format specifiers, such as apop_query("create table %s(id, name, age);", tablename). \param fmt A printf-style SQL query. \return 0 on success, 1 on failure. */ int apop_query(const char *fmt, ...){ char *err=NULL; Fillin(query, fmt) if (!apop_opts.db_engine) get_db_type(); if (apop_opts.db_engine == 'm') #ifdef HAVE_MYSQL {Apop_stopif(!mysql_db, return 1, 0, "No mySQL database is open."); return apop_mysql_query(query);} #else Apop_stopif(1, return 1, 0, "Apophenia was compiled without mysql support."); #endif else {if (!db) apop_db_open(NULL); sqlite3_exec(db, query, NULL,NULL, &err); ERRCHECK } free(query); return 0; } /** Dump the results of a query into an array of strings. \return An \ref apop_data structure with the text element filled. \param fmt A printf-style SQL query. \exception out->error=='q' The database engine was unable to run the query (e.g., invalid SQL syntax). Again, a valid query that returns zero rows is not an error, and \c NULL is returned. \exception out->error=='d' Database error. 
\li If apop_opts.db_name_column matches a column of the output table, then that column is used for row names, and therefore will not be included in the text. \li query_output->text is always a 2-D array of strings, even if the query returns a single column. In that case, use returned_tab->text[i][0] (or equivalently, *returned_tab->text[i]) to refer to row i. \li If an element in the database is \c NULL, the corresponding cell in the output table will be filled with the text given by \c apop_opts.nan_string. The default is \c "NaN", but you can set apop_opts.nan_string = "whatever you like" to change the text to whatever you like. \li Returns \c NULL if your query is valid but returns zero rows. \li The query can include printf-style format specifiers, such as apop_query_to_text("select name from %s where id=%i;", tablename, id_number). For example, the following function will list the tables in an SQLite database (much like you could do from the command line using sqlite3 dbname.db ".table"). \include ls_tables.c */ apop_data * apop_query_to_text(const char * fmt, ...){ apop_data *out = NULL; Fillin(query, fmt) if (!apop_opts.db_engine) get_db_type(); if (apop_opts.db_engine == 'm'){ #ifdef HAVE_MYSQL out = apop_mysql_query_core(query, process_result_set_chars); #else Apop_stopif(1, apop_return_data_error('d'), 0, "Apophenia was compiled without mysql support."); #endif } else out = apop_sqlite_query_to_text(query); free(query); return out; } //apop_query_to_data callback. static int db_to_table(void *qinfo, int argc, char **argv, char **column){ Apop_stopif(!argv, return -1, apop_errorlevel, "Got NULL data from SQLite."); int i, ncfound = 0; callback_t *qi= qinfo; if (qi->firstcall){ qi->firstcall--; for(i=0; inamecol = i; ncfound = 1; break; } qi->outdata = argc-ncfound ? 
apop_data_alloc(1, argc-ncfound) : apop_data_alloc( ); for(i=0; inamecol != i) apop_name_add(qi->outdata->names, column[i], 'c'); } else if (qi->outdata->matrix) apop_matrix_realloc(qi->outdata->matrix, qi->currentrow+1, qi->outdata->matrix->size2); ncfound =0; for (int jj=0;jjnamecol){ double valor = !argv[jj] || !strcmp(argv[jj], "NULL")|| (apop_opts.nan_string && !strcasecmp(apop_opts.nan_string, argv[jj])) ? GSL_NAN : atof(argv[jj]); gsl_matrix_set(qi->outdata->matrix,qi->currentrow,jj-ncfound, valor); } else { apop_name_add(qi->outdata->names, argv[jj], 'r'); ncfound = 1; } (qi->currentrow)++; return 0; } /** Queries the database and dumps the result into an \ref apop_data set. \param fmt A printf-style SQL query. \return If no rows are returned, \c NULL; else an \ref apop_data set with the data in place. Most data will be in the \c matrix element of the output. Column names are appropriately placed. If \ref apop_opts_type "apop_opts.db_name_column" matches one of the fields in your query's output (default: \c row_names), then that column will be used for row names (and therefore will not appear in the \c matrix). \exception out->error=='q' Query error. A valid query that returns no rows is not an error; in that case, you get \c NULL. \li The query can include printf-style format specifiers, such as apop_query_to_data("select age from %s where id=%i;", tablename, id_number). \li Blanks in the database (i.e., NULLs) and elements that match \ref apop_opts_type "apop_opts.nan_string" are filled with NANs in the matrix. 
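
As a rough illustration of the NaN rule in the last item, here is how a single text cell from a query result could be converted to a number. This is only a sketch: \c cell_to_double is a hypothetical helper, not an Apophenia function, and its \c nan_string argument stands in for \c apop_opts.nan_string.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

// Hypothetical helper (not part of Apophenia): convert one text cell from a
// query result into a double. nan_string plays the role of
// apop_opts.nan_string and may be NULL to skip that check.
static double cell_to_double(char const *cell, char const *nan_string){
    if (!cell) return NAN;                  // a database NULL arrives as a NULL pointer
    if (!strcmp(cell, "NULL")) return NAN;  // the literal text NULL
    if (nan_string && !strcasecmp(nan_string, cell))
        return NAN;                         // case-insensitive match to the NaN marker
    return atof(cell);                      // text with no leading number becomes 0
}
```

Note that the comparison to the NaN marker ignores case, so with the default marker both "NaN" and "nan" map to NaN.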
*/
apop_data * apop_query_to_data(const char * fmt, ...){
    Fillin(query, fmt)
    if (!apop_opts.db_engine) get_db_type();
    if (apop_opts.db_engine == 'm')
#ifdef HAVE_MYSQL
        return apop_mysql_query_core(query, process_result_set_data);
#else
        Apop_stopif(1, apop_return_data_error('d'), 0, "Apophenia was compiled without mysql support.");
#endif

    //else
    char *err=NULL;
    callback_t qinfo = {.firstcall = 1, .namecol=-1};
    if (db==NULL) apop_db_open(NULL);
    sqlite3_exec(db, query,db_to_table,&qinfo, &err);
    //The error message prints the query, so report the error before freeing the query.
    //This mirrors ERRCHECK_SET_ERROR, but also frees the query on the failure path.
    Apop_stopif(err, if (!qinfo.outdata) qinfo.outdata=apop_data_alloc(); qinfo.outdata->error='q'; sqlite3_free(err); free(query); return qinfo.outdata, 0, "%s: %s",query, err);
    free (query);
    return qinfo.outdata;
}

/** \cond doxy_ignore */
//These used to do more, but I'll leave them as a macro anyway in case of future expansion.
#define Store_settings \
    int v = apop_opts.verbose; apop_opts.verbose=0;/*hack to prevent double-printing.*/

#define Restore_settings \
    apop_opts.verbose=v;
/** \endcond */

/** Queries the database and dumps the first column of the result into a \c gsl_vector.

\param fmt A printf-style SQL query.

\return A gsl_vector holding the first column of the returned matrix. Thus, if your query returns multiple columns, you will get no warning, and the function will return the first one.

\exception out->error=='q' Query error. A valid query that returns no rows is not an error; in that case, you get \c NULL.

\li Uses \ref apop_query_to_data internally, then throws away all but the first column of the matrix.
\li If \c apop_opts.db_name_column is set, then I'll ignore that column. It gets put into the names of the \ref apop_data set, and then thrown away when I look at only the \c gsl_matrix part of that set.
\li If the query returns zero rows of data or no columns, the function returns \c NULL.
\li The query can include printf-style format specifiers, such as apop_query_to_vector("select age from %s where id=%i;", tablename, id_number).
*/ gsl_vector * apop_query_to_vector(const char * fmt, ...){ Fillin(query, fmt) if (!apop_opts.db_engine) get_db_type(); if (apop_opts.db_engine == 'm') #ifdef HAVE_MYSQL return apop_mysql_query_core(query, process_result_set_vector); #else Apop_stopif(1, return NULL, 0, "Apophenia was compiled without mysql support."); #endif apop_data *d=NULL; gsl_vector *out; if (db==NULL) apop_db_open(NULL); Store_settings d = apop_query_to_data("%s", query); Restore_settings Apop_stopif(!d, return NULL, 2, "Query [%s] turned up a blank table. Returning NULL.", query); //else: out = gsl_vector_alloc(d->matrix->size1); gsl_matrix_get_col(out, d->matrix, 0); apop_data_free(d); free(query); return out; } /** Queries the database, and dumps the result into a single double-precision floating point number. \li This calls \ref apop_query_to_data and returns the (0,0)th element of the returned matrix. Thus, if your query returns multiple lines, you will get no warning, and the function will return the first in the list (which is not always well-defined; maybe use an order by clause in your query if you expect multiple lines). \li If \c apop_opts.db_name_column is set, then I'll ignore that column. It gets put into the names of the \ref apop_data set, and then thrown away when I look at only the \c gsl_matrix element of that set. \li If the query produces a blank table, returns \c NAN, and if apop_opts.verbose>=2, prints an error. \li The query can include printf-style format specifiers, such as apop_query_to_float("select age from %s where id=%i;", tablename, id_number). \li If the query produces an error, returns \c NAN, and if apop_opts.verbose>=0, prints an error. If you need to distinguish between blank tables, NaNs in the data, and query errors, use \ref apop_query_to_data. \param fmt A printf-style SQL query. \return A \c double, actually. 
*/ double apop_query_to_float(const char * fmt, ...){ double out; Fillin(query, fmt) if (!apop_opts.db_engine) get_db_type(); if (apop_opts.db_engine == 'm'){ #ifdef HAVE_MYSQL out = apop_mysql_query_to_float(query); #else Apop_stopif(1, return NAN, 0, "Apophenia was compiled without mysql support."); #endif } else { apop_data *d=NULL; if (db==NULL) apop_db_open(NULL); Store_settings d = apop_query_to_data("%s", query); Restore_settings Apop_stopif(!d, return GSL_NAN, 2, "Query [%s] turned up a blank table. Returning NaN.", query); Apop_stopif(d->error, return GSL_NAN, 0, "Query [%s] failed. Returning NaN.", query); out = apop_data_get(d); apop_data_free(d); } free(query); return out; } /** Query data to an \c apop_data set, but a mix of names, vectors, matrix elements, and text. If you are querying to a matrix and maybe a name, use \c apop_query_to_data (and set \ref apop_opts_type "apop_opts.db_name_column" if desired). If querying only text, use \ref apop_query_to_text. But if your data is a mix of text and numbers, use this. The first argument is a character string consisting of the letters \c nvmtw, one for each column of the SQL output, indicating whether the column is a name, vector, matrix column, text column, or weight vector. You can have only one \c n, one \c v, and one \c w. If the query produces more columns than there are elements in the column specification, then the remainder are dumped into the text section. If there are fewer columns produced than given in the spec, the additional elements will be allocated but not filled (i.e., they are uninitialized and will have garbage). \param typelist A string consisting of the letters \c nvmtw. For example, if your query columns should go into a text column, the vector, the weights, and two matrix columns, this would be "tvwmm". \param fmt A printf-style SQL query. \exception out->error=='d' Dimension error. Your count of matrix parts didn't match what the query returned. 
\exception out->error=='q' Query error. A valid query that returns no rows is not an error; in that case, you get \c NULL.

\li \ref apop_opts_type "apop_opts.db_name_column" is ignored. Use the \c 'n' character to indicate the output column with row names.
\li As with the other \c apop_query_to_... functions, the query can include printf-style format specifiers, such as apop_query_to_mixed_data("tv", "select name, age from
*/
apop_data * apop_query_to_mixed_data(const char *typelist, const char * fmt, ...){
    Fillin(query, fmt)
    if (!apop_opts.db_engine) get_db_type();
    if (apop_opts.db_engine == 'm')
#ifdef HAVE_MYSQL
        {apop_data* out = apop_mysql_mixed_query(typelist, query);
         free(query);
         return out;}
#else
        {Apop_notify(0, "Apophenia was compiled without mysql support.");
         return 0;}
#endif

    //else
    apop_data *out = apop_sqlite_multiquery(typelist, query);
    free(query);
    return out;
}

/* Convenience function for extending a string.
 asprintf(&q, "%s and stuff", q);
 gives you a memory leak, because the old string that q pointed to is never freed. This takes care of that.
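
 A rough sketch of the same idea in portable C99 follows. It is an
 illustration, not this file's code: extend_string is a hypothetical name, and
 it uses two passes of vsnprintf in place of the GNU vasprintf used below. The
 point is the ordering: build the new string first, because the old one may be
 among the arguments, and free the old one only afterward.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: replace *q with a newly formatted string, then free
// the old buffer. The old *q may safely appear among the arguments.
static void extend_string(char **q, char const *format, ...){
    char *old = *q;                      // keep the old buffer alive while formatting
    va_list ap, ap2;
    va_start(ap, format);
    va_copy(ap2, ap);                    // each vsnprintf pass needs its own list
    int len = vsnprintf(NULL, 0, format, ap);   // first pass: measure the output
    va_end(ap);
    char *fresh = (len >= 0) ? malloc((size_t)len + 1) : NULL;
    if (fresh) vsnprintf(fresh, (size_t)len + 1, format, ap2); // second pass: write
    va_end(ap2);
    free(old);                           // safe now; free(NULL) is a no-op
    *q = fresh;                          // NULL if formatting or allocation failed
}
```

 Usage would look like extend_string(&q, "%s and stuff", q); with q either
 NULL or a heap-allocated string.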
*/ void qxprintf(char **q, char *format, ...){ va_list ap; char *r = *q; va_start(ap, format); Apop_stopif(vasprintf(q, format, ap)==-1, , 0, "Trouble writing to a string."); va_end(ap); free(r); } static void add_a_number (char **q, char *comma, double v){ if (gsl_isnan(v)) qxprintf(q,"%s%c NULL ", *q, *comma); else if (isinf(v)==1) qxprintf(q,"%s%c 'inf'", *q, *comma); else if (isinf(v)==-1) qxprintf(q,"%s%c '-inf' ", *q, *comma); else qxprintf(q,"%s%c %g ",*q ,*comma, v); *comma = ','; } static int run_prepared_statements(apop_data const *set, sqlite3_stmt *p_stmt){ #if SQLITE_VERSION_NUMBER < 3003009 Apop_stopif(1, return -1, 0, "Attempting to use prepared statements, but using a version of SQLite that doesn't support them."); #else Get_vmsizes(set) //firstcol, msize1, maxsize for (size_t row=0; row < maxsize; row++){ size_t field =1; if (set->names && set->names->rowct>row){ if (!strlen(set->names->row[row])) field++; //leave NULL and cleared Apop_stopif(sqlite3_bind_text(p_stmt, field++, set->names->row[row], -1, SQLITE_TRANSIENT), return -1, apop_errorlevel, "Something wrong with the row name for line %zu, [%s].\n" , row, set->names->row[row]); } if (set->vector && set->vector->size > row) Apop_stopif(sqlite3_bind_double(p_stmt, field++, apop_data_get(set, row, -1)), return -1, apop_errorlevel, "Something wrong with the vector element on line %zu, [%g].\n" ,row, apop_data_get(set, row, -1)); if (msize1 > row) for (size_t col=0; col < msize2; col++) Apop_stopif(sqlite3_bind_double(p_stmt, field++, apop_data_get(set, row, col)), return -1, apop_errorlevel, "Something wrong with the matrix element %zu on line %zu, [%g].\n" ,col, row, apop_data_get(set, row, col)); if (*set->textsize > row) for (size_t col=0; col < set->textsize[1]; col++){ if (!strlen(set->text[row][col]) || (apop_opts.nan_string && !strcasecmp(apop_opts.nan_string, set->text[row][col]))) {field++; continue;} //leave NULL and cleared Apop_stopif(sqlite3_bind_text(p_stmt, field++, 
set->text[row][col], -1, SQLITE_TRANSIENT), return -1, apop_errorlevel, "Something wrong with a text element at row %zu, col %zu [%s].\n" , row, col, set->text[row][col]); } if (set->weights && set->weights->size > row) Apop_stopif(sqlite3_bind_double(p_stmt, field++, gsl_vector_get(set->weights, row)), return -1, apop_errorlevel, "Something wrong with the weight element on line %zu, [%g].\n" ,row, gsl_vector_get(set->weights, row)); int err = sqlite3_step(p_stmt); Apop_stopif(err!=0 && err != 101 //0=ok, 101=done , , 0, "prepared sqlite insert query gave error code %i.\n", err); Apop_stopif(sqlite3_reset(p_stmt), return -1, apop_errorlevel, "SQLite error."); Apop_stopif(sqlite3_clear_bindings(p_stmt), return -1, apop_errorlevel, "SQLite error."); //needed for NULLs } Apop_stopif(sqlite3_finalize(p_stmt)!=SQLITE_OK, return -1, apop_errorlevel, "SQLite error."); return 0; #endif } //users are expected to call apop_data_print. int apop_data_to_db(const apop_data *set, const char *tabname, const char output_append){ Apop_stopif(!set, return -1, 1, "you sent me a NULL data set. 
Database table %s will not be created.", tabname); int i,j; char *q; char comma = ' '; int use_row = (apop_opts.db_name_column && strlen(apop_opts.db_name_column)) && set->names && ((set->matrix && set->names->rowct == set->matrix->size1) || (set->vector && set->names->rowct == set->vector->size)); if (!apop_opts.db_engine) get_db_type(); if (apop_table_exists(tabname)) Asprintf(&q, " "); else if (apop_opts.db_engine == 'm') #ifdef HAVE_MYSQL if (((output_append =='a' || output_append =='A') && apop_table_exists(tabname))) Asprintf(&q, " "); else { Asprintf(&q, "create table %s (", tabname); if (use_row) { qxprintf(&q, "%s\n %s varchar(1000)", q, apop_opts.db_name_column); comma = ','; } if (set->vector){ if(!set->names || !set->names->vector) qxprintf(&q, "%s%c\n vector double ", q, comma); else qxprintf(&q, "%s%c\n %s double ", q,comma, set->names->vector); comma = ','; } if (set->matrix) for(i=0;i< set->matrix->size2; i++){ if(!set->names || set->names->colct <= i) qxprintf(&q, "%s%c\n c%i double ", q, comma,i); else qxprintf(&q, "%s%c\n %s double ", q, comma, set->names->col[i]); comma = ','; } for(i=0;i< set->textsize[1]; i++){ if (!set->names || set->names->textct <= i) qxprintf(&q, "%s%c\n tc%i varchar(1000) ", q, comma,i); else qxprintf(&q, "%s%c\n %s varchar(1000) ", q, comma, set->names->text[i]); comma = ','; } apop_query("%s); ", q); sprintf(q, " "); } #else Apop_stopif(1, return -1, apop_errorlevel, "Apophenia was compiled without mysql support."); #endif else { if (db==NULL) apop_db_open(NULL); if (((output_append =='a' || output_append =='A') && apop_table_exists(tabname)) ) Asprintf(&q, " "); else { Asprintf(&q, "create table %s (", tabname); if (use_row) { qxprintf(&q, "%s\n %s", q, apop_opts.db_name_column); comma = ','; } if (set->vector){ if (!set->names || !set->names->vector) qxprintf(&q, "%s%c\n vector numeric", q, comma); else qxprintf(&q, "%s%c\n \"%s\"", q, comma, set->names->vector); comma = ','; } if (set->matrix) for(i=0;i< 
set->matrix->size2; i++){ if(!set->names || set->names->colct <= i) qxprintf(&q, "%s%c\n c%i numeric", q, comma,i); else qxprintf(&q, "%s%c\n \"%s\" numeric", q, comma, set->names->col[i]); comma = ','; } for(i=0; i< set->textsize[1]; i++){ if(!set->names || set->names->textct <= i) qxprintf(&q, "%s%c\n tc%i ", q, comma, i); else qxprintf(&q, "%s%c\n %s ", q, comma, set->names->text[i]); comma = ','; } if (set->weights) qxprintf(&q, "%s%c\n \"weights\" numeric", q, comma); qxprintf(&q,"%s);",q); apop_query("%s", q); qxprintf(&q," "); } } Get_vmsizes(set) //firstcol, msize2, maxsize int col_ct = (set->names ? !!set->names->rowct : 0) + set->textsize[1] + msize2 - firstcol + !!set->weights; Apop_stopif(!col_ct, return -1, 0, "Input data set has zero columns of data (no rownames, text, matrix, vector, or weights). I can't create a table like that, sorry."); if(apop_use_sqlite_prepared_statements(col_ct)){ sqlite3_stmt *statement; Apop_stopif( apop_prepare_prepared_statements(tabname, col_ct, &statement), return -1, 0, "Trouble preparing prepared statements."); Apop_stopif( run_prepared_statements(set, statement), return -1, 0, "error in insertions."); } else { for(i=0; i< maxsize; i++){ comma = ' '; qxprintf(&q, "%s \n insert into %s values(",q, tabname); if (use_row){ char *fixed= prep_string_for_sqlite(0, set->names->row[i]); qxprintf(&q, "%s %s ",q, fixed); free(fixed); comma = ','; } if (set->vector) add_a_number (&q, &comma, gsl_vector_get(set->vector,i)); if (set->matrix) for(j=0; j< set->matrix->size2; j++) add_a_number (&q, &comma, gsl_matrix_get(set->matrix,i,j)); for(j=0; j< set->textsize[1]; j++){ char *fixed= prep_string_for_sqlite(0, set->text[i][j]); qxprintf(&q, "%s%c %s ",q, comma,fixed ? 
fixed : "''"); free(fixed); comma = ','; } if (set->weights) add_a_number (&q, &comma, gsl_vector_get(set->weights,i)); qxprintf(&q,"%s);",q); apop_query("%s", q); q[0]='\0'; } } free(q); return 0; } apophenia-1.0+ds/apop_db_mysql.c000066400000000000000000000245271262736346100167410ustar00rootroot00000000000000/** \file apop_db_mysql.c This file is included directly into \ref apop_db.c. It is read only if APOP_USE_MYSQL is defined.*/ /* Copyright (c) 2006--2007 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include #include #include #include static MYSQL *mysql_db; #define Areweconected(retval) Apop_stopif(!mysql_db, return retval, 0, \ "No connection to a mySQL/mariadb database. apop_db_open() failure?"); static char *opt_host_name = NULL; /* server host (default=localhost) */ static unsigned int opt_port_num = 0; /* port number (use built-in value) */ static char *opt_socket_name = NULL; /* socket name (use built-in value) */ static unsigned int opt_flags = 0; /* connection flags (none) */ #define Apop_mstopif(cond, returnop, str) \ Apop_stopif(cond, returnop, 0, \ str "\n mySQL/mariadb error %u: %s\n", mysql_errno (mysql_db), mysql_error (mysql_db)); static int apop_mysql_db_open(char const *in){ Apop_stopif(!in, return 2, 0, "MySQL needs a non-NULL db name."); mysql_db = mysql_init (NULL); Apop_stopif(!mysql_db, return 1, 0, "mysql_init() failed (probably out of memory)"); Apop_mstopif (!mysql_real_connect (mysql_db, opt_host_name, apop_opts.db_user, apop_opts.db_pass, in, opt_port_num, opt_socket_name, CLIENT_MULTI_STATEMENTS+opt_flags), mysql_close (mysql_db); return 1, "mysql_real_connect() failed"); return 0; } static void apop_mysql_db_close(int ignoreme){ if (mysql_db) mysql_close (mysql_db); } /* //Cut & pasted & cleaned from the mysql manual. static void process_results(void){ else // mysql_store_result() returned nothing; should it have? 
        Apop_stopif(mysql_field_count(mysql_db) == 0, , 0, "apop_query error");
        //else query wasn't a select & just didn't return data.
}
*/

static double apop_mysql_query(char *query){
    Apop_mstopif(mysql_query(mysql_db, query), return 1, "apop_mysql_query failed");
    MYSQL_RES *result = mysql_store_result(mysql_db);
    if (result) mysql_free_result(result);
    return 0;
}

static double apop_mysql_table_exists(char const *table, int delme){
    Areweconected(GSL_NAN);
    MYSQL_RES *res_set = mysql_list_tables(mysql_db, table);
    //Test the result already in hand; calling mysql_list_tables a second time leaks a result set.
    Apop_mstopif(!res_set, return GSL_NAN, "show tables query failed.");
    int is_found = mysql_num_rows(res_set);
    mysql_free_result(res_set);
    if (!is_found) return 0;
    if (delme =='d' || delme=='D'){
        char *a_query;
        Asprintf(&a_query, "drop table %s", table);
        //On failure, free the query string and actually return the error value.
        Apop_mstopif(mysql_query (mysql_db, a_query), free(a_query); return GSL_NAN, "table exists, but table dropping failed");
        free(a_query);
    }
    return 1;
}

#define check_and_clean(do_if_failure) \
    Apop_mstopif( mysql_errno (conn), \
            if (out) do_if_failure; return NULL, \
            "mysql_fetch_row() failed"); \
    return out;

static int get_name_row(unsigned int *num_fields, MYSQL_FIELD *fields){
    for(size_t i = 0; i < *num_fields; i++)
        if (apop_opts.db_name_column && !strcasecmp(fields[i].name, apop_opts.db_name_column)){
            (*num_fields)--;
            return i;
        }
    return -1;
}

static void * process_result_set_data (MYSQL *conn, MYSQL_RES *res_set) {
    MYSQL_ROW row;
    unsigned int num_fields = mysql_num_fields(res_set);
    unsigned int num_rows = mysql_num_rows (res_set);
    if (!num_fields || !num_rows) return NULL;
    MYSQL_FIELD *fields = mysql_fetch_fields(res_set);
    int name_row = get_name_row(&num_fields, fields);
    apop_data *out = apop_data_alloc(0, num_rows, num_fields);
    for(size_t i = 0; i < num_fields+ (name_row>=0); i++)
        if (i!=name_row)
            apop_name_add(out->names, fields[i].name, 'c');
    for (int i=0; (row = mysql_fetch_row (res_set)); i++) {
        int passed_name = 0;
        for (size_t j = 0; j < mysql_num_fields (res_set); j++){
            if (j==name_row){
                apop_name_add(out->names,
row[j], 'r'); passed_name = 1; continue; } if (!row[j]) apop_data_set(out, i , j-passed_name, NAN); else { char *end = NULL; double num = strtod(row[j], &end); apop_data_set(out, i , j-passed_name, *end ? NAN : num); } } } check_and_clean(apop_data_free(out)) } static void * process_result_set_vector (MYSQL *conn, MYSQL_RES *res_set) { MYSQL_ROW row; unsigned int num_fields = mysql_num_fields(res_set); unsigned int num_rows = mysql_num_rows (res_set); if (num_fields == 0 || num_rows == 0) return NULL; gsl_vector *out = gsl_vector_alloc(num_rows); for (int j=0; (row = mysql_fetch_row (res_set)); j++){ double valor = (!row[0] || !strcmp(row[0], "NULL")) ? GSL_NAN : atof(row[0]); gsl_vector_set(out, j, valor); } check_and_clean(gsl_vector_free(out)) } static void * process_result_set_chars (MYSQL *conn, MYSQL_RES *res_set) { MYSQL_ROW row; unsigned int total_cols = mysql_num_fields(res_set); unsigned int total_rows = mysql_num_rows(res_set); MYSQL_FIELD *fields = mysql_fetch_fields(res_set); int name_row = get_name_row(&total_cols, fields); apop_data *out = apop_text_alloc(NULL, total_rows, total_cols); for (size_t i = 0; i < total_cols + (name_row>=0); i++) if (i!=name_row) apop_name_add(out->names, fields[i].name, 't'); for (int i=0; (row = mysql_fetch_row (res_set)); i++){ int passed_name = 0; for (size_t jj=0; jjnames, row[jj], 'r'); passed_name = 1; continue; } apop_text_set(out, i, jj-passed_name, "%s", (row[jj]==NULL)? 
                                                apop_opts.nan_string : row[jj]);
        }
    }
    check_and_clean(;)
}

static void * apop_mysql_query_core(char *query, void *(*callback)(MYSQL*, MYSQL_RES*)){
    Areweconected(NULL);
    apop_data *output = NULL;
    Apop_mstopif(mysql_query (mysql_db, query), return NULL, "mysql_query() failed");
    MYSQL_RES *res_set = mysql_store_result (mysql_db);
    //Both the data and the text callbacks return an apop_data set, so both can carry an error flag.
    Apop_mstopif(!res_set,
            if (callback == process_result_set_data || callback==process_result_set_chars)
                apop_return_data_error('q')
            else return NULL,
            "mysql_store_result() failed");
    if (!res_set->row_count) goto done; //just a blank table.
    output = callback(mysql_db, res_set);
    done:
    mysql_free_result (res_set);
    return output;
}

static double apop_mysql_query_to_float(char *query){
    Areweconected(GSL_NAN);
    Apop_mstopif(mysql_query (mysql_db, query) != 0, return GSL_NAN, "mysql_query() failed");
    MYSQL_RES *res_set = mysql_store_result (mysql_db);
    Apop_mstopif(!res_set, return GSL_NAN, "mysql_store_result() failed");
    if (mysql_num_rows(res_set)==0){
        mysql_free_result (res_set); //don't leak the result set on a blank table.
        return GSL_NAN;
    }
    MYSQL_ROW row = mysql_fetch_row (res_set);
    Apop_mstopif(mysql_errno (mysql_db), mysql_free_result (res_set); return GSL_NAN, "mysql_fetch_row() failed");
    //A NULL cell in the database arrives as a NULL pointer; don't hand that to atof.
    double out = (row && row[0]) ? atof(row[0]) : GSL_NAN;
    mysql_free_result (res_set);
    return out;
}

apop_data* apop_mysql_mixed_query(char const *intypes, char const *query){
    Areweconected(NULL);
    apop_data *out = NULL;
    Apop_mstopif(mysql_query (mysql_db, query), return NULL, "mysql_query() failed");
    MYSQL_RES *res_set = mysql_store_result(mysql_db);
    MYSQL_ROW row;
    Apop_mstopif(!res_set, return NULL, "mysql_store_result() failed");
    if (!res_set->row_count) goto done; //just a blank table.
    unsigned int total_cols = mysql_num_fields(res_set);
    unsigned int total_rows = mysql_num_rows(res_set);
    if (!total_cols || !total_rows) goto done;
    apop_qt info = { };
    count_types(&info, intypes); //in apop_db_sqlite.c
    //intypes[5] === names, vectors, mcols, textcols, weights.
    out = apop_data_alloc(info.intypes[1] ? total_rows : 0, info.intypes[2] ?
total_rows : 0, info.intypes[2]); int requested = info.intypes[0]+info.intypes[1]+info.intypes[2]+info.intypes[3]+info.intypes[4]; int excess = requested - total_cols; Apop_stopif(excess > 0, out->error='d' /*and continue.*/, 1, "you asked for %i columns in your list of types(%s), but your query produced %u columns. " "The remainder will be placed in the text section. Output data set's ->error element set to 'd'." , requested, intypes, total_cols); Apop_stopif(excess < 0, out->error='d' /*and continue.*/, 1, "you asked for %i columns in your list of types(%s), but your query produced %u columns. " "Ignoring the last %i type(s) in your list. Output data set's ->error element set to 'd'." , requested, intypes, total_cols, -excess); if (info.intypes[3]||excess>0) apop_text_alloc(out, total_rows, info.intypes[3] + ((excess > 0) ? excess : 0)); if (info.intypes[4]) out->weights = gsl_vector_alloc(total_rows); MYSQL_FIELD *fields = mysql_fetch_fields(res_set); for (size_t i=0; inames, fields[i].name, 't'); else if (c == 'v'|| c=='V') apop_name_add(out->names, fields[i].name, 'v'); else if (c == 'm'|| c=='M') apop_name_add(out->names, fields[i].name, 'c'); } for (int i=0; (row = mysql_fetch_row (res_set)); i++) { int thism=0, thist=0; for (size_t j=0; jnames, row[j], 'r'); else if (c == 't'|| c=='T') apop_text_set(out, i, thist++, "%s", (row[j]==NULL)? apop_opts.nan_string : row[j]); else if (c == 'v'|| c=='V'){ double valor = (!row[j] || !strcmp(row[j], "NULL")) ? NAN : atof(row[j]); gsl_vector_set(out->vector, i, valor); } else if (c == 'w'|| c=='W'){ double valor = (!row[j] || !strcmp(row[j], "NULL")) ? NAN : atof(row[j]); gsl_vector_set(out->weights, i, valor); } else if (c == 'm'|| c=='M') gsl_matrix_set(out->matrix, i , thism++, row[j] ? 
atof(row[j]): GSL_NAN); } } done: mysql_free_result (res_set); return out; } apophenia-1.0+ds/apop_db_sqlite.c000066400000000000000000000323301262736346100170640ustar00rootroot00000000000000/** \file apop_db_sqlite.c This file is included directly into \ref apop_db.c. Copyright (c) 2006--2007 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include #include sqlite3 *db=NULL; //There's only one SQLite database handle. Here it is. /** \cond doxy_ignore */ typedef struct StdDevCtx StdDevCtx; struct StdDevCtx { double avg; /* avg of terms */ double avg2; /* avg of the squares of terms */ double avg3; /* avg of the cube of terms */ double avg4; /* avg of the fourth-power of terms */ int cnt; /* Number of terms counted */ }; /** \endcond */ static void twoStep(sqlite3_context *context, int argc, sqlite3_value **argv){ if (argc<1) return; StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if (p && argv[0]){ double x = sqlite3_value_double(argv[0]); double ratio = p->cnt/(p->cnt+1.0); p->cnt++; p->avg *= ratio; p->avg2 *= ratio; p->avg += x/(p->cnt +0.0); p->avg2 += gsl_pow_2(x)/(p->cnt +0.0); } } static void threeStep(sqlite3_context *context, int argc, sqlite3_value **argv){ if (argc<1) return; StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if (p && argv[0]){ double x = sqlite3_value_double(argv[0]); double ratio = p->cnt/(p->cnt+1.0); p->cnt++; p->avg *= ratio; p->avg2 *= ratio; p->avg3 *= ratio; p->avg += x/p->cnt; p->avg2 += gsl_pow_2(x)/p->cnt; p->avg3 += gsl_pow_3(x)/p->cnt; } } static void fourStep(sqlite3_context *context, int argc, sqlite3_value **argv){ if( argc<1 ) return; StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if (p && argv[0]){ double x = sqlite3_value_double(argv[0]); p->cnt++; p->avg = (x + p->avg * (p->cnt-1.))/p->cnt; p->avg2 = (gsl_pow_2(x)+ p->avg2 * (p->cnt-1.))/p->cnt; p->avg3 = (gsl_pow_3(x)+ p->avg3 * (p->cnt-1.))/p->cnt; p->avg4 = (gsl_pow_4(x)+ p->avg4 * (p->cnt-1.))/p->cnt; } } static 
void stdDevFinalizePop(sqlite3_context *context){ StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if (p && p->cnt>1) sqlite3_result_double(context, sqrt((p->avg2 - gsl_pow_2(p->avg)))); else if (p->cnt == 1) sqlite3_result_double(context, 0); } static void varFinalizePop(sqlite3_context *context){ StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if( p && p->cnt>1 ) sqlite3_result_double(context, (p->avg2 - gsl_pow_2(p->avg))); else if (p->cnt == 1) sqlite3_result_double(context, 0); } static void stdDevFinalize(sqlite3_context *context){ StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if( p && p->cnt>1 ){ double rCnt = p->cnt; sqlite3_result_double(context, sqrt((p->avg2 - gsl_pow_2(p->avg))*rCnt/(rCnt-1.0))); } else if (p->cnt == 1) sqlite3_result_double(context, 0); } static void varFinalize(sqlite3_context *context){ StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if( p && p->cnt>1 ){ double rCnt = p->cnt; sqlite3_result_double(context, (p->avg2 - gsl_pow_2(p->avg))*rCnt/(rCnt-1.0)); } else if (p->cnt == 1) sqlite3_result_double(context, 0); } static void skewFinalize(sqlite3_context *context){ StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if( p && p->cnt>1 ){ double rCnt = p->cnt; sqlite3_result_double(context, (p->avg3*rCnt - 3*p->avg2*p->avg*rCnt + 2*rCnt * gsl_pow_3(p->avg)) * rCnt/((rCnt-1.0)*(rCnt-2.0))); } else if (p->cnt == 1) sqlite3_result_double(context, 0); } static void kurtFinalize(sqlite3_context *context){ StdDevCtx *p = sqlite3_aggregate_context(context, sizeof(*p)); if( p && p->cnt>1 ){ double n = p->cnt; double kurtovern = p->avg4 - 4*p->avg3*p->avg + 6 * p->avg2*gsl_pow_2(p->avg) - 3* gsl_pow_4(p->avg); double var = p->avg2 - gsl_pow_2(p->avg); long double coeff0= n*n/(gsl_pow_3(n)*(gsl_pow_2(n)-3*n+3)); long double coeff1= n*gsl_pow_2(n-1)+ (6*n-9); long double coeff2= n*(6*n-9); sqlite3_result_double(context, coeff0*(coeff1 * kurtovern + coeff2 * gsl_pow_2(var))); } 
else if (p->cnt == 1) sqlite3_result_double(context, 0); } static void powFn(sqlite3_context *context, int argc, sqlite3_value **argv){ double base = sqlite3_value_double(argv[0]); double exp = sqlite3_value_double(argv[1]); sqlite3_result_double(context, pow(base, exp)); } static void rngFn(sqlite3_context *context, int argc, sqlite3_value **argv){ Staticdef(gsl_rng *, rng, apop_rng_alloc(apop_opts.rng_seed++)); //sqlite3_result_double(context, gsl_rng_uniform(rng)); sqlite3_result_double(context, gsl_rng_uniform(apop_rng_get_thread(-1))); } #define sqfn(name) static void name##Fn(sqlite3_context *context, int argc, sqlite3_value **argv){ \ sqlite3_result_double(context, name(sqlite3_value_double(argv[0]))); } sqfn(sqrt) sqfn(exp) sqfn(log) sqfn(log10) sqfn(sin) sqfn(cos) sqfn(tan) sqfn(asin) sqfn(acos) sqfn(atan) static int apop_sqlite_db_open(char const *filename){ int status = sqlite3_open(filename ? filename : ":memory:", &db); Apop_stopif(status, db=NULL; return status, 0, "The database %s didn't open.", filename ? 
filename : "in memory"); sqlite3_create_function(db, "stddev", 1, SQLITE_ANY, NULL, NULL, &twoStep, &stdDevFinalize); sqlite3_create_function(db, "std", 1, SQLITE_ANY, NULL, NULL, &twoStep, &stdDevFinalizePop); sqlite3_create_function(db, "stddev_samp", 1, SQLITE_ANY, NULL, NULL, &twoStep, &stdDevFinalize); sqlite3_create_function(db, "stddev_pop", 1, SQLITE_ANY, NULL, NULL, &twoStep, &stdDevFinalizePop); sqlite3_create_function(db, "var", 1, SQLITE_ANY, NULL, NULL, &twoStep, &varFinalize); sqlite3_create_function(db, "var_samp", 1, SQLITE_ANY, NULL, NULL, &twoStep, &varFinalize); sqlite3_create_function(db, "var_pop", 1, SQLITE_ANY, NULL, NULL, &twoStep, &varFinalizePop); sqlite3_create_function(db, "variance", 1, SQLITE_ANY, NULL, NULL, &twoStep, &varFinalizePop); sqlite3_create_function(db, "skew", 1, SQLITE_ANY, NULL, NULL, &threeStep, &skewFinalize); sqlite3_create_function(db, "kurt", 1, SQLITE_ANY, NULL, NULL, &fourStep, &kurtFinalize); sqlite3_create_function(db, "kurtosis", 1, SQLITE_ANY, NULL, NULL, &fourStep, &kurtFinalize); sqlite3_create_function(db, "ln", 1, SQLITE_ANY, NULL, &logFn, NULL, NULL); sqlite3_create_function(db, "ran", 0, SQLITE_ANY, NULL, &rngFn, NULL, NULL); sqlite3_create_function(db, "pow", 2, SQLITE_ANY, NULL, &powFn, NULL, NULL); #define sqlink(name) sqlite3_create_function(db, #name , 1, SQLITE_ANY, NULL, &name##Fn, NULL, NULL); sqlink(sqrt) sqlink(exp) sqlink(sin) sqlink(cos) sqlink(tan) sqlink(asin) sqlink(acos) sqlink(atan) sqlink(log) sqlink(log10) apop_query("pragma short_column_names"); return 0; } /** \cond doxy_ignore */ typedef struct { //for the apop_query_to_... functions. int firstcall, namecol; size_t currentrow; apop_data *outdata; } callback_t; /** \endcond */ //This is the callback for apop_query_to_text. static int db_to_chars(void *qinfo,int argc, char **argv, char **column){ callback_t *qi= qinfo; apop_data* d = qi->outdata; //alias. Allocated in calling fn. 
int addnames = 0, ncshift=0; if (!d->names->textct) addnames++; if (qi->firstcall){ qi->firstcall = 0; for(int i=0; inamecol = i; break; } } int rows = d->textsize[0]; int cols = argc - (qi->namecol >= 0); apop_text_alloc(d, rows+1, cols);//doesn't move d. for (size_t jj=0; jjnamecol){ apop_name_add(d->names, argv[jj], 'r'); ncshift ++; } else { apop_text_set(d, rows, jj-ncshift, (argv[jj]==NULL)? apop_opts.nan_string: argv[jj]); //Asprintf(&(d->text[rows][jj-ncshift]), "%s", (argv[jj]==NULL)? "NaN": argv[jj]); if(addnames) apop_name_add(d->names, column[jj], 't'); } return 0; } apop_data * apop_sqlite_query_to_text(char *query){ char *err = NULL; callback_t qinfo = {.outdata=apop_data_alloc(), .namecol=-1, .firstcall=1}; if (db==NULL) apop_db_open(NULL); sqlite3_exec(db, query, db_to_chars, &qinfo, &err); ERRCHECK_SET_ERROR(qinfo.outdata) if (qinfo.outdata->textsize[0]==0){ apop_data_free(qinfo.outdata); return NULL; } return qinfo.outdata; } /** \cond doxy_ignore */ typedef struct { apop_data *d; int intypes[5];//names, vectors, mcols, textcols, weights. int current, thisrow, error_thrown; const char *instring; } apop_qt; /** \endcond */ static void count_types(apop_qt *in, const char *intypes){ int i = 0; char c; in->instring = intypes; while ((c=intypes[i++])) if (c=='n'||c=='N') in->intypes[0]++; else if (c=='v'||c=='V') in->intypes[1]++; else if (c=='m'||c=='M') in->intypes[2]++; else if (c=='t'||c=='T') in->intypes[3]++; else if (c=='w'||c=='W') in->intypes[4]++; if (in->intypes[0]>1) Apop_notify(1, "You asked apop_query_to_mixed data for multiple row names. I'll ignore all but the last one."); if (in->intypes[1]>1) Apop_notify(1, "You asked apop_query_to_mixed for multiple vectors. I'll ignore all but the last one."); if (in->intypes[4]>1) Apop_notify(1, "You asked apop_query_to_mixed for multiple weighting vectors. 
I'll ignore all but the last one."); } static int multiquery_callback(void *instruct, int argc, char **argv, char **column){ apop_qt *in = instruct; char c; int thistcol = 0, thismcol = 0, colct = 0, i, addnames = 0; in->thisrow ++; if (!in->d) { in->d = in->intypes[2] ? apop_data_alloc(in->intypes[1], 1, in->intypes[2]) : apop_data_alloc(in->intypes[1]); if (in->intypes[4]) in->d->weights = gsl_vector_alloc(1); if (in->intypes[3]){ in->d->textsize[0] = 1; in->d->textsize[1] = in->intypes[3]; in->d->text = malloc(sizeof(char***)); } } if (!(in->d->names->colct + in->d->names->textct + (in->d->names->vector!=NULL))) addnames++; if (in->d->textsize[1]){ in->d->textsize[0] = in->thisrow; in->d->text = realloc(in->d->text, sizeof(char ***)*in->thisrow); in->d->text[in->thisrow-1] = malloc(sizeof(char**) * in->d->textsize[1]); } if (in->intypes[2]) apop_matrix_realloc(in->d->matrix, in->thisrow, in->intypes[2]); for (i=in->current=0; i< argc; i++){ c = in->instring[in->current++]; if (c=='n'||c=='N'){ apop_name_add(in->d->names, (argv[i]? argv[i] : "NaN") , 'r'); if(addnames) apop_name_add(in->d->names, column[i], 'h'); } else if (c=='v'||c=='V'){ apop_vector_realloc(in->d->vector, in->thisrow); apop_data_set(in->d, in->thisrow-1, -1, argv[i] ? atof(argv[i]) : GSL_NAN); if(addnames) apop_name_add(in->d->names, column[i], 'v'); } else if (c=='m'||c=='M'){ apop_data_set(in->d, in->thisrow-1, thismcol++, argv[i] ? atof(argv[i]) : GSL_NAN); if(addnames) apop_name_add(in->d->names, column[i], 'c'); } else if (c=='t'||c=='T'){ Asprintf(&(in->d->text[in->thisrow-1][thistcol++]), "%s", argv[i] ? argv[i] : "NaN"); if(addnames) apop_name_add(in->d->names, column[i], 't'); } else if (c=='w'||c=='W'){ apop_vector_realloc(in->d->weights, in->thisrow); gsl_vector_set(in->d->weights, in->thisrow-1, argv[i] ? 
atof(argv[i]) : GSL_NAN); } colct++; } int requested = in->intypes[0]+in->intypes[1]+in->intypes[2]+in->intypes[3]+in->intypes[4]; Apop_stopif(colct != requested, in->error_thrown='d'; return 1, 1, "you asked for %i columns in your list of types(%s), but your query produced %u columns. " "The remainder will be placed in the text section. Output data set's ->error element set to 'd'." , requested, in->instring, colct); return 0; } apop_data *apop_sqlite_multiquery(const char *intypes, char *query){ Apop_stopif(!intypes, apop_return_data_error('t'), 0, "You gave me NULL for the list of input types. I can't work with that."); Apop_stopif(!query, apop_return_data_error('q'), 0, "You gave me a NULL query. I can't work with that."); char *err = NULL; apop_qt info = { }; count_types(&info, intypes); if (!db) apop_db_open(NULL); sqlite3_exec(db, query, multiquery_callback, &info, &err); Apop_stopif(info.error_thrown, if (!info.d) apop_data_alloc(); info.d->error='d'; return info.d, 0, "dimension error"); ERRCHECK_SET_ERROR(info.d) return info.d; } apophenia-1.0+ds/apop_fexact.c000066400000000000000000001551471262736346100164040ustar00rootroot00000000000000/** \file apop_fexact.c Fisher's exact test for contingency tables This file primarily consists of an algorithm from the ACM, fully documented below. The C code below was cut and pasted from the R project. Thanks, guys. Un-R-ifying modifications Copyright (c) 2006--2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. R version credits: fexact.f -- translated by f2c (version 19971204).\\ Run through a slightly modified version of MM's f2c-clean.\\ Heavily hand-edited by KH and MM. */ #include "apop_internal.h" #include #include #include /* These are the R-specific items. */ typedef enum { FALSE = 0, TRUE /*, MAYBE */ } Rboolean; int imax2(int a, int b){return (a>b) ? a : b;} int imin2(int a, int b){return (ab) ? a : b;} float fmin2(float a, float b){return (a MULT is now an __argument__ of the function 3. 
FEXACT may be converted to single precision by setting IREAL = 3,
   and converting all DOUBLE PRECISION specifications (except the
   specifications for RWRK, IWRK, and DWRK) to REAL. This will
   require changing the names and specifications of the intrinsic
   functions ALOG, AMAX1, AMIN1, EXP, and REAL. In addition, the
   machine specific constants will need to be changed, and the name
   DWRK will need to be changed to RWRK in the call to F2XACT.
4. Machine specific constants are specified and documented in F2XACT.
   A missing value code is specified in both FEXACT and F2XACT.
5. Although not a restriction, it is not generally practical to call
   this routine with large tables which are not sparse and in
   which the 'hybrid' algorithm has little effect. For example,
   although it is feasible to compute exact probabilities for the
   table
        1 8 5 4 4 2 2
        5 3 3 4 3 1 0
       10 1 4 0 0 0 0,
   computing exact probabilities for a similar table which has been
   enlarged by the addition of an extra row (or column) may not be
   feasible.
----------------------------------------------------------------------- */

/* To increase the length of the table of past path lengths relative
   to the length of the hash table, increase MULT. */

/* AMISS is a missing value indicator which is returned when the
   probability is not defined.
*/
    const double amiss = GSL_NAN;

    /* Set IREAL = 4 for DOUBLE PRECISION
       Set IREAL = 3 for SINGLE PRECISION */
#define i_real 4
#define i_int 2

    /* System generated locals */
    int ikh;
    /* Local variables */
    int nco, nro, ntot, numb, iiwk, irwk;
    int i, j, k, kk, ldkey, ldstp, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10;
    int i3a, i3b, i3c, i9a, iwkmax, iwkpt;

    /* Workspace Allocation */
    double *equiv;
    iwkmax = 2 * (int) (*workspace / 2);
    equiv = (double *) calloc(iwkmax / 2, sizeof(double));
#define dwrk (equiv)
#define iwrk ((int *)equiv)
#define rwrk ((float *)equiv)

    iwkpt = 0;

    /* Release the workspace before the early return so it is not leaked. */
    Apop_stopif(*nrow > *ldtabl, has_error=true; free(equiv); return, 0, "NROW must be less than or equal to LDTABL.");

    ntot = 0;
    for (i = 0; i < *nrow; ++i) {
        for (j = 0; j < *ncol; ++j) {
            if (table[i + j * *ldtabl] < 0)
                prterr(2, "All elements of TABLE must be non-negative.");
            ntot += table[i + j * *ldtabl];
        }
    }
    if (ntot == 0) {
        prterr(3, "All elements of TABLE are zero.\n"
               "PRT and PRE are set to missing values.");
        *pre = *prt = amiss;
        free(equiv);
        return;
    }

    /* nco := max(*nrow, *ncol)
     * nro := min(*nrow, *ncol) */
    if(*ncol > *nrow) {
        nco = *ncol;
        nro = *nrow;
    } else {
        nco = *nrow;
        nro = *ncol;
    }
    k = *nrow + *ncol + 1;
    kk = k * nco;

    ikh = ntot + 1;
    i1  = iwork(iwkmax, &iwkpt, ikh, i_real);
    i2  = iwork(iwkmax, &iwkpt, nco, i_int);
    i3  = iwork(iwkmax, &iwkpt, nco, i_int);
    i3a = iwork(iwkmax, &iwkpt, nco, i_int);
    i3b = iwork(iwkmax, &iwkpt, nro, i_int);
    i3c = iwork(iwkmax, &iwkpt, nro, i_int);
    ikh = imax2(k * 5 + (kk << 1), nco * 7 + 800);
    iiwk= iwork(iwkmax, &iwkpt, ikh, i_int);
    ikh = imax2(nco + 401, k);
    irwk= iwork(iwkmax, &iwkpt, ikh, i_real);

    /* NOTE:
       What follows below splits the remaining amount iwkmax - iwkpt of
       (int) workspace into hash tables as follows.
           type  size       index
           INT   2 * ldkey  i4 i5 i10
           REAL  2 * ldkey  i8 i9 i9a
           REAL  2 * ldstp  i6
           INT   6 * ldstp  i7
       Hence, we need ldkey times
           3 * 2 + 3 * 2 * s + 2 * mult * s + 6 * mult
       chunks of integer memory, where s = sizeof(REAL) / sizeof(INT).
       If doubles are used and are twice as long as ints, this gives
           18 + 10 * mult
       so that the value of ldkey can be obtained by dividing available
       (int) workspace by this number.

       In fact, because iwork() can actually use s * n + s - 1 int chunks
       when allocating a REAL, we use ldkey = available / numb - 1.

       FIXME:
       Can we always assume that sizeof(double) / sizeof(int) is 2?
       */

    if (i_real == 4) /* Double precision reals */
        numb = 18 + 10 * *mult;
    else             /* Single precision reals */
        numb = (*mult << 3) + 12;
    ldkey = (iwkmax - iwkpt) / numb - 1;
    ldstp = *mult * ldkey;

    ikh = ldkey << 1;  i4  = iwork(iwkmax, &iwkpt, ikh, i_int);
    ikh = ldkey << 1;  i5  = iwork(iwkmax, &iwkpt, ikh, i_int);
    ikh = ldstp << 1;  i6  = iwork(iwkmax, &iwkpt, ikh, i_real);
    ikh = ldstp * 6;   i7  = iwork(iwkmax, &iwkpt, ikh, i_int);
    ikh = ldkey << 1;  i8  = iwork(iwkmax, &iwkpt, ikh, i_real);
    ikh = ldkey << 1;  i9  = iwork(iwkmax, &iwkpt, ikh, i_real);
    ikh = ldkey << 1;  i9a = iwork(iwkmax, &iwkpt, ikh, i_real);
    ikh = ldkey << 1;  i10 = iwork(iwkmax, &iwkpt, ikh, i_int);

    /* To convert to double precision, change RWRK to DWRK in the next CALL. */
    f2xact(*nrow, *ncol, table, *ldtabl, expect, percnt, emin, prt, pre,
           dwrk + i1, iwrk + i2, iwrk + i3, iwrk + i3a, iwrk + i3b, iwrk + i3c,
           iwrk + i4, &ldkey, iwrk + i5, dwrk + i6, &ldstp, iwrk + i7,
           dwrk + i8, dwrk + i9, dwrk + i9a, iwrk + i10,
           iwrk + iiwk, dwrk + irwk);

    free(equiv);
}

#undef rwrk
#undef iwrk
#undef dwrk

static void f2xact(int nrow, int ncol, int *table, int ldtabl,
       double *expect, double *percnt, double *emin,
       double *prt, double *pre, double *fact, int *ico, int *iro,
       int *kyy, int *idif, int *irn, int *key,
       int *ldkey, int *ipoin, double *stp, int *ldstp,
       int *ifrq, double *LP, double *SP, double *tm,
       int *key2, int *iwk, double *rwk)
{
/*
  -----------------------------------------------------------------------
  Name:     F2XACT
  Purpose:  Computes Fisher's exact test for a contingency table,
            routine with workspace variables specified.
----------------------------------------------------------------------- */ const int imax = INT_MAX;/* the largest representable int on the machine.*/ /* AMISS is a missing value indicator which is returned when the probability is not defined. */ const double amiss = GSL_NAN; /* TOL is chosen as the square root of the smallest relative spacing. */ const static double tol = 3.45254e-7; const char* ch_err_5 = "The hash table key cannot be computed because the largest key\n" "is larger than the largest representable int.\n" "The algorithm cannot proceed.\n" "Reduce the workspace size or use another algorithm."; /* Local variables -- changed from "static" * (*does* change results very slightly on i386 linux) */ int i, ii, j, k, n, iflag,ifreq, ikkey, ikstp, ikstp2, ipn, ipo, itop, itp = 0, jkey, jstp, jstp2, jstp3, jstp4, k1, kb, kd, ks, kval = 0, kmax, last, ncell, ntot, nco, nro, nro2, nrb, i31, i32, i33, i34, i35, i36, i37, i38, i39, i41, i42, i43, i44, i45, i46, i47, i48, i310, i311; double dspt, d1,dd,df,ddf, drn,dro, obs, obs2, obs3, pastp,pv, tmp=0.; #ifndef USING_R double d2; int ifault; #endif Rboolean nr_gt_nc, maybe_chisq, chisq = FALSE/* -Wall */, psh; /* Parameter adjustments */ table -= ldtabl + 1; --ico; --iro; --kyy; --idif; --irn; --key; --ipoin; --stp; --ifrq; --LP; --SP; --tm; --key2; --iwk; --rwk; /* Check table dimensions */ Apop_stopif(nrow > ldtabl, has_error=true; return, 0, "NROW must be less than or equal to LDTABL."); Apop_stopif(ncol <= 1, has_error=true; return, 0, "NCOL must be at least 2"); /* Initialize KEY array */ for (i = 1; i <= *ldkey << 1; ++i) { key[i] = -9999; key2[i] = -9999; } nr_gt_nc = nrow > ncol; /* nco := max(nrow, ncol) : */ if (nr_gt_nc) nco = nrow; else nco = ncol; /* Compute row marginals and total */ ntot = 0; for (i = 1; i <= nrow; ++i) { iro[i] = 0; for (j = 1; j <= ncol; ++j) { Apop_stopif(table[i + j * ldtabl] < 0., has_error=true; return, 0, "All elements of TABLE must be non-negative."); iro[i] += table[i + j * 
ldtabl]; } ntot += iro[i]; } if (ntot == 0) { prterr(3, "All elements of TABLE are zero.\n" "PRT and PRE are set to missing values."); *pre = *prt = amiss; return; } /* Column marginals */ for (i = 1; i <= ncol; ++i) { ico[i] = 0; for (j = 1; j <= nrow; ++j) ico[i] += table[j + i * ldtabl]; } /* sort marginals */ isort(&nrow, &iro[1]); isort(&ncol, &ico[1]); /* Determine row and column marginals. Define max(nrow,ncol) =: nco >= nro := min(nrow,ncol) nco is defined above Swap marginals if necessary to ico[1:nco] & iro[1:nro] */ if (nr_gt_nc) { nro = ncol; /* Swap marginals */ for (i = 1; i <= nco; ++i) { ii = iro[i]; if (i <= nro) iro[i] = ico[i]; ico[i] = ii; } } else nro = nrow; /* Get multiplers for stack */ kyy[1] = 1; for (i = 1; i < nro; ++i) { /* Hash table multipliers */ if (iro[i] + 1 <= imax / kyy[i]) { kyy[i + 1] = kyy[i] * (iro[i] + 1); j /= kyy[i]; } else { prterr(5, ch_err_5); return; } } /* Check for Maximum product : */ /* original code: if (iro[nro - 1] + 1 > imax / kyy[nro - 1]) */ if (iro[nro] + 1 > imax / kyy[nro]) { prterr(501, ch_err_5); return; } /* Compute log factorials */ fact[0] = 0.; fact[1] = 0.; if(ntot >= 2) fact[2] = log(2.); /* MM: old code assuming log() to be SLOW */ for (i = 3; i <= ntot; i += 2) { fact[i] = fact[i - 1] + log((double) i); j = i + 1; if (j <= ntot) fact[j] = fact[i] + fact[2] + fact[j / 2] - fact[j / 2 - 1]; } /* Compute obs := observed path length */ obs = tol; ntot = 0; for (j = 1; j <= nco; ++j) { dd = 0.; if (nr_gt_nc) { for (i = 1; i <= nro; ++i) { dd += fact[table[j + i * ldtabl]]; ntot += table[j + i * ldtabl]; } } else { for (i = 1, ii = j * ldtabl + 1; i <= nro; i++, ii++) { dd += fact[table[ii]]; ntot += table[ii]; } } obs += fact[ico[j]] - dd; } /* Denominator of observed table: DRO */ dro = f9xact(nro, ntot, &iro[1], fact); /* improve: the following "easily" underflows to zero -- return "log()" */ *prt = exp(obs - dro); *pre = 0.; itop = 0; maybe_chisq = (*expect > 0.); /* Initialize pointers for 
workspace */ /* f3xact */ i31 = 1; i32 = i31 + nco; i33 = i32 + nco; i34 = i33 + nco; i35 = i34 + nco; i36 = i35 + nco; i37 = i36 + nco; i38 = i37 + nco; i39 = i38 + 400; i310 = 1; i311 = 1 + 400; /* f4xact */ i = nrow + ncol + 1; i41 = 1; i42 = i41 + i; i43 = i42 + i; i44 = i43 + i; i45 = i44 + i; i46 = i45 + i; i47 = i46 + i * nco; i48 = 1; /* Initialize pointers */ k = nco; last = *ldkey + 1; jkey = *ldkey + 1; jstp = *ldstp + 1; jstp2 = *ldstp * 3 + 1; jstp3 = (*ldstp << 2) + 1; jstp4 = *ldstp * 5 + 1; ikkey = 0; ikstp = 0; ikstp2 = *ldstp << 1; ipo = 1; ipoin[1] = 1; stp[1] = 0.; ifrq[1] = 1; ifrq[ikstp2 + 1] = -1; Outer_Loop: kb = nco - k + 1; ks = 0; n = ico[kb]; kd = nro + 1; kmax = nro; /* IDIF is the difference in going to the daughter */ for (i = 1; i <= nro; ++i) idif[i] = 0; /* Generate the first daughter */ do { --kd; ntot = imin2(n, iro[kd]); idif[kd] = ntot; if (idif[kmax] == 0) --kmax; n -= ntot; } while (n > 0 && kd != 1); if (n != 0) /* i.e. kd == 1 */ goto L310; k1 = k - 1; n = ico[kb]; ntot = 0; for (i = kb + 1; i <= nco; ++i) ntot += ico[i]; L150: /* Arc to daughter length=ICO[KB] */ for (i = 1; i <= nro; ++i) irn[i] = iro[i] - idif[i]; if (k1 > 1) { /* Sort irn */ if (nro == 2) { if (irn[1] > irn[2]) { ii = irn[1]; irn[1] = irn[2]; irn[2] = ii; } } else isort(&nro, &irn[1]); /* Adjust start for zero */ for (i = 1; i <= nro; ++i) { if (irn[i] != 0) break; } nrb = i; } else nrb = 1; nro2 = nro - nrb + 1; /* Some table values */ ddf = f9xact(nro, n, &idif[1], fact); drn = f9xact(nro2, ntot, &irn[nrb], fact) - dro + ddf; /* Get hash value */ if (k1 > 1) { kval = irn[1]; /* Note that with the corrected check at error "502", * we won't have overflow in kval below : */ for (i = 2; i <= nro; ++i) kval += irn[i] * kyy[i]; /* Get hash table entry */ i = kval % (*ldkey << 1) + 1; /* Search for unused location */ for (itp = i; itp <= *ldkey << 1; ++itp) { ii = key2[itp]; if (ii == kval) { goto L240; } else if (ii < 0) { key2[itp] = kval; LP[itp] = 1.; 
SP[itp] = 1.; goto L240; } } for (itp = 1; itp <= i - 1; ++itp) { ii = key2[itp]; if (ii == kval) goto L240; else if (ii < 0) { key2[itp] = kval; LP[itp] = 1.; goto L240; } } /* KH prterr(6, "LDKEY is too small.\n" "It is not possible to give the value of LDKEY required,\n" "but you could try doubling LDKEY (and possibly LDSTP)."); */ prterr(6, "LDKEY is too small for this problem.\n" "Try increasing the size of the workspace."); } L240: psh = TRUE; /* Recover pastp */ ipn = ipoin[ipo + ikkey]; pastp = stp[ipn + ikstp]; ifreq = ifrq[ipn + ikstp]; /* Compute shortest and longest path */ if (k1 > 1) { obs2 = obs - fact[ico[kb + 1]] - fact[ico[kb + 2]] - ddf; for (i = 3; i <= k1; ++i) obs2 -= fact[ico[kb + i]]; if (LP[itp] > 0.) { dspt = obs - obs2 - ddf; /* Compute longest path */ LP[itp] = f3xact(nro2, &irn[nrb], k1, &ico[kb + 1], ntot, fact, &iwk[i31], &iwk[i32], &iwk[i33], &iwk[i34], &iwk[i35], &iwk[i36], &iwk[i37], &iwk[i38], &iwk[i39], &rwk[i310], &rwk[i311], &tol); if(LP[itp] > 0.) {/* can this happen? */ printf("___ LP[itp=%d] = %g > 0\n", itp, LP[itp]); LP[itp] = 0.; } /* Compute shortest path -- using dspt as offset */ SP[itp] = f4xact(nro2, &irn[nrb], k1, &ico[kb + 1], dspt, fact, &iwk[i47], &iwk[i41], &iwk[i42], &iwk[i43], &iwk[i44], &iwk[i45], &iwk[i46], &rwk[i48], &tol); /* SP[itp] = fmin2(0., SP[itp] - dspt);*/ if(SP[itp] > 0.) { /* can this happen? */ printf("___ SP[itp=%d] = %g > 0\n", itp, SP[itp]); SP[itp] = 0.; } /* Use chi-squared approximation? 
*/ if (maybe_chisq && (irn[nrb] * ico[kb + 1]) > ntot * *emin) { ncell = 0.; for (i = 0; i < nro2; ++i) for (j = 1; j <= k1; ++j) if (irn[nrb + i] * ico[kb + j] >= ntot * *expect) ncell++; if (ncell * 100 >= k1 * nro2 * *percnt) { tmp = 0.; for (i = 0; i < nro2; ++i) tmp += (fact[irn[nrb + i]] - fact[irn[nrb + i] - 1]); tmp *= k1 - 1; for (j = 1; j <= k1; ++j) tmp += (nro2 - 1) * (fact[ico[kb + j]] - fact[ico[kb + j] - 1]); df = (double) ((nro2 - 1) * (k1 - 1)); tmp += df * 1.83787706640934548356065947281; tmp -= (nro2 * k1 - 1) * (fact[ntot] - fact[ntot - 1]); tm[itp] = (obs - dro) * -2. - tmp; } else { /* tm[itp] set to a flag value */ tm[itp] = -9876.; } } else tm[itp] = -9876.; } obs3 = obs2 - LP[itp]; obs2 -= SP[itp]; if (tm[itp] == -9876.) chisq = FALSE; else { chisq = TRUE; tmp = tm[itp]; } } else { obs2 = obs - drn - dro; obs3 = obs2; } L300: /* Process node with new PASTP */ if (pastp <= obs3) /* Update pre */ *pre += (double) ifreq * exp(pastp + drn); else if (pastp < obs2) { if (chisq) { df = (double) ((nro2 - 1) * (k1 - 1)); d1 = fmax2(0., tmp + (pastp + drn) * 2.) / 2.; d2 = df / 2.; pv = 1. 
- gammds(&d1, &d2, &ifault); *pre += (double) ifreq * exp(pastp + drn) * pv; } else { /* Put daughter on queue */ d1 = pastp + ddf; f5xact(&d1, &tol, &kval, &key[jkey], ldkey, &ipoin[jkey], &stp[jstp], ldstp, &ifrq[jstp], &ifrq[jstp2], &ifrq[jstp3], &ifrq[jstp4], &ifreq, &itop, psh); psh = FALSE; } } /* Get next PASTP on chain */ ipn = ifrq[ipn + ikstp2]; if (ipn > 0) { pastp = stp[ipn + ikstp]; ifreq = ifrq[ipn + ikstp]; goto L300; } /* Generate a new daughter node */ f7xact(kmax, &iro[1], &idif[1], &kd, &ks, &iflag); if (iflag != 1) goto L150; L310: /* Go get a new mother from stage K */ do { if(!f6xact(nro, &iro[1], &kyy[1], &key[ikkey + 1], ldkey, &last, &ipo)) /* Update pointers */ goto Outer_Loop; /* else : no additional nodes to process */ --k; itop = 0; ikkey = jkey - 1; ikstp = jstp - 1; ikstp2 = jstp2 - 1; jkey = *ldkey - jkey + 2; jstp = *ldstp - jstp + 2; jstp2 = (*ldstp << 1) + jstp; for (i = 1; i <= *ldkey << 1; ++i) key2[i] = -9999; } while (k >= 2); }/* f2xact() */ static double f3xact(int nrow, int *irow, int ncol, int *icol, int ntot, double *fact, int *ico, int *iro, int *it, int *lb, int *nr, int *nt, int *nu, int *itc, int *ist, double *stv, double *alen, const double *tol) { /* ----------------------------------------------------------------------- Name: F3XACT Purpose: Computes the longest path length for a given table. Arguments: NROW - The number of rows in the table. (Input) IROW - Vector of length NROW containing the row sums for the table. (Input) NCOL - The number of columns in the table. (Input) ICOL - Vector of length K containing the column sums for the table. (Input) NTOT - The total count in the table. (Input) FACT - Vector containing the logarithms of factorials. (Input) ICO - Work vector of length MAX(NROW,NCOL). IRO - Work vector of length MAX(NROW,NCOL). IT - Work vector of length MAX(NROW,NCOL). LB - Work vector of length MAX(NROW,NCOL). NR - Work vector of length MAX(NROW,NCOL). NT - Work vector of length MAX(NROW,NCOL). 
NU - Work vector of length MAX(NROW,NCOL). ITC - Work vector of length 400. IST - Work vector of length 400. STV - Work vector of length 400. ALEN - Work vector of length MAX(NROW,NCOL). TOL - Tolerance. (Input) Return Value : LP - The longest path for the table. (Output) ----------------------------------------------------------------------- */ const int ldst = 200;/* half stack size */ /* Initialized data */ static int nst = 0; static int nitc = 0; int i, k; int n11, n12, ii, nn, ks, ic1, ic2, nc1, nn1; int nr1, nco, nct, ipn, irl, key, lev, itp, nro, nrt, kyy, nc1s; double LP, v, val, vmn; Rboolean xmin; --stv; --ist; --itc; --nu; --nt; --nr; --lb; --it; --iro; --ico; --icol; --irow; if (nrow <= 1) { /* nrow is 1 */ LP = 0.; if (nrow > 0) for (i = 1; i <= ncol; ++i) LP -= fact[icol[i]]; return LP; } if (ncol <= 1) { /* ncol is 1 */ LP = 0.; if (ncol > 0) { for (i = 1; i <= nrow; ++i) LP -= fact[irow[i]]; } return LP; } /* 2 by 2 table */ if (nrow * ncol == 4) { n11 = (irow[1] + 1) * (icol[1] + 1) / (ntot + 2); n12 = irow[1] - n11; return -(fact[n11] + fact[n12] + fact[icol[1] - n11] + fact[icol[2] - n12]); } /* ELSE: larger than 2 x 2 : */ /* Test for optimal table */ val = 0.; if (irow[nrow] <= irow[1] + ncol) xmin = f10act(nrow, &irow[1], ncol, &icol[1], &val, fact, &lb[1], &nu[1], &nr[1]); else xmin = FALSE; if (! 
xmin && icol[ncol] <= icol[1] + nrow) xmin = f10act(ncol, &icol[1], nrow, &irow[1], &val, fact, &lb[1], &nu[1], &nr[1]); if (xmin) return - val; /* Setup for dynamic programming */ for (i = 0; i <= ncol; ++i) alen[i] = 0.; for (i = 1; i <= 2*ldst; ++i) ist[i] = -1; nn = ntot; /* Minimize ncol */ if (nrow >= ncol) { nro = nrow; nco = ncol; ico[1] = icol[1]; nt[1] = nn - ico[1]; for (i = 2; i <= ncol; ++i) { ico[i] = icol[i]; nt[i] = nt[i - 1] - ico[i]; } for (i = 1; i <= nrow; ++i) iro[i] = irow[i]; } else { nro = ncol; nco = nrow; ico[1] = irow[1]; nt[1] = nn - ico[1]; for (i = 2; i <= nrow; ++i) { ico[i] = irow[i]; nt[i] = nt[i - 1] - ico[i]; } for (i = 1; i <= ncol; ++i) iro[i] = icol[i]; } nc1s = nco - 1; kyy = ico[nco] + 1; /* Initialize pointers */ vmn = 1e100;/* to contain min(v..) */ irl = 1; ks = 0; k = ldst; LnewNode: /* Setup to generate new node */ lev = 1; nr1 = nro - 1; nrt = iro[irl]; nct = ico[1]; lb[1] = (int) ((((double) nrt + 1) * (nct + 1)) / (double) (nn + nr1 * nc1s + 1) - *tol) - 1; nu[1] = (int) ((((double) nrt + nc1s) * (nct + nr1)) / (double) (nn + nr1 + nc1s)) - lb[1] + 1; nr[1] = nrt - lb[1]; LoopNode: /* Generate a node */ --nu[lev]; if (nu[lev] == 0) { if (lev == 1) goto L200; --lev; goto LoopNode; } ++lb[lev]; --nr[lev]; while(1) { alen[lev] = alen[lev - 1] + fact[lb[lev]]; if (lev >= nc1s) break; nn1 = nt[lev]; nrt = nr[lev]; ++lev; nc1 = nco - lev; nct = ico[lev]; lb[lev] = (int) ((((double) nrt + 1) * (nct + 1)) / (double) (nn1 + nr1 * nc1 + 1) - *tol); nu[lev] = (int) ((((double) nrt + nc1) * (nct + nr1)) / (double) (nn1 + nr1 + nc1) - lb[lev] + 1); nr[lev] = nrt - lb[lev]; } alen[nco] = alen[lev] + fact[nr[lev]]; lb[nco] = nr[lev]; v = val + alen[nco]; if (nro == 2) { /* Only 1 row left */ v += fact[ico[1] - lb[1]] + fact[ico[2] - lb[2]]; for (i = 3; i <= nco; ++i) v += fact[ico[i] - lb[i]]; if (v < vmn) vmn = v; } else if (nro == 3 && nco == 2) { /* 3 rows and 2 columns */ nn1 = nn - iro[irl] + 2; ic1 = ico[1] - lb[1]; ic2 = 
ico[2] - lb[2]; n11 = (iro[irl + 1] + 1) * (ic1 + 1) / nn1; n12 = iro[irl + 1] - n11; v += fact[n11] + fact[n12] + fact[ic1 - n11] + fact[ic2 - n12]; if (v < vmn) vmn = v; } else { /* Column marginals are new node */ for (i = 1; i <= nco; ++i) it[i] = imax2(ico[i] - lb[i], 0); /* Sort column marginals it[] : */ if (nco == 2) { if (it[1] > it[2]) { /* swap */ ii = it[1]; it[1] = it[2]; it[2] = ii; } } else isort(&nco, &it[1]); /* Compute hash value */ key = it[1] * kyy + it[2]; for (i = 3; i <= nco; ++i) { key = it[i] + key * kyy; } if (key < -1){ if (apop_opts.verbose) printf("Bug in FEXACT: gave negative key.\n"); return -1; } /* Table index */ ipn = key % ldst + 1; /* Find empty position */ for (itp = ipn, ii = ks + ipn; itp <= ldst; ++itp, ++ii) { if (ist[ii] < 0) { goto L180; } else if (ist[ii] == key) { goto L190; } } for (itp = 1, ii = ks + 1; itp <= ipn - 1; ++itp, ++ii) { if (ist[ii] < 0) { goto L180; } else if (ist[ii] == key) { goto L190; } } /* this happens less, now that we check for negative key above: */ Apop_stopif(1, has_error=true; return GSL_NAN, 0, "Stack length exceeded in f3xact. 
This problem should not occur."); L180: /* Push onto stack */ ist[ii] = key; stv[ii] = v; ++nst; ii = nst + ks; itc[ii] = itp; goto LoopNode; L190: /* Marginals already on stack */ stv[ii] = fmin2(v, stv[ii]); } goto LoopNode; L200: /* Pop item from stack */ if (nitc > 0) { /* Stack index */ itp = itc[nitc + k] + k; --nitc; val = stv[itp]; key = ist[itp]; ist[itp] = -1; /* Compute marginals */ for (i = nco; i >= 2; --i) { ico[i] = key % kyy; key /= kyy; } ico[1] = key; /* Set up nt array */ nt[1] = nn - ico[1]; for (i = 2; i <= nco; ++i) nt[i] = nt[i - 1] - ico[i]; /* Test for optimality (L90) */ if (iro[nro] <= iro[irl] + nco) { xmin = f10act(nro, &iro[irl], nco, &ico[1], &val, fact, &lb[1], &nu[1], &nr[1]); } else xmin = FALSE; if (!xmin && ico[nco] <= ico[1] + nro) xmin = f10act(nco, &ico[1], nro, &iro[irl], &val, fact, &lb[1], &nu[1], &nr[1]); if (xmin) { if (vmn > val) vmn = val; goto L200; } else goto LnewNode; } else if (nro > 2 && nst > 0) { /* Go to next level */ nitc = nst; nst = 0; k = ks; ks = ldst - ks; nn -= iro[irl]; ++irl; --nro; goto L200; } return - vmn; } static double f4xact(int nrow, int *irow, int ncol, int *icol, double dspt, double *fact, int *icstk, int *ncstk, int *lstk, int *mstk, int *nstk, int *nrstk, int *irstk, double *ystk, const double *tol) { /* ----------------------------------------------------------------------- Name: F4XACT Purpose: Computes the shortest path length for a given table. Arguments: NROW - The number of rows in the table. (Input) IROW - Vector of length NROW containing the row sums for the table. (Input) NCOL - The number of columns in the table. (Input) ICOL - Vector of length K containing the column sums for the table. (Input) DSPT - "offset" for SP computation FACT - Vector containing the logarithms of factorials. (Input) ICSTK - NCOL by NROW+NCOL+1 work array. NCSTK - Work vector of length NROW+NCOL+1. LSTK - Work vector of length NROW+NCOL+1. MSTK - Work vector of length NROW+NCOL+1. 
NSTK - Work vector of length NROW+NCOL+1. NRSTK - Work vector of length NROW+NCOL+1. IRSTK - NROW by MAX(NROW,NCOL) work array. YSTK - Work vector of length NROW+NCOL+1. TOL - Tolerance. (Input) Return Value : SP - The shortest path for the table. (Output) ----------------------------------------------------------------------- */ int i, j, k, l, m, n, ic1, ir1, ict, irt, istk, nco, nro; double y, amx, SP; /* Take care of the easy cases first */ if (nrow == 1) { SP = 0.; for (i = 0; i < ncol; ++i) SP -= fact[icol[i]]; return SP; } if (ncol == 1) { SP = 0.; for (i = 0; i < nrow; ++i) SP -= fact[irow[i]]; return SP; } if (nrow * ncol == 4) { if (irow[1] <= icol[1]) return -(fact[irow[1]] + fact[icol[1]] + fact[icol[1] - irow[1]]); else return -(fact[icol[1]] + fact[irow[1]] + fact[irow[1] - icol[1]]); } /* Parameter adjustments */ irstk -= nrow + 1; icstk -= ncol + 1; --nrstk; --ncstk; --lstk; --mstk; --nstk; --ystk; /* initialization before loop */ for (i = 1; i <= nrow; ++i) irstk[i + nrow] = irow[nrow - i]; for (j = 1; j <= ncol; ++j) icstk[j + ncol] = icol[ncol - j]; nro = nrow; nco = ncol; nrstk[1] = nro; ncstk[1] = nco; ystk[1] = 0.; y = 0.; istk = 1; l = 1; amx = 0.; SP = dspt; /* First LOOP */ do { ir1 = irstk[istk * nrow + 1]; ic1 = icstk[istk * ncol + 1]; if (ir1 > ic1) { if (nro >= nco) { m = nco - 1; n = 2; } else { m = nro; n = 1; } } else if (ir1 < ic1) { if (nro <= nco) { m = nro - 1; n = 1; } else { m = nco; n = 2; } } else { if (nro <= nco) { m = nro - 1; n = 1; } else { m = nco - 1; n = 2; } } L60: if (n == 1) { i = l; j = 1; } else { i = 1; j = l; } irt = irstk[i + istk * nrow]; ict = icstk[j + istk * ncol]; y += fact[imin2(irt, ict)]; if (irt == ict) { --nro; --nco; f11act(&irstk[istk * nrow + 1], i, nro, &irstk[(istk + 1) * nrow + 1]); f11act(&icstk[istk * ncol + 1], j, nco, &icstk[(istk + 1) * ncol + 1]); } else if (irt > ict) { --nco; f11act(&icstk[istk * ncol + 1], j, nco, &icstk[(istk + 1) * ncol + 1]); f8xact(&irstk[istk * nrow + 1], irt - 
ict, i, nro, &irstk[(istk + 1) * nrow + 1]); } else { --nro; f11act(&irstk[istk * nrow + 1], i, nro, &irstk[(istk + 1) * nrow + 1]); f8xact(&icstk[istk * ncol + 1], ict - irt, j, nco, &icstk[(istk + 1) * ncol + 1]); } if (nro == 1) { for (k = 1; k <= nco; ++k) y += fact[icstk[k + (istk + 1) * ncol]]; break; } if (nco == 1) { for (k = 1; k <= nro; ++k) y += fact[irstk[k + (istk + 1) * nrow]]; break; } lstk[istk] = l; mstk[istk] = m; nstk[istk] = n; ++istk; nrstk[istk] = nro; ncstk[istk] = nco; ystk[istk] = y; l = 1; } while(1);/* end do */ if (y > amx) { amx = y; if (SP - amx <= *tol) return -dspt; } do { --istk; if (istk == 0) { SP -= amx; if (SP - amx <= *tol) return -dspt; else return SP - dspt; } l = lstk[istk] + 1; for(;; ++l) { if (l > mstk[istk]) break; n = nstk[istk]; nro = nrstk[istk]; nco = ncstk[istk]; y = ystk[istk]; if (n == 1) { if (irstk[l + istk * nrow] < irstk[l - 1 + istk * nrow]) goto L60; } else if (n == 2) { if (icstk[l + istk * ncol] < icstk[l - 1 + istk * ncol]) goto L60; } } } while(1); } void f5xact(double *pastp, const double *tol, int *kval, int *key, int *ldkey, int *ipoin, double *stp, int *ldstp, int *ifrq, int *npoin, int *nr, int *nl, int *ifreq, int *itop, Rboolean psh) { /* ----------------------------------------------------------------------- Name: F5XACT aka "PUT" Purpose: Put node on stack in network algorithm. Arguments: PASTP - The past path length. (Input) TOL - Tolerance for equivalence of past path lengths. (Input) KVAL - Key value. (Input) KEY - Vector of length LDKEY containing the key values. (in/out) LDKEY - Length of vector KEY. (Input) IPOIN - Vector of length LDKEY pointing to the linked list of past path lengths. (in/out) STP - Vector of length LSDTP containing the linked lists of past path lengths. (in/out) LDSTP - Length of vector STP. (Input) IFRQ - Vector of length LDSTP containing the past path frequencies. (in/out) NPOIN - Vector of length LDSTP containing the pointers to the next past path length. 
(in/out) NR - Vector of length LDSTP containing the right object pointers in the tree of past path lengths. (in/out) NL - Vector of length LDSTP containing the left object pointers in the tree of past path lengths. (in/out) IFREQ - Frequency of the current path length. (Input) ITOP - Pointer to the top of STP. (Input) PSH - Logical. (Input) If PSH is true, the past path length is found in the table KEY. Otherwise the location of the past path length is assumed known and to have been found in a previous call. ==>>>>> USING "static" variables ----------------------------------------------------------------------- */ static int itmp, ird, ipn, itp; /* << *need* static, see PSH above */ double test1, test2; --nl; --nr; --npoin; --ifrq; --stp; /* Function Body */ if (psh) { /* Convert KVAL to int in range 1, ..., LDKEY. */ ird = *kval % *ldkey; /* Search for an unused location */ for (itp = ird; itp < *ldkey; ++itp) { if (key[itp] == *kval) goto L40; if (key[itp] < 0) goto L30; } for (itp = 0; itp < ird; ++itp) { if (key[itp] == *kval) goto L40; if (key[itp] < 0) goto L30; } /* Return if KEY array is full */ /* KH prterr(6, "LDKEY is too small for this problem.\n" "It is not possible to estimate the value of LDKEY " "required,\n" "but twice the current value may be sufficient."); */ prterr(6, "LDKEY is too small for this problem.\n" "Try increasing the size of the workspace."); L30: /* Update KEY */ key[itp] = *kval; ++(*itop); ipoin[itp] = *itop; /* Return if STP array full */ if (*itop > *ldstp) { /* KH prterr(7, "LDSTP is too small for this problem.\n" "It is not possible to estimate the value of LDSTP " "required,\n" "but twice the current value may be sufficient."); */ prterr(7, "LDSTP is too small for this problem.\n" "Try increasing the size of the workspace."); } /* Update STP, etc. 
*/ npoin[*itop] = -1; nr [*itop] = -1; nl [*itop] = -1; stp [*itop] = *pastp; ifrq [*itop] = *ifreq; return; } L40: /* Find location, if any, of pastp */ ipn = ipoin[itp]; test1 = *pastp - *tol; test2 = *pastp + *tol; do { if (stp[ipn] < test1) ipn = nl[ipn]; else if (stp[ipn] > test2) ipn = nr[ipn]; else { ifrq[ipn] += *ifreq; return; } } while (ipn > 0); /* Return if STP array full */ ++(*itop); if (*itop > *ldstp) { /* prterr(7, "LDSTP is too small for this problem.\n" "It is not possible to estimate the value of LDSTP " "required,\n" "but twice the current value may be sufficient."); */ prterr(7, "LDSTP is too small for this problem.\n" "Try increasing the size of the workspace."); return; } /* Find location to add value */ ipn = ipoin[itp]; itmp = ipn; L60: if (stp[ipn] < test1) { itmp = ipn; ipn = nl[ipn]; if (ipn > 0) goto L60; /* else */ nl[itmp] = *itop; } else if (stp[ipn] > test2) { itmp = ipn; ipn = nr[ipn]; if (ipn > 0) goto L60; /* else */ nr[itmp] = *itop; } /* Update STP, etc. */ npoin[*itop] = npoin[itmp]; npoin[itmp] = *itop; stp [*itop] = *pastp; ifrq [*itop] = *ifreq; nl [*itop] = -1; nr [*itop] = -1; } Rboolean f6xact(int nrow, int *irow, int *kyy, int *key, int *ldkey, int *last, int *ipn) { /* ----------------------------------------------------------------------- Name: F6XACT aka "GET" Purpose: Pop a node off the stack. Arguments: NROW - The number of rows in the table. (Input) IROW - Vector of length nrow containing the row sums on output. (Output) KYY - Constant mutlipliers used in forming the hash table key. (Input) KEY - Vector of length LDKEY containing the hash table keys. (In/out) LDKEY - Length of vector KEY. (Input) LAST - Index of the last key popped off the stack. (In/out) IPN - Pointer to the linked list of past path lengths. (Output) Return value : TRUE if there are no additional nodes to process. 
(Output) ----------------------------------------------------------------------- */ int kval, j; --key; L10: ++(*last); if (*last <= *ldkey) { if (key[*last] < 0) goto L10; /* Get KVAL from the stack */ kval = key[*last]; key[*last] = -9999; for (j = nrow-1; j > 0; j--) { irow[j] = kval / kyy[j]; kval -= irow[j] * kyy[j]; } irow[0] = kval; *ipn = *last; return FALSE; } else { *last = 0; return TRUE; } } void f7xact(int nrow, int *imax, int *idif, int *k, int *ks, int *iflag) { /* ----------------------------------------------------------------------- Name: F7XACT Purpose: Generate the new nodes for given marginal totals. Arguments: NROW - The number of rows in the table. (Input) IMAX - The row marginal totals. (Input) IDIF - The column counts for the new column. (in/out) K - Indicator for the row to decrement. (in/out) KS - Indicator for the row to increment. (in/out) IFLAG - Status indicator. (Output) If IFLAG is zero, a new table was generated. For IFLAG = 1, no additional tables could be generated. 
----------------------------------------------------------------------- */ int i, m, kk, mm; /* Parameter adjustments */ --idif; --imax; /* Function Body */ *iflag = 0; /* Find node which can be incremented, ks */ if (*ks == 0) do { ++(*ks); } while (idif[*ks] == imax[*ks]); /* Find node to decrement (>ks) */ if (idif[*k] > 0 && *k > *ks) { --idif[*k]; do { --(*k); } while(imax[*k] == 0); m = *k; /* Find node to increment (>=ks) */ while (idif[m] >= imax[m]) { --m; } ++idif[m]; /* Change ks */ if (m == *ks && idif[m] == imax[m]) *ks = *k; } else { Loop: /* Check for finish */ for (kk = *k + 1; kk <= nrow; ++kk) { if (idif[kk] > 0) { goto L70; } } *iflag = 1; return; L70: /* Reallocate counts */ mm = 1; for (i = 1; i <= *k; ++i) { mm += idif[i]; idif[i] = 0; } *k = kk; do { --(*k); m = imin2(mm, imax[*k]); idif[*k] = m; mm -= m; } while (mm > 0 && *k != 1); /* Check that all counts reallocated */ if (mm > 0) { if (kk != nrow) { *k = kk; goto Loop; } *iflag = 1; return; } /* Get ks */ --idif[kk]; *ks = 0; do { ++(*ks); if (*ks > *k) { return; } } while (idif[*ks] >= imax[*ks]); } } void f8xact(int *irow, int is, int i1, int izero, int *new) { /* ----------------------------------------------------------------------- Name: F8XACT Purpose: Routine for reducing a vector when there is a zero element. Arguments: IROW - Vector containing the row counts. (Input) IS - Indicator. (Input) I1 - Indicator. (Input) IZERO - Position of the zero. (Input) NEW - Vector of new row counts. 
(Output) ----------------------------------------------------------------------- */ int i; /* Parameter adjustments */ --new; --irow; /* Function Body */ for (i = 1; i < i1; ++i) new[i] = irow[i]; for (i = i1; i <= izero - 1; ++i) { if (is >= irow[i + 1]) break; new[i] = irow[i + 1]; } new[i] = is; for(;;) { ++i; if (i > izero) return; new[i] = irow[i]; } } static double f9xact(int n, int ntot, int *ir, double *fact) { /* ----------------------------------------------------------------------- Name: F9XACT Purpose: Computes the log of a multinomial coefficient. Arguments: N - Length of IR. (Input) NTOT - Number for factorial in numerator. (Input) IR - Vector of length N containing the numbers for the denominator of the factorial. (Input) FACT - Table of log factorials. (Input) Returns: - The log of the multinomal coefficient. (Output) ----------------------------------------------------------------------- */ double d = fact[ntot]; for (int k = 0; k < n; k++) d -= fact[ir[k]]; return d; } Rboolean f10act(int nrow, int *irow, int ncol, int *icol, double *val, double *fact, int *nd, int *ne, int *m) { /* ----------------------------------------------------------------------- Name: F10ACT Purpose: Computes the shortest path length for special tables. Arguments: NROW - The number of rows in the table. (Input) IROW - Vector of length NROW containing the row totals. (Input) NCOL - The number of columns in the table. (Input) ICO - Vector of length NCOL containing the column totals.(Input) VAL - The shortest path. (Input/Output) FACT - Vector containing the logarithms of factorials. (Input) ND - Workspace vector of length NROW. (Input) NE - Workspace vector of length NCOL. (Input) M - Workspace vector of length NCOL. (Input) Returns (VAL and): XMIN - Set to true if shortest path obtained. 
(Output) ----------------------------------------------------------------------- */ int i, is, ix; for (i = 0; i < nrow - 1; ++i) nd[i] = 0; is = icol[0] / nrow; ix = icol[0] - nrow * is; ne[0] = is; m[0] = ix; if (ix != 0) ++nd[ix-1]; for (i = 1; i < ncol; ++i) { ix = icol[i] / nrow; ne[i] = ix; is += ix; ix = icol[i] - nrow * ix; m[i] = ix; if (ix != 0) ++nd[ix-1]; } for (i = nrow - 3; i >= 0; --i) nd[i] += nd[i + 1]; ix = 0; for (i = nrow; i >= 2; --i) { ix += is + nd[nrow - i] - irow[i-1]; if (ix < 0) return FALSE; } for (i = 0; i < ncol; ++i) { ix = ne[i]; is = m[i]; *val += is * fact[ix + 1] + (nrow - is) * fact[ix]; } return TRUE; } void f11act(int *irow, int i1, int i2, int *new) { /* ----------------------------------------------------------------------- Name: F11ACT Purpose: Routine for revising row totals. Arguments: IROW - Vector containing the row totals. (Input) I1 - Indicator. (Input) I2 - Indicator. (Input) NEW - Vector containing the row totals. (Output) ----------------------------------------------------------------------- */ int i; for (i = 0; i < (i1 - 1); ++i) new[i] = irow[i]; for (i = i1; i <= i2; ++i) new[i-1] = irow[i]; } static int iwork(int iwkmax, int *iwkpt, int number, int itype) { /* ----------------------------------------------------------------------- Name: iwork Purpose: Routine for allocating workspace. Arguments: iwkmax - Maximum (int) amount of workspace. (Input) iwkpt - Amount of (int) workspace currently allocated. (in/out) number - Number of elements of workspace desired. (Input) itype - Workspace type. (Input) ITYPE TYPE 2 integer 3 float 4 double iwork(): Index in rwrk, dwrk, or iwrk of the beginning of the first free element in the workspace array. 
(Output) ----------------------------------------------------------------------- */ int i = *iwkpt; if (itype == 2 || itype == 3) *iwkpt += number; else { /* double */ if (i % 2 != 0) ++i; *iwkpt += (number << 1); i /= 2; } Apop_stopif(*iwkpt >iwkmax, has_error=true;return i, 0, "Out of workspace: %i > %i", *iwkpt, iwkmax); return i; } #ifndef USING_R void isort(int *n, int *ix) { /* ----------------------------------------------------------------------- Name: ISORT Purpose: Shell sort for an int vector. Arguments: N - Lenth of vector IX. (Input) IX - Vector to be sorted. (in/out) ----------------------------------------------------------------------- */ static int ikey, i, j, m, il[10], kl, it, iu[10], ku; /* Parameter adjustments */ --ix; /* Function Body */ m = 1; i = 1; j = *n; L10: if (i >= j) goto L40; kl = i; ku = j; ikey = i; ++j; /* Find element in first half */ L20: ++i; if (i < j) if (ix[ikey] > ix[i]) goto L20; /* Find element in second half */ L30: --j; if (ix[j] > ix[ikey]) { goto L30; } /* Interchange */ if (i < j) { it = ix[i]; ix[i] = ix[j]; ix[j] = it; goto L20; } it = ix[ikey]; ix[ikey] = ix[j]; ix[j] = it; /* Save upper and lower subscripts of the array yet to be sorted */ if (m < 11) { if (j - kl < ku - j) { il[m - 1] = j + 1; iu[m - 1] = ku; i = kl; --j; } else { il[m - 1] = kl; iu[m - 1] = j - 1; i = j + 1; j = ku; } ++m; goto L10; } else Apop_stopif(1, return, 0, "This should never occur."); /* Use another segment */ L40: --m; if (m == 0) return; i = il[m - 1]; j = iu[m - 1]; goto L10; } static double gammds(double *y, double *p, int *ifault) { /* ----------------------------------------------------------------------- Name: GAMMDS Purpose: Cumulative distribution for the gamma distribution. Usage: PGAMMA (Q, ALPHA,IFAULT) Arguments: Q - Value at which the distribution is desired. (Input) ALPHA - Parameter in the gamma distribution. (Input) IFAULT - Error indicator. (Output) IFAULT DEFINITION 0 No error 1 An argument is misspecified. 
                     2     A numerical error has occurred.

      PGAMMA - The cdf for the gamma distribution with parameter alpha
               evaluated at Q.  (Output)
  -----------------------------------------------------------------------

       Algorithm AS 147 APPL. Statist. (1980) VOL. 29, P. 113

       Computes the incomplete gamma integral for positive
       parameters Y, P using an infinite series.
 */
    static double a, c, f, g;

    /* Checks for the admissibility of arguments and value of F */
    *ifault = 1;
    g = 0.;
    if (*y <= 0. || *p <= 0.)
        return g;
    *ifault = 2;

    /* ALOGAM is natural log of gamma function
       no need to test ifail as an error is impossible
       BK edit: using gsl_sf_lngamma instead. It has more methods--> maybe slower; more precise. */
    a = *p + 1.;
    f = exp(*p * log(*y) - gsl_sf_lngamma(a) - *y);
    if (f == 0.)
        return g;
    *ifault = 0;

    /* Series begins */
    c = 1.;
    g = 1.;
    a = *p;
    do {
        a += 1.;
        c *= (*y / a);
        g += c;
    } while (c > 1e-6 * g);
    g *= f;
    return g;
}

/** Convert from an \ref apop_data set to a table of integers.
 Not too necessary, but I needed it for the Fisher exact test. */
static int *apop_data_to_int_array(apop_data *intab){
    int rowct = intab->matrix->size1,
        colct = intab->matrix->size2,
        *out = malloc(sizeof(int)*(rowct* colct));
    for (int i=0; i< rowct; i++)
        for (int j=0; j< colct; j++)
            out[j*rowct + i] = (int) gsl_matrix_get(intab->matrix, i, j);
    return out;
}

/** Run the Fisher exact test on an input contingency table.

\return An \ref apop_data set with two rows:
"probability of table": Probability of the observed table for fixed marginal totals.
"p value": Table p-value. The probability of a more extreme table, where `extreme' is in a probabilistic sense. \li If there are processing errors, these values will be NaN. \exception out->error=='p' Processing error in the test. For example: \include test_fisher.c */ apop_data *apop_test_fisher_exact(apop_data *intab){ double prt, pre, expect = -1, percent = 80, emin = 1; int *intified = apop_data_to_int_array(intab), workspace = 200000, mult = 30, rowct = intab->matrix->size1, colct = intab->matrix->size2; has_error=0; OMP_critical (fexact) //f3xact and f5exact use static vars for some state-keeping. fexact(&rowct, &colct, intified, &rowct, // Cochran condition for asym.chisq. decision: &expect, &percent, &emin, &prt, &pre, &workspace, &mult); free(intified); apop_data *out = apop_data_alloc(); Asprintf(&out->names->title, "Fisher Exact test"); apop_data_add_named_elmt(out, "probability of table", prt); apop_data_add_named_elmt(out, "p value", pre); Apop_stopif(has_error, out->error='p'; return out, 0, "processing error; don't trust the results."); return out; } #endif /* not USING_R */ apophenia-1.0+ds/apop_hist.c000066400000000000000000000413321262736346100160670ustar00rootroot00000000000000 /** \file apop_hist.c */ /* Functions that work with PMFs and histograms. Copyright (c) 2006--2007, 2010, 2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. (Except psmirnov2x, Copyright R Project, but also licensed under the GPL.) */ #include "apop_internal.h" #include #include #include /** Make random draws from an \ref apop_model, and bin them using a binspec in the style of \ref apop_data_to_bins. If you have a data set that used the same binspec, you now have synced histograms, which you can plot or sensibly test hypotheses about. \param binspec A description of the bins in which to place the draws; see \ref apop_data_to_bins. (default: as in \ref apop_data_to_bins.) \param model The model to be drawn from. 
Because this function works via random draws, the model needs to have a \c draw method. (No default)
\param draws The number of random draws to make. (arbitrary default = 10,000)
\param bin_count If no bin spec, the number of bins to use (default: as per \ref apop_data_to_bins, \f$\sqrt{N}\f$)

\return An \ref apop_pmf model, with a new binned data set attached (which you may have to apop_data_free(output_model->data) to prevent memory leaks). The weights on the data set are normalized to sum to one.

\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
apop_model * apop_model_to_pmf(apop_model *model, apop_data *binspec, long int draws, int bin_count){
#else
apop_varad_head(apop_model *, apop_model_to_pmf){
    apop_model* apop_varad_var(model, NULL);
    Apop_assert(model && model->draw, "The model argument needs to be an apop_model with a 'draw' function "
                              "that I can use to make random draws.");
    apop_data* apop_varad_var(binspec, NULL);
    int apop_varad_var(bin_count, 0);
    long int apop_varad_var(draws, 1e4);
    return apop_model_to_pmf_base(model, binspec, draws, bin_count);
}

 apop_model * apop_model_to_pmf_base(apop_model *model, apop_data *binspec, long int draws, int bin_count){
#endif
    Get_vmsizes(binspec);
    apop_data *outd = apop_model_draws(model, draws);
    apop_data *outbinned = apop_data_to_bins(outd, binspec, .bin_count=bin_count);
    apop_data_free(outd);
    apop_vector_normalize(outbinned->weights);
    return apop_estimate(outbinned, apop_pmf);
}

/** Test the goodness-of-fit between two \ref apop_pmf models.

Let \f$o_i\f$ be the \f$i\f$th observed bin and \f$e_i\f$ the expected value of that bin; then under typical assumptions, \f$\sum_i^N (o_i-e_i)^2/e_i \sim \chi^2_{N-1}\f$.
If you send two histograms, I assume that the histograms are synced: for PMFs,
you've used \ref apop_data_to_bins to generate two histograms using the same binspec, or
you've used \ref apop_data_pmf_compress to guarantee that each observation value
appears exactly once in each data set.

In any case, all values in the \c observed set must appear in the \c expected set with nonzero weight; otherwise this will return a \f$\chi^2\f$ statistic of \c GSL_POSINF, indicating that it is impossible for the \c observed data to have been drawn from the \c expected distribution.

\li If an observation row has weight zero, I skip it. If \c apop_opts.verbose >= 1, I will show a warning.
*/
apop_data *apop_histograms_test_goodness_of_fit(apop_model *observed, apop_model *expected){
    int df = observed->data->weights->size;
    double diff = 0;
    for (int i=0; i< observed->data->weights->size; i++){
        double obs_val = gsl_vector_get(observed->data->weights, i);
        double exp_val = apop_p(Apop_r(observed->data, i), expected);
        if (exp_val == 0){
            diff = GSL_POSINF;
            break;
        }
        if (obs_val==0){
            Apop_notify(1, "element %i of the observed data has weight zero. Skipping it.", i);
            df --;
        } else
            diff += gsl_pow_2(obs_val - exp_val)/exp_val;
    }
    //Data gathered. Now output
    apop_data *out = apop_data_alloc();
    double toptail = gsl_cdf_chisq_Q(diff, df-1);
    Asprintf(&out->names->title, "Goodness-of-fit test via Chi-squared statistic");
    apop_data_add_named_elmt(out, "Chi squared statistic", diff);
    apop_data_add_named_elmt(out, "df", df-1);
    apop_data_add_named_elmt(out, "p value", toptail);
    apop_data_add_named_elmt(out, "confidence", 1 - toptail);
    return out;
}

/*Everything from here to psmirnov2x (inclusive) is cut/pasted/trivially modified from the R project. Copyright the R Project. */

static void m_multiply(long double *A, long double *B, long double *C, int m) {
    /* Auxiliary routine used by K().
       Matrix multiplication.
*/ for (int i = 0; i < m; i++) for (int j = 0; j < m; j++) { long double s = 0; for (int k = 0; k < m; k++) s += A[i * m + k] * B[k * m + j]; C[i * m + j] = s; } } static void m_power(long double *A, int eA, long double *V, int *eV, int m, int n) { /* Auxiliary routine used by K(). Matrix power. */ long double *B; int eB, i; if (n == 1) { for (i = 0; i < m * m; i++) V[i] = A[i]; *eV = eA; return; } m_power(A, eA, V, eV, m, n / 2); B = calloc(m * m, sizeof(long double)); m_multiply(V, V, B, m); eB = 2 * (*eV); if ((n % 2) == 0) { for (i = 0; i < m * m; i++) V[i] = B[i]; *eV = eB; } else { m_multiply(A, B, V, m); *eV = eA + eB; } if (V[(m / 2) * m + (m / 2)] > 1e140) { for (i = 0; i < m * m; i++) V[i] = V[i] * 1e-140; *eV += 140; } free(B); } /* The two-sided one-sample 'exact' distribution */ static double kolmogorov_2x(int n, double d) { /* Compute Kolmogorov's distribution. Code published in George Marsaglia and Wai Wan Tsang and Jingbo Wang (2003), "Evaluating Kolmogorov's distribution". Journal of Statistical Software, Volume 8, 2003, Issue 18. URL: http://www.jstatsoft.org/v08/i18/. */ int k, m, i, j, g, eH, eQ; long double h, s, *H, *Q; /* The faster right-tail approximation is omitted here. s = d*d*n; if(s > 7.24 || (s > 3.76 && n > 99)) return 1-2*exp(-(2.000071+.331/sqrt(n)+1.409/n)*s); */ k = (n * d) + 1; m = 2 * k - 1; h = k - n * d; H = calloc(m * m, sizeof(long double)); Q = calloc(m * m, sizeof(long double)); for(i = 0; i < m; i++) for(j = 0; j < m; j++) if (i - j + 1 < 0) H[i * m + j] = 0; else H[i * m + j] = 1; for(i = 0; i < m; i++) { H[i * m] -= pow(h, i + 1); H[(m - 1) * m + i] -= pow(h, (m - i)); } H[(m - 1) * m] += ((2 * h - 1 > 0) ? 
pow(2 * h - 1, m) : 0); for(i = 0; i < m; i++) for(j=0; j < m; j++) if(i - j + 1 > 0) for(g = 1; g <= i - j + 1; g++) H[i * m + j] /= g; eH = 0; m_power(H, eH, Q, &eQ, m, n); s = Q[(k - 1) * m + k - 1]; for(i = 1; i <= n; i++) { s = s * i / n; if(s < 1e-140) { s *= 1e140; eQ -= 140; } } s *= pow(10., eQ); free(H); free(Q); return(s); } static double psmirnov2x(double x, int m, int n) { if(m > n) { int tmp = n; n = m; m = tmp; } double md = m; double nd = n; // q has 0.5/mn added to ensure that rounding error doesn't // turn an equality into an inequality, eg abs(1/2-4/5)>3/10 long double q = (.5+floor(x * md * nd - 1e-7)) / (md * nd); long double u[n+1]; for(int j = 0; j <= n; j++) u[j] = ((j / nd) > q) ? 0 : 1; for(int i = 1; i <= m; i++) { long double w = i/(i + nd); u[0] = (i / md) > q ? 0 : w * u[0]; for(int j = 1; j <= n; j++) u[j] = fabs(i / md - j / nd) > q ? 0 : w * u[j] + u[j - 1]; } return u[n]; } /** Run the Kolmogorov-Smirnov test to determine whether two distributions are identical. \param m1 A sorted PMF model. I.e., a model estimated via something like apop_model *m1 = apop_estimate(apop_data_sort(input_data), apop_pmf); \param m2 Another \ref apop_model. If it is a PMF, then I will use a two-sample test, which is different from the one-sample test used if this is not a PMF. \return An \ref apop_data set including the \f$p\f$-value from the Kolmogorov-Smirnov test that the two distributions are equal. \exception out->error='m' Model error: \c m1 is not an \ref apop_pmf. I verify this by checking whether m1->cdf == apop_pmf->cdf. \li If you are using a \ref apop_pmf model, the data set(s) must be sorted before you set up the model, as per the example below. See \ref apop_data_sort and the discussion of CDFs in the \ref apop_pmf documentation. If you don't do this, the test will almost certainly reject the null hypothesis that \c m1 and \c m2 are identical. 
A future version of Apophenia may implement a mechanism to allow this function to test for sorted data, but it currently can't.

Here is an example, which tests whether a set of draws from a Normal(0, 1) matches a sequence of Normal distributions with increasing mean.

\include ks_tests.c
*/
apop_data *apop_test_kolmogorov(apop_model *m1, apop_model *m2){
    Apop_stopif(m1->cdf != apop_pmf->cdf, apop_return_data_error('m'), 0, "First model has to be a PMF. I check whether m1->cdf == apop_pmf->cdf.");
    bool m2_is_pmf = (m2->cdf == apop_pmf->cdf);
    int maxsize1, maxsize2;
    {Get_vmsizes(m1->data); maxsize1 = maxsize;} //copy one of the macro's variables
    {Get_vmsizes(m2->data); maxsize2 = maxsize;} //to the full function's scope.
    double largest_diff = GSL_NEGINF;
    double sum = 0;
    for (size_t i=0; i< maxsize1; i++){
        apop_data *arow = Apop_r(m1->data, i);
        sum += m1->data->weights ? gsl_vector_get(m1->data->weights, i) : 1./maxsize1;
        largest_diff = GSL_MAX(largest_diff, fabs(sum-apop_cdf(arow, m2)));
    }
    if (m2_is_pmf){
        double sum = 0;
        for (size_t i=0; i< maxsize2; i++){     //There could be matched data rows to m1, so there is redundancy.
            apop_data *arow = Apop_r(m2->data, i); // Feel free to submit a smarter version.
            sum += m2->data->weights ? gsl_vector_get(m2->data->weights, i) : 1./maxsize2;
            //Compare m2's running empirical CDF against m1's CDF at the same point.
            largest_diff = GSL_MAX(largest_diff, fabs(sum-apop_cdf(arow, m1)));
        }
    }
    apop_data *out = apop_data_alloc();
    Asprintf(&out->names->title, "Kolmogorov-Smirnov test");
    apop_data_add_named_elmt(out, "max distance", largest_diff);
    double ps = m2_is_pmf ? psmirnov2x(largest_diff, maxsize1, maxsize2)
                          : kolmogorov_2x(maxsize1, largest_diff);
    apop_data_add_named_elmt(out, "p value, 2 tail", 1-ps);
    apop_data_add_named_elmt(out, "confidence, 2 tail", ps);
    return out;
}

/** Create a histogram from data by putting data into bins of fixed width. Your input \ref apop_data set may be multidimensional, and may include both vector and matrix parts, and the bins output will have corresponding dimension.

\param indata The input data that will be binned, one observation per row. This is copied and the copy will be modified. (No default)
\param binspec This is an \ref apop_data set with the same number of columns as \c indata. If you want a fixed size for the bins, then the first row of the bin spec is the bin width for each column. This allows you to specify a width for each dimension, or specify the same size for all with something like:

\code
apop_data *binspec = apop_data_copy(Apop_r(indata, 0));
gsl_matrix_set_all(binspec->matrix, 10); //bins of size 10 for all dim.s
apop_data_to_bins(indata, binspec);
\endcode

The presumption is that the first bin starts at zero in all cases. You can add a second row to the spec to give the offset for each dimension. (default: NULL)
\param bin_count If you don't provide a bin spec, I'll provide this many evenly-sized bins to cover the data set. (Default: \f$\sqrt{N}\f$)
\param close_top_bin Normally, a bin covers the range from the point equal to its minimum to points strictly less than the minimum plus the width. If \c 'y', then the top bin includes points less than or equal to the upper bound. This solves the problem of displaying histograms where the top bin is just one point. (default: \c 'y' if \c binspec==NULL, else \c 'n')

\return A pointer to an \ref apop_data set with the same dimension as your input data. Each cell holds the lower bound of the bin into which the corresponding input cell falls.

\li If you provide neither a binspec nor a \c bin_count, I use a grid with offset equal to the min of the column, and bin size such that it takes \f$\sqrt{N}\f$ bins to cover the range to the max element.
\li The text segment is not binned. The \c more pointer, if any, is not followed.
\li If there are weights, they are copied to the output via \ref apop_vector_copy.
\li Given \c NULL input, return \c NULL output. Print a warning if apop_opts.verbose >= 2.

Iff you didn't give me a binspec, then I attach one to the output set as a page named \c \<binspec\>.
This means that you can snap a second data set to the same grid using \code apop_data_to_bins(first_set, NULL); apop_data_to_bins(second_set, apop_data_get_page(first_set, "")); \endcode \li The output has exactly as many rows as the input. Because many rows will be identical after binning, it may be fruitful to run it through \ref apop_data_pmf_compress to produce a short list with one total weight per bin. Here is a sample program highlighting \ref apop_data_to_bins and \ref apop_data_pmf_compress . \include binning.c \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC apop_data * apop_data_to_bins(apop_data const *indata, apop_data const *binspec, int bin_count, char close_top_bin){ #else apop_varad_head(apop_data *, apop_data_to_bins){ apop_data const *apop_varad_var(indata, NULL); Apop_assert_c(indata, NULL, 2, "NULL input data set, so returning NULL output data set."); apop_data const *apop_varad_var(binspec, NULL); char apop_varad_var(close_top_bin, binspec==NULL ? 'y' : 'n'); int apop_varad_var(bin_count, 0); return apop_data_to_bins_base(indata, binspec, bin_count, close_top_bin); } apop_data * apop_data_to_bins_base(apop_data const *indata, apop_data const *binspec, int bin_count, char close_top_bin){ #endif Get_vmsizes(indata); //firstcol, vsize, msize1, msize2 double binwidth, offset, max=0; apop_data *out = apop_data_alloc(vsize, msize1, msize2); apop_data const *bs = binspec ? binspec : apop_data_add_page(out, apop_data_alloc(vsize? 2: 0, msize1? 2: 0, indata->matrix ? msize2: 0), ""); for (int j= firstcol; j< msize2; j++){ gsl_vector *onecol = Apop_cv(out, j); gsl_vector *datacol = Apop_cv(indata, j); if (binspec){ binwidth = apop_data_get(binspec, 0, j); offset = ((binspec->vector && binspec->vector->size==2 ) ||(binspec->matrix && binspec->matrix->size1==2)) ? 
apop_data_get(binspec, 1, j) : 0; } else { gsl_vector *abin = Apop_cv(bs, j); max = gsl_vector_max(datacol); offset = abin->data[1] = gsl_vector_min(datacol); binwidth = abin->data[0] = (max - offset)/(bin_count ? bin_count : sqrt(datacol->size)); } for (int i=0; i< onecol->size; i++){ double val = gsl_vector_get(datacol, i); double adjust = (close_top_bin=='y' && val == max && val!=offset) ? 2*GSL_DBL_EPSILON : 0; gsl_vector_set(onecol, i, (floor((val-offset-adjust)/binwidth))*binwidth+offset); } } if (indata->weights) out->weights = apop_vector_copy(indata->weights); return out; } /** Return a new vector that is the moving average of the input vector. \param v The input vector, unsmoothed \param bandwidth An integer \f$\geq 1\f$ giving the number of elements to be averaged to produce one number. \return A smoothed vector of size v->size - (bandwidth/2)*2. */ gsl_vector *apop_vector_moving_average(gsl_vector *v, size_t bandwidth){ Apop_stopif(!v, return NULL, 0, "You asked me to smooth a NULL vector; returning NULL."); Apop_stopif(!bandwidth, return apop_vector_copy(v), 0, "Bandwidth must be >=1. Returning a copy of original vector with no smoothing."); int halfspan = bandwidth/2; Apop_stopif(v->size <= (size_t)halfspan*2, return NULL, 0, "Bandwidth wider than the vector. Returning NULL."); gsl_vector *vout = gsl_vector_calloc(v->size - halfspan*2); for(size_t i=0; i < vout->size; i ++){ double *item = gsl_vector_ptr(vout, i); for (int j=-halfspan; j < halfspan+1; j ++) *item += gsl_vector_get(v, j+ i+ halfspan); *item /= halfspan*2 +1; } return vout; } apophenia-1.0+ds/apop_internal.h000066400000000000000000000073721262736346100167470ustar00rootroot00000000000000/* These are functions used here and there to write Apophenia. They're not incredibly useful, or even very good form, so they're not public. Cut & paste `em into your own code if you'd like.
*/ /* Many Apop functions try to treat the vector and matrix equally, which requires knowing which exists and what the sizes are. */ #define Get_vmsizes(d) \ int firstcol = d && (d)->vector ? -1 : 0; \ int vsize = d && (d)->vector ? (d)->vector->size : 0; \ int wsize = d && (d)->weights ? (d)->weights->size : 0; \ int msize1 = d && (d)->matrix ? (d)->matrix->size1 : 0; \ int msize2 = d && (d)->matrix ? (d)->matrix->size2 : 0; \ int tsize = vsize + msize1*msize2; \ int maxsize = GSL_MAX(vsize, GSL_MAX(msize1, d?d->textsize[0]:0));\ (void)(tsize||wsize||firstcol||maxsize) /*prevent unused variable complaints */; // Define a static variable, and initialize on first use. #define Staticdef(type, name, def) static type (name) = NULL; if (!(name)) (name) = (def); // Check for NULL and complain if so. #define Nullcheck(in, errval) Apop_assert_c(in, errval, apop_errorlevel, "%s is NULL.", #in); #define Nullcheck_m(in, errval) Apop_assert_c(in, errval, apop_errorlevel, "%s is a NULL model.", #in); #define Nullcheck_mp(in, errval) Nullcheck_m(in, errval); Apop_assert_c((in)->parameters, errval, apop_errorlevel, "%s is a model with NULL parameters. Please set the parameters and try again.", #in); #define Nullcheck_d(in, errval) Apop_assert_c(in, errval, apop_errorlevel, "%s is a NULL data set.", #in); //And because I do them all so often: #define Nullcheck_mpd(data, model, errval) Nullcheck_m(model, errval); Nullcheck_p(model, errval); Nullcheck_d(data, errval); //deprecated: #define Nullcheck_p(in, errval) Nullcheck_mp(in, errval); //in apop_conversions.c Extend a string. void xprintf(char **q, char *format, ...); #define XN(in) ((in) ? (in) : "") //For a pedantic compiler. Continues on error, because there's not much else to do: the computer is clearly broken. #define Asprintf(...) 
Apop_stopif(asprintf(__VA_ARGS__)==-1, , 0, "Error printing to a string.") #include #include int apop_use_sqlite_prepared_statements(size_t col_ct); int apop_prepare_prepared_statements(char const *tabname, size_t col_ct, sqlite3_stmt **statement); char *prep_string_for_sqlite(int prepped_statements, char const *astring);//apop_conversions.c void apop_gsl_error(char const *reason, char const *file, int line, int gsl_errno); //apop_linear_algebra.c //For when we're forced to use a global variable. #undef threadlocal #if __STDC_VERSION__ > 201100L #define threadlocal _Thread_local #elif defined(__APPLE__) #define threadlocal #elif defined(__GNUC__) && !defined(threadlocal) #define threadlocal __thread #else #define threadlocal #endif #ifdef _OPENMP #define PRAGMA(x) _Pragma(#x) #define OMP_critical(tag) PRAGMA(omp critical ( tag )) #define OMP_for(...) _Pragma("omp parallel for") for(__VA_ARGS__) #define OMP_for_reduce(red, ...) PRAGMA(omp parallel for reduction( red )) for(__VA_ARGS__) #else #define OMP_critical(tag) #define OMP_for(...) for(__VA_ARGS__) #define OMP_for_reduce(red, ...) for(__VA_ARGS__) #endif #include "config.h" #ifndef HAVE___ATTRIBUTE__ #define __attribute__(...) #endif #ifndef HAVE_ASPRINTF #include //asprintf, vararg, &c extern int asprintf (char **res, const char *format, ...) __attribute__ ((__format__ (__printf__, 2, 3))); extern int vasprintf (char **res, const char *format, va_list args) __attribute__ ((__format__ (__printf__, 2, 0))); #endif #include "apop.h" void add_info_criteria(apop_data *d, apop_model *m, apop_model *est, double ll, int param_ct); //In apop_mle.c apop_model *maybe_prep(apop_data *d, apop_model *m, _Bool *is_a_copy); //in apop_mcmc, for apop_update. apophenia-1.0+ds/apop_linear_algebra.c000066400000000000000000000545021262736346100200520ustar00rootroot00000000000000 /** \file apop_linear_algebra.c Assorted things to do with matrices, such as take determinants or do singular value decompositions. 
Includes many convenience functions that don't actually do math but add/delete columns, check bounds, et cetera. */ /* Copyright (c) 2006--2007, 2012 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" void apop_gsl_error(const char *reason, const char *file, int line, int gsl_errno){ Apop_notify(1, "%s: %s", file, reason); Apop_maybe_abort(1); } #define Checkgsl(...) if (__VA_ARGS__) {goto done;} #define Check_gsl_with_out(...) if (__VA_ARGS__) {out->error='m'; goto done;} #define Check_gsl_with_outmp(...) if (__VA_ARGS__) {gsl_matrix_free(*out); *out=NULL; goto done;} #define Set_gsl_handler gsl_error_handler_t *prior_handler = gsl_set_error_handler(apop_gsl_error); #define Unset_gsl_handler gsl_set_error_handler(prior_handler); /** Calculate the determinant of a matrix, its inverse, or both, via LU decomposition. The \c in matrix is not destroyed in the process. \see apop_matrix_determinant, apop_matrix_inverse \param in The matrix to be inverted/determined. \param out If you want an inverse, this is where to place the matrix to be filled with the inverse. Will be allocated by the function. \param calc_det 0: Do not calculate the determinant.
1: Do. \param calc_inv 0: Do not calculate the inverse.
1: Do. \return If calc_det == 1, then return the determinant. Otherwise, return \c NaN. If calc_inv!=0, then \c *out is pointed to the matrix inverse. In case of difficulty, I will set *out=NULL and return \c NaN. */ double apop_det_and_inv(const gsl_matrix *in, gsl_matrix **out, int calc_det, int calc_inv) { Set_gsl_handler Apop_stopif(in->size1 != in->size2, if (out) *out=NULL; return GSL_NAN, 0, "You asked me to invert a %zu X %zu matrix, " "but inversion requires a square matrix.", in->size1, in->size2); int sign; double the_determinant = GSL_NAN; gsl_matrix *invert_me = gsl_matrix_alloc(in->size1, in->size1); gsl_permutation * perm = gsl_permutation_alloc(in->size1); gsl_matrix_memcpy (invert_me, in); Checkgsl(gsl_linalg_LU_decomp(invert_me, perm, &sign)) if (calc_inv){ *out = gsl_matrix_alloc(in->size1, in->size1); //square. Check_gsl_with_outmp(gsl_linalg_LU_invert(invert_me, perm, *out)) } if (calc_det) the_determinant = gsl_linalg_LU_det(invert_me, sign); done: gsl_matrix_free(invert_me); gsl_permutation_free(perm); Unset_gsl_handler return the_determinant; } /** Inverts a matrix. The \c in matrix is not destroyed in the process. You may want to call \ref apop_matrix_determinant first to check that your input is invertible, or use \ref apop_det_and_inv to do both at once. \param in The matrix to be inverted. \return Its inverse. */ gsl_matrix * apop_matrix_inverse(const gsl_matrix *in) { gsl_matrix *out = NULL; apop_det_and_inv(in, &out, 0, 1); return out; } /** Find the determinant of a matrix. The \c in matrix is not destroyed in the process. See also \ref apop_matrix_inverse , or \ref apop_det_and_inv to do both at once. \param in The matrix to be determined. \return The determinant. */ double apop_matrix_determinant(const gsl_matrix *in) { return apop_det_and_inv(in, NULL, 1, 0); } /** Principal component analysis: hand in a matrix and (optionally) a number of desired dimensions, and I'll return a data set where each column of the matrix is an eigenvector.
The columns are sorted, so column zero has the greatest weight. The vector element of the data set gives the weights. You may also specify the number of elements your principal component space should have. If this is equal to the rank of the space in which the input data lives, then the sum of weights will be one. If the number of dimensions desired is less than that (probably so you can prepare a plot), then the weights will be accordingly smaller, giving you an indication of how much variation these dimensions explain. \param data The input matrix. I modify it in place so that each column has mean zero. (No default. If \c NULL, return \c NULL and print a warning iff apop_opts.verbose >= 1.) \param dimensions_we_want The singular value decomposition will return this many of the eigenvectors with the largest eigenvalues. (default: the size of the covariance matrix, i.e. data->size2) \return Returns an \ref apop_data set whose matrix is the principal component space. Each column of the returned matrix will be another eigenvector; the columns will be ordered by the eigenvalues. The data set's vector will be the largest eigenvalues, scaled by the total of all eigenvalues (including those that were thrown out). The sum of these returned values will give you the percentage of variance explained by the factor analysis. \exception out->error=='a' Allocation error.
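As a rough sketch of how the weights in the output vector are scaled (plain C with no Apophenia or GSL calls; \c explained_share is a hypothetical name used only for illustration, and the eigenvalues are assumed to be already sorted from largest to smallest), each kept eigenvalue is divided by the total of all eigenvalues, including the discarded ones:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper, not part of Apophenia: scale the first `keep`
// eigenvalues by the total of all `n`, write the scaled values to `share`,
// and return their sum (the share of variance the kept dimensions explain).
double explained_share(double const *evalues, size_t n, size_t keep, double *share){
    assert(keep <= n);
    double total = 0, kept = 0;
    for (size_t i=0; i< n; i++) total += evalues[i];
    for (size_t i=0; i< keep; i++){
        share[i] = evalues[i]/total;
        kept += share[i];
    }
    return kept;
}
```

Keeping all \c n eigenvalues gives a sum of one; keeping fewer gives a sum below one, which is the sense in which the weights indicate how much variation the retained dimensions explain.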
*/ #ifdef APOP_NO_VARIADIC apop_data * apop_matrix_pca(gsl_matrix *data, int const dimensions_we_want){ #else apop_varad_head(apop_data *, apop_matrix_pca) { gsl_matrix * apop_varad_var(data, NULL); Apop_stopif(!data, return NULL, 1, "NULL data input"); int const apop_varad_var(dimensions_we_want, data->size2); return apop_matrix_pca_base(data, dimensions_we_want); } apop_data * apop_matrix_pca_base(gsl_matrix *data, int const dimensions_we_want){ #endif Set_gsl_handler apop_data *pc_space = apop_data_alloc(0, data->size2, dimensions_we_want); Apop_stopif(pc_space->error, return pc_space, 0, "Allocation error."); pc_space->vector = gsl_vector_alloc(dimensions_we_want); Apop_stopif(!pc_space->vector, pc_space->error='a'; return pc_space, 0, "Allocation error setting up a %i vector.", dimensions_we_want); gsl_matrix *eigenvectors = gsl_matrix_alloc(data->size2, data->size2); gsl_vector *dummy_v = gsl_vector_alloc(data->size2); gsl_vector *all_evalues = gsl_vector_alloc(data->size2); gsl_matrix *square = gsl_matrix_calloc(data->size2, data->size2); Apop_stopif(!eigenvectors || !dummy_v || !all_evalues || !square, pc_space->error='a'; return pc_space, 0, "Allocation error setting up workspace for %zu dimensions.", data->size2); double eigentotals = 0; for (int i=0; i< data->size2; i++) apop_vector_normalize(Apop_mcv(data, i), NULL, 'm'); Checkgsl(gsl_blas_dgemm(CblasTrans,CblasNoTrans, 1, data, data, 0, square)) Checkgsl(gsl_linalg_SV_decomp(square, eigenvectors, all_evalues, dummy_v)) for (int i=0; i< all_evalues->size; i++) eigentotals += gsl_vector_get(all_evalues, i); for (int i=0; i< dimensions_we_want; i++){ gsl_vector *v = Apop_mcv(eigenvectors, i); gsl_matrix_set_col(pc_space->matrix, i, v); gsl_vector_set(pc_space->vector, i, gsl_vector_get(all_evalues, i)/eigentotals); } done: gsl_vector_free(dummy_v); gsl_vector_free(all_evalues); gsl_matrix_free(square); gsl_matrix_free(eigenvectors); Unset_gsl_handler return pc_space; } static void l10(double *d){ *d = log10(*d); } static void ln(double *d){ *d = log(*d); } static void ex(double *d){ *d = exp(*d); } /** 
Replace every vector element \f$v_i\f$ with log\f$_{10}(v_i)\f$. \li If the input vector is \c NULL, do nothing. */ void apop_vector_log10(gsl_vector *v){ if (!v) return; apop_vector_apply(v, l10); } /** Replace every vector element \f$v_i\f$ with ln\f$(v_i)\f$. \li If the input vector is \c NULL, do nothing. */ void apop_vector_log(gsl_vector *v){ if (!v) return; apop_vector_apply(v, ln); } /** Replace every vector element \f$v_i\f$ with exp\f$(v_i)\f$. \li If the input vector is \c NULL, do nothing. */ void apop_vector_exp(gsl_vector *v){ if (!v) return; apop_vector_apply(v, ex); } /** Put the first vector on top of the second vector. \param v1 the upper vector (default=\c NULL, in which case this copies \c v2) \param v2 the second vector (default=\c NULL, in which case nothing is added) \param inplace If \c 'y', use \ref apop_vector_realloc to modify \c v1 in place; see the caveats on that function. Otherwise, allocate a new vector, leaving \c v1 undisturbed. (default=\c 'n') \return the stacked data, either in a new vector or a pointer to \c v1. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC gsl_vector * apop_vector_stack(gsl_vector *v1, gsl_vector const * v2, char inplace){ #else apop_varad_head(gsl_vector *, apop_vector_stack){ gsl_vector * apop_varad_var(v1, NULL); gsl_vector const * apop_varad_var(v2, NULL); char apop_varad_var(inplace, 'n'); return apop_vector_stack_base(v1, v2, inplace); } gsl_vector * apop_vector_stack_base(gsl_vector *v1, gsl_vector const * v2, char inplace){ #endif gsl_vector *out; gsl_vector t; if (!v1 && v2){ out = gsl_vector_alloc(v2->size); gsl_vector_memcpy(out, v2); return out; } else if (!v2 && v1){ if (inplace == 'y') return v1; out = gsl_vector_alloc(v1->size); gsl_vector_memcpy(out, v1); return out; } else if (!v1 && !v2) return NULL; //else: size_t v1size = v1->size; //save in case of reallocing. 
if (inplace == 'y' ) out = apop_vector_realloc(v1, v1->size+v2->size); else { out = gsl_vector_alloc(v1->size + v2->size); t = gsl_vector_subvector(out, 0, v1size).vector; gsl_vector_memcpy(&t, v1); } t = gsl_vector_subvector(out, v1size, v2->size).vector; gsl_vector_memcpy(&t, v2); return out; } /** Put the first matrix either on top of or to the left of the second matrix. By default, returns a new matrix, meaning that at the end of this function, until you \c gsl_matrix_free() the original matrices, you will be taking up twice as much memory. Plan accordingly. \param m1 the upper/leftmost matrix (default: \c NULL, in which case this copies \c m2) \param m2 the second matrix (default: \c NULL, in which case a copy of \c m1 is returned, or \c m1 itself if \c inplace is \c 'y') \param posn If \c 'r', stack rows on top of other rows. If \c 'c' stack columns next to columns. (default: \c 'r') \param inplace If \c 'y', use \ref apop_matrix_realloc to modify \c m1 in place; see the caveats on that function. Otherwise, allocate a new matrix, leaving \c m1 undisturbed. (default: \c 'n') \return the stacked data, either in a new matrix or a pointer to \c m1. For example, here is a function to merge four matrices into a single two-part-by-two-part matrix. The original matrices are unchanged. \code gsl_matrix *apop_stack_two_by_two(gsl_matrix *ul, gsl_matrix *ur, gsl_matrix *dl, gsl_matrix *dr){ gsl_matrix *output, *t; output = apop_matrix_stack(ul, ur, 'c'); t = apop_matrix_stack(dl, dr, 'c'); apop_matrix_stack(output, t, 'r', .inplace='y'); gsl_matrix_free(t); return output; } \endcode \li This function uses the \ref designated syntax for inputs.
*/ #ifdef APOP_NO_VARIADIC gsl_matrix * apop_matrix_stack(gsl_matrix *m1, gsl_matrix const * m2, char posn, char inplace){ #else apop_varad_head(gsl_matrix *, apop_matrix_stack){ gsl_matrix *apop_varad_var(m1, NULL); gsl_matrix const *apop_varad_var(m2, NULL); char apop_varad_var(posn, 'r'); char apop_varad_var(inplace, 'n'); return apop_matrix_stack_base(m1, m2, posn, inplace); } gsl_matrix * apop_matrix_stack_base(gsl_matrix *m1, gsl_matrix const * m2, char posn, char inplace){ #endif gsl_matrix *out; gsl_vector_view tmp_vector; if (!m1 && m2){ out = gsl_matrix_alloc(m2->size1, m2->size2); gsl_matrix_memcpy(out, m2); return out; } else if (!m2 && m1) { if (inplace =='y') return m1; out = gsl_matrix_alloc(m1->size1, m1->size2); gsl_matrix_memcpy(out, m1); return out; } else if (!m2 && !m1) return NULL; if (posn == 'r'){ Apop_stopif(m1->size2 != m2->size2, return NULL, 0, "When stacking matrices on top of each other, they have to have the same number of columns, but m1->size2==%zu and m2->size2==%zu. Returning NULL.", m1->size2, m2->size2); int m1size = m1->size1; if (inplace =='y') out = apop_matrix_realloc(m1, m1->size1 + m2->size1, m1->size2); else { out = gsl_matrix_alloc(m1->size1 + m2->size1, m1->size2); for (int i=0; i< m1size; i++){ tmp_vector = gsl_matrix_row(m1, i); gsl_matrix_set_row(out, i, &(tmp_vector.vector)); } } for (int i=m1size; i< m1size + m2->size1; i++){ gsl_vector_const_view tmp_vector = gsl_matrix_const_row(m2, i- m1size); gsl_matrix_set_row(out, i, &(tmp_vector.vector)); } return out; } else { Apop_stopif(m1->size1 != m2->size1, return NULL, 0, "When stacking matrices side by side, " "they have to have the same number of rows, but m1->size1==%zu and m2->size1==%zu. Returning NULL." 
, m1->size1, m2->size1); int m1size = m1->size2; if (inplace =='y') out = apop_matrix_realloc(m1, m1->size1, m1->size2 + m2->size2); else { out = gsl_matrix_alloc(m1->size1, m1->size2 + m2->size2); for (int i=0; i< m1size; i++) gsl_matrix_set_col(out, i, Apop_mcv(m1, i)); } for (int i=0; i< m2->size2; i++) gsl_matrix_set_col(out, i+ m1size, Apop_mcv((gsl_matrix*)m2, i)); return out; } } /** Test that all elements of a vector are within bounds, so you can preempt a procedure that is about to break on infinite or too-large values. \param in A gsl_vector \param max An upper and lower bound to the elements of the vector. (default: INFINITY) \return 1 if everything is bounded: not Inf, -Inf, or NaN, and \f$-\max < x < \max\f$;
0 otherwise. \li A \c NULL vector has no unbounded elements, so \c NULL input returns 1. You get a warning if apop_opts.verbose >= 2. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC int apop_vector_bounded(const gsl_vector *in, long double max){ #else apop_varad_head(int, apop_vector_bounded){ const gsl_vector * apop_varad_var(in, NULL) Apop_stopif(!in, return 1, 2, "You sent in a NULL vector; returning 1."); long double apop_varad_var(max, INFINITY) return apop_vector_bounded_base(in, max); } int apop_vector_bounded_base(const gsl_vector *in, long double max){ #endif for (size_t i=0; i< in->size; i++){ double x = gsl_vector_get(in, i); if (!gsl_finite(x) || x> max || x< -max) return 0; } return 1; } static gsl_vector* dot_for_apop_dot(const gsl_matrix *m, const gsl_vector *v, const CBLAS_TRANSPOSE_t flip){ #define Check_gslv(...) if (__VA_ARGS__) {gsl_vector_free(out); out=NULL;} gsl_vector *out = (flip ==CblasNoTrans) ? gsl_vector_calloc(m->size1) : gsl_vector_calloc(m->size2); Check_gslv(gsl_blas_dgemv (flip, 1.0, m, v, 0.0, out)) return out; } /** A convenience function for dot products, which requires less prep and typing than the \c gsl_blas_dgemm and \c gsl_blas_dgemv functions. It makes use of the semi-overloading of the \ref apop_data structure. \c d1 may be a vector or a matrix, and the same for \c d2, so this function can do vector dot matrix, matrix dot matrix, and so on. If \c d1 includes both a vector and a matrix, then later parameters will indicate which to use. \param d1 the left part of \f$ d1 \cdot d2\f$ \param d2 the right part of \f$ d1 \cdot d2\f$ \param form1 't' or 'p': transpose or prime \c d1->matrix, or, if \c d1->matrix is \c NULL, read \c d1->vector as a row vector.
'n' or 0: use matrix if present; no transpose. (the default)
'v': ignore the matrix and use the vector. \param form2 As above, with \c d2. \return an \ref apop_data set. If two matrices come in, the vector element is \c NULL and the matrix has the dot product; if either or both are vectors, the vector has the output and the matrix is \c NULL. \exception out->error='a' Allocation error. \exception out->error='d' dimension-matching error. \exception out->error='m' GSL math error. \exception NULL If you ask me to take the dot product of NULL, I return NULL. \li Some systems auto-transpose non-conforming matrices. You input a \f$3 \times 5\f$ and a \f$3 \times 5\f$ matrix, and the system assumes that you meant to transpose the second, producing a \f$(3 \times 5) \cdot (5 \times 3) \rightarrow (3 \times 3)\f$ output. Apophenia does not do this. First, it's ambiguous whether the output should be \f$3 \times 3\f$ or \f$5 \times 5\f$. Second, your next run might have three observations, and two \f$3 \times 3\f$ matrices don't require transposition; auto-transposition thus creates situations where bugs can pop up on only some iterations of a loop. \li For a vector \f$\cdot\f$ a matrix, the vector is always treated as a row vector, meaning that a \f$(3\times 1)\f$ dot a \f$(3\times 4)\f$ matrix is correct, and produces a \f$(1 \times 4)\f$ vector. For a matrix \f$\cdot\f$ a vector, the vector is always treated as a column vector. Requests for transposing the vector are ignored in both cases. \li As a corollary to the above rule, a vector dot a vector always produces a scalar, which will be put in the zeroth element of the output vector; see the example. \li If you want to multiply an \f$N \times 1\f$ vector \f$\cdot\f$ a \f$1 \times N\f$ vector to produce an \f$N \times N\f$ matrix, then use \ref apop_vector_to_matrix to turn your vectors into matrices; see the example.
\li A note for readers of Modeling with Data: the awkward instructions on using this function on p 130 are now obsolete, thanks to the designated initializer syntax for function calls. Notably, in the case where d1 is a vector and d2 a matrix, then apop_dot(d1,d2,'t') won't work, because 't' now refers to d1. Instead use apop_dot(d1,d2,.form2='t') or apop_dot(d1,d2,0, 't') \li This function uses the \ref designated syntax for inputs. Sample code: \include dot_products.c */ #ifdef APOP_NO_VARIADIC apop_data * apop_dot(const apop_data *d1, const apop_data *d2, char form1, char form2){ #else apop_varad_head(apop_data *, apop_dot){ const apop_data * apop_varad_var(d1, NULL) const apop_data * apop_varad_var(d2, NULL) Apop_stopif(!d1, return NULL, 1, "d1 is NULL; returning NULL"); Apop_stopif(!d2, return NULL, 1, "d2 is NULL; returning NULL"); char apop_varad_var(form1, 0) char apop_varad_var(form2, 0) return apop_dot_base(d1, d2, form1, form2); } apop_data * apop_dot_base(const apop_data *d1, const apop_data *d2, char form1, char form2){ #endif Set_gsl_handler int uselm, userm; gsl_matrix *lm = d1->matrix, *rm = d2->matrix; gsl_vector *lv = d1->vector, *rv = d2->vector; if (d1->matrix && form1 != 'v') uselm = 1; else if (d1->vector) uselm = 0; else { Apop_stopif(form1 == 'v', return NULL, 0, "You asked for a vector from the left data set, but " "its vector==NULL. Returning NULL."); Apop_stopif(1, return NULL, 0, "The left data set has neither non-NULL " "matrix nor vector. Returning NULL."); } if (d2->matrix && form2 != 'v') userm = 1; else if (d2->vector) userm = 0; else { Apop_stopif(form2 == 'v', return NULL, 0, "You asked for a vector from the right data set, but " "its vector==NULL. Returning NULL."); Apop_stopif(1, return NULL, 0, "The right data set has neither non-NULL " "matrix nor vector. 
Returning NULL."); } apop_data *out = apop_data_alloc(); #define Dimcheck(lr, lc, rr, rc) Apop_stopif((lc)!=(rr), out->error='d'; goto done,\ 0, "mismatched dimensions: %zuX%zu dot %zuX%zu. %s", (lr), (lc), (rr), (rc),\ ((lr)==(rr)) ? " Maybe transpose the first?" \ : ((rc)==(lc)) ? " Maybe transpose the second?" : ""); CBLAS_TRANSPOSE_t lt, rt; lt = (form1 == 'p' || form1 == 't' || form1 == 1) ? CblasTrans: CblasNoTrans; rt = (form2 == 'p' || form2 == 't' || form2 == 1) ? CblasTrans: CblasNoTrans; if (uselm && userm){ Dimcheck((lt== CblasNoTrans) ? lm->size1:lm->size2, (lt== CblasNoTrans) ? lm->size2:lm->size1, (rt== CblasNoTrans) ? rm->size1:rm->size2, (rt== CblasNoTrans) ? rm->size2:rm->size1) gsl_matrix *outm = gsl_matrix_calloc((lt== CblasTrans)? lm->size2: lm->size1, (rt== CblasTrans)? rm->size1: rm->size2); Check_gsl_with_out(gsl_blas_dgemm (lt,rt, 1, lm, rm, 0, outm)) out->matrix = outm; } else if (!uselm && userm){ Dimcheck((size_t)1, lv->size, (rt== CblasNoTrans) ? rm->size1:rm->size2, (rt== CblasNoTrans) ? rm->size2:rm->size1) //dgemv is always matrix first, then vector, so reverse from vm to mv: // if output vector has dimension matrix->size2, send CblasTrans // if output vector has dimension matrix->size1, send CblasNoTrans out->vector = dot_for_apop_dot(rm, lv , (rt == CblasNoTrans) ? CblasTrans : CblasNoTrans); Apop_stopif(!out->vector, out->error='m'; goto done, 0, "GSL-level math error"); } else if (uselm && !userm){ Dimcheck((lt== CblasNoTrans) ? lm->size1:lm->size2, (lt== CblasNoTrans) ? lm->size2:lm->size1, rv->size , (size_t)1) out->vector = dot_for_apop_dot(lm, rv , lt); Apop_stopif(!out->vector, out->error='m'; goto done, 0, "GSL-level math error"); } else if (!uselm && !userm){ double outd; Check_gsl_with_out(gsl_blas_ddot(lv, rv, &outd)) out->vector = gsl_vector_alloc(1); gsl_vector_set(out->vector, 0, outd); } //If using the vector, there's no meaningful name to assign. 
if (d1->names && uselm){ if (lt == CblasTrans) apop_name_stack(out->names, d1->names, 'r', 'c'); else apop_name_stack(out->names, d1->names, 'r'); } if (d2->names && userm){ if (rt == CblasTrans) apop_name_stack(out->names, d2->names, 'c', 'r'); else apop_name_stack(out->names, d2->names, 'c'); } done: Unset_gsl_handler return out; } apophenia-1.0+ds/apop_linear_constraint.c000066400000000000000000000215531262736346100206410ustar00rootroot00000000000000 /** \file apop_linear_constraint.c \c apop_linear_constraint finds a point that meets a set of linear constraints. This takes a lot of machinery, so it gets its own file. Copyright (c) 2007, 2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" static double magnitude(gsl_vector *v){ double out; gsl_blas_ddot(v, v, &out); return out; } static void find_nearest_point(gsl_vector *V, double k, gsl_vector *B, gsl_vector *out){ /* Find X such that BX =K and there is an S such that X + SB=V. */ double S=0; //S = (BV-K)/B'B. gsl_blas_ddot(B, V, &S); S -= k; assert(!gsl_isnan(S)); S /= magnitude(B); assert(!gsl_isnan(S)); gsl_vector_memcpy(out, B); //X = -SB +V gsl_vector_scale(out, -S); gsl_vector_add(out, V); assert(!gsl_isnan(gsl_vector_get(out,0))); } static int binds(gsl_vector const *v, double k, gsl_vector const *b, double margin){ double d; gsl_blas_ddot(v, b, &d); return d < k + margin; } static double trig_bit(gsl_vector *dimv, gsl_vector *otherv, double off_by){ double theta, costheta, dot, out; gsl_blas_ddot(dimv, otherv, &dot); costheta = dot/(magnitude(dimv)*magnitude(otherv)); theta = acos(costheta); out = off_by/gsl_pow_2(sin(theta)); return out; } /* The hard part is when your candidate point does not satisfy other constraints, so you need to translate the point until it meets the new hypersurface. How far is that? Project beta onto the new surface, and find the distance between that projection and the original surface. 
Then translate beta toward the original surface by that amount. The projection of the translated beta onto the new surface now also touches the old surface. */ static void get_candiate(gsl_vector *beta, apop_data *constraint, int current, gsl_vector *candidate, double margin){ double k, ck, off_by, s; gsl_vector *pseudobeta = NULL; gsl_vector *pseudocandidate = NULL; gsl_vector *pseudocandidate2 = NULL; gsl_vector *fix = NULL; gsl_vector *cc = Apop_rv(constraint, current); ck = gsl_vector_get(constraint->vector, current); find_nearest_point(beta, ck, cc, candidate); for (size_t i=0; i< constraint->vector->size; i++){ if (i!=current){ gsl_vector *other = Apop_rv(constraint, i); k =apop_data_get(constraint, i, -1); if (binds(candidate, k, other, margin)){ if (!pseudobeta){ pseudobeta = gsl_vector_alloc(beta->size); gsl_vector_memcpy(pseudobeta, beta); pseudocandidate = gsl_vector_alloc(beta->size); pseudocandidate2 = gsl_vector_alloc(beta->size); fix = gsl_vector_alloc(beta->size); } find_nearest_point(pseudobeta, k, other, pseudocandidate); find_nearest_point(pseudocandidate, ck, cc, pseudocandidate2); off_by = apop_vector_distance(pseudocandidate, pseudocandidate2); s = trig_bit(cc, other, off_by); gsl_vector_memcpy(fix, cc); gsl_vector_scale(fix, magnitude(cc)); gsl_vector_scale(fix, s); gsl_vector_add(pseudobeta, fix); find_nearest_point(pseudobeta, k, other, candidate); gsl_vector_memcpy(pseudobeta, candidate); } } } if (fix){ gsl_vector_free(fix); gsl_vector_free(pseudobeta); gsl_vector_free(pseudocandidate); gsl_vector_free(pseudocandidate2); } } /** This is designed to be called from within the constraint method of your \ref apop_model. Just write the constraint vector+matrix and this will do the rest. See \ref constr for detailed discussion. \param beta The proposed vector about to be tested. No default, must not be \c NULL. \param constraint A vector/matrix pair [v | m1 m2 ... 
mn] where each row is interpreted as a less-than inequality: \f$v < m1x1+ m2x2 + ... + mnxn\f$. For example, say your constraints are \f$3 < 2x + 4y - 7z\f$ and \f$y\f$ is positive, i.e. \f$0 < y\f$. Allocate and fill the matrix representing these two constraints via: \code apop_data *constr = apop_data_falloc((2,2,3), 3, 2, 4, -7, 0, 0, 1, 0); \endcode Default: each element is greater than zero. For three parameters this would be equivalent to setting \code apop_data *constr = apop_data_falloc((3,3,3), 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1); \endcode \param margin If zero, then this is a >= constraint, otherwise I will return a point this amount within the borders. You could try \c GSL_DBL_EPSILON, which is the machine epsilon for a \c double (about 2.2e-16), or something like 1e-3. Default = 0. \return The penalty: the distance between beta and the closest point that meets the constraints. If the constraint is met, the penalty is zero. If the constraint is not met, this \c beta is shifted by \c margin (Euclidean distance) to meet the constraints. \li If your \ref apop_data has more structure than a vector, try \ref apop_data_pack to pack it into a vector. This is what \ref apop_maximum_likelihood does. \li The function doesn't check for odd cases like coplanar constraints. \li This function uses the \ref designated syntax for inputs.
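As a rough sketch of the single-constraint case (plain C with no Apophenia or GSL calls; \c nearest_point is a hypothetical name used only for illustration, and the margin step is left out), the closest point on the boundary \f$b \cdot x = k\f$ is found by projecting \c beta onto that hyperplane. This mirrors the arithmetic in this file's internal \c find_nearest_point: \f$X = V - SB\f$ with \f$S = (B \cdot V - k)/(B \cdot B)\f$.

```c
#include <stddef.h>

// Hypothetical helper, not part of Apophenia: project the point v onto the
// hyperplane b.x = k and write the projected point to out.
// Assumes b is not the zero vector.
void nearest_point(size_t n, double const *v, double k, double const *b, double *out){
    double bv = 0, bb = 0;
    for (size_t i=0; i< n; i++){
        bv += b[i]*v[i];   // B.V
        bb += b[i]*b[i];   // B.B
    }
    double s = (bv - k)/bb;
    for (size_t i=0; i< n; i++) out[i] = v[i] - s*b[i];
}
```

For the constraint \f$2 < x + y\f$ and a proposed point at the origin, the projection is the point (1, 1), so the penalty before any margin would be \f$\sqrt{2}\f$.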
*/ #ifdef APOP_NO_VARIADIC long double apop_linear_constraint(gsl_vector *beta, apop_data * constraint, double margin){ #else apop_varad_head(long double, apop_linear_constraint){ static threadlocal apop_data *default_constraint; gsl_vector * apop_varad_var(beta, NULL); double apop_varad_var(margin, 0); apop_data * apop_varad_var(constraint, NULL); Apop_assert(beta, "The vector to be checked is NULL."); if (!constraint){ if (default_constraint && beta->size != default_constraint->vector->size){ apop_data_free(default_constraint); default_constraint = NULL; } if (!default_constraint){ default_constraint = apop_data_alloc(0,beta->size, beta->size); default_constraint->vector = gsl_vector_calloc(beta->size); gsl_matrix_set_identity(default_constraint->matrix); } constraint = default_constraint; } return apop_linear_constraint_base(beta, constraint, margin); } long double apop_linear_constraint_base(gsl_vector *beta, apop_data * constraint, double margin){ #endif static threadlocal gsl_vector *closest_pt = NULL; static threadlocal gsl_vector *candidate = NULL; static threadlocal gsl_vector *fix = NULL; int constraint_ct = constraint->matrix->size1; int bindlist[constraint_ct]; int i, bound = 0; /* For added efficiency, keep a scratch vector or two on hand. */ if (closest_pt==NULL || closest_pt->size != constraint->matrix->size2){ closest_pt = gsl_vector_calloc(beta->size); candidate = gsl_vector_alloc(beta->size); fix = gsl_vector_alloc(beta->size); closest_pt->data[0] = GSL_NEGINF; } /* Do any constraints bind?*/ memset(bindlist, 0, sizeof(int)*constraint_ct); for (i=0; i< constraint_ct; i++){ gsl_vector *c = Apop_rv(constraint, i); bound += bindlist[i] = binds(beta, apop_data_get(constraint, i, -1), c, margin); } if (!bound) return 0; //All constraints met. gsl_vector *base_beta = apop_vector_copy(beta); /* With only one constraint, it's easy. 
*/ if (constraint->vector->size==1){ gsl_vector *c = Apop_rv(constraint, 0); find_nearest_point(base_beta, constraint->vector->data[0], c, beta); goto add_margin; } /* Finally, multiple constraints, at least one binding. For each surface, pick a candidate point. Check whether the point meets the other constraints. if not, translate to a new point that works. [Do this by maintaining a pseudopoint that translates by the necessary amount.] Once you have a candidate point, compare its distance to the current favorite; keep the best. */ for (i=0; i< constraint_ct; i++) if (bindlist[i]){ get_candiate(base_beta, constraint, i, candidate, margin); if(apop_vector_distance(base_beta, candidate) < apop_vector_distance(base_beta, closest_pt)) gsl_vector_memcpy(closest_pt, candidate); } gsl_vector_memcpy(beta, closest_pt); add_margin: for (i=0; i< constraint_ct; i++){ if(bindlist[i]){ gsl_vector_memcpy(fix, Apop_rv(constraint, i)); gsl_vector_scale(fix, magnitude(fix)); gsl_vector_scale(fix, margin); gsl_vector_add(beta, fix); } } long double out = apop_vector_distance(base_beta, beta); gsl_vector_free(base_beta); return out; } apophenia-1.0+ds/apop_mapply.c000066400000000000000000000630101262736346100164170ustar00rootroot00000000000000 /** \file apop_mapply.c vector/matrix map/apply. */ /* Copyright (c) 2007, 2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. This file is a tour de force of the if statement. There are several possibilities: --user wants the vector, matrix, rows, columns, all items in the data set --user wants output in a new location, or written to the old. --user has extra parameters --user needs to know the index of the function --user wants the sum of the result (e.g., to find how many elements are NAN, or a sum of log-likelihoods). Further, Apophenia v0.22 introduced variadic, optional arguments, so we have a somewhat more robust syntax post-22, and also the prior syntax, which some may find useful. 
We thus have a lot of functions that all feed in to mapply_core, which, after several if statements, dispatches segments to different threads and hands the work to whichever loop function (rowloop, forloop, vectorloop, or the pre-22 variants) does the actual math. */

#include "apop_internal.h"
#include
static gsl_vector*mapply_core(apop_data *d, gsl_matrix *m, gsl_vector *vin, void *fn, gsl_vector *vout, bool use_index, bool use_param,void *param, char post_22, bool by_apop_rows);

typedef double apop_fn_v(gsl_vector*);
typedef void apop_fn_vtov(gsl_vector*);
typedef double apop_fn_d(double);
typedef void apop_fn_dtov(double*);
typedef double apop_fn_r(apop_data*);
typedef double apop_fn_vp(gsl_vector*, void *);
typedef double apop_fn_dp(double, void *);
typedef double apop_fn_rp(apop_data*, void *);
typedef double apop_fn_vpi(gsl_vector*, void *, int);
typedef double apop_fn_dpi(double, void *, int);
typedef double apop_fn_rpi(apop_data*, void *, int);
typedef double apop_fn_vi(gsl_vector*, int);
typedef double apop_fn_di(double, int);
typedef double apop_fn_ri(apop_data*, int);

/** Apply a function to every element of a data set, matrix or vector; or, apply a vector-taking function to every row or column of a matrix.

Your function could take any combination of a \c gsl_vector, a \c double, an \ref apop_data, a parameter set, and the position of the element in the vector or matrix. As such, the function takes twelve function inputs, one for each combination of vector/scalar/row input, params/no params, index/no index. Fortunately, because this function uses the \ref designated syntax for inputs, you will specify only one.

For example, here is a function that will clamp each element of the input data to the range \f$[-1, +1]\f$. It takes in a lone \c double and a parameter in a \c void*, so it gets sent to \ref apop_map via .fn_dp=cutoff.
\code
double cutoff(double in, void *limit_in){
    double *limit = limit_in;
    return GSL_MAX(-*limit, GSL_MIN(*limit, in));
}

double param = 1;
apop_map(your_data, .fn_dp=cutoff, .param=&param, .inplace='y');
\endcode

\param fn_v A function of the form double your_fn(gsl_vector *in)
\param fn_d A function of the form double your_fn(double in)
\param fn_r A function of the form double your_fn(apop_data *in)
\param fn_vp A function of the form double your_fn(gsl_vector *in, void *param)
\param fn_dp A function of the form double your_fn(double in, void *param)
\param fn_rp A function of the form double your_fn(apop_data *in, void *param)
\param fn_vpi A function of the form double your_fn(gsl_vector *in, void *param, int index)
\param fn_dpi A function of the form double your_fn(double in, void *param, int index)
\param fn_rpi A function of the form double your_fn(apop_data *in, void *param, int index)
\param fn_vi A function of the form double your_fn(gsl_vector *in, int index)
\param fn_di A function of the form double your_fn(double in, int index)
\param fn_ri A function of the form double your_fn(apop_data *in, int index)
\param in The input data set. If \c NULL, I'll return \c NULL immediately.
\param param A pointer to the parameters to be passed to those function forms taking a \c *param.
\param part Which part of the \c apop_data struct should I use?
'v'==Just the vector
'm'==Every element of the matrix, in turn
'a'==Both 'v' and 'm'
'r'==Apply a function \c gsl_vector \f$\to\f$ \c double to each row of the matrix
'c'==Apply a function \c gsl_vector \f$\to\f$ \c double to each column of the matrix
Default is 'a', but notice that I'll ignore a \c NULL vector or matrix, so if your data set has only a vector or only a matrix, that's what I'll use. \param all_pages If \c 'y', then follow the \c more pointer to subsequent pages. If \c 'n', handle only the first page of data. Default: \c 'n'. \param inplace If 'n' (the default), generate a new \ref apop_data set for output, which will contain the mapped values (and the names from the original set).
If 'y', modify in place. The \c double \f$\to\f$ \c double versions, \c 'v', \c 'm', and \c 'a', write to exactly the same location as before. The \c gsl_vector \f$\to\f$ \c double versions, \c 'r', and \c 'c', will write to the vector. Be careful: if you are writing in place and there is already a vector there, then the original vector is lost.
If 'v' (as in void), return \c NULL. (Default = 'n')
\exception out->error='p' missing or mismatched parts error, such as \c NULL matrix when you sent a function acting on the matrix element.

\li The function forms with r in them, like \c fn_ri, are row-by-row. I'll use \ref Apop_r to get each row in turn, and send it to the function. The first implication is that your function should be expecting a \ref apop_data set with exactly one row in it. The second is that \c part is ignored: it only makes sense to go row-by-row.
\li For these \c r functions, if you set \c inplace='y', then you will be modifying your input data set, row by row; if you set \c inplace='n', then I will return an \ref apop_data set whose \c vector element is as long as your data set (i.e., as long as the longest of your text, vector, or matrix parts).
\li If you call omp_set_num_threads(n) with \f$n>1\f$, I will split the data set into that many chunks and process them simultaneously. You need to watch out for the usual hang-ups about multithreaded programming, but if your data is iid, and each row's processing is independent of the others, you should have no problems. Bear in mind that generating threads takes some small overhead, so simple cases like adding a few hundred numbers will actually be slower when threading.
\li See \ref mapply for many more examples and notes.
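
To make the three \c inplace modes concrete for the \c double to \c double case, here is a minimal sketch using plain arrays in place of an \ref apop_data set. The names \c sketch_map, \c sketch_fn_dp, and \c sketch_cutoff are hypothetical stand-ins and not part of the library; the real function also handles names, pages, and the other function forms.

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

// Hypothetical stand-in for the fn_dp form: double your_fn(double in, void *param).
typedef double sketch_fn_dp(double in, void *param);

// Apply fn to each of the n elements of in.
//   inplace=='n': return a newly allocated array of results; in is untouched
//                 and the caller frees the result.
//   inplace=='y': overwrite in with the results and return in.
//   inplace=='v': call fn for its side effects only and return NULL.
double *sketch_map(double *in, size_t n, sketch_fn_dp *fn, void *param, char inplace){
    assert(fn != NULL);
    if (!in) return NULL;
    double *out = (inplace=='y') ? in
                : (inplace=='v') ? NULL
                : malloc(n * sizeof(double));
    if (inplace=='n' && !out) return NULL;   // allocation failed
    for (size_t i=0; i< n; i++){
        double val = fn(in[i], param);
        if (out) out[i] = val;
    }
    return out;
}

// Same idea as the cutoff example: clamp in to [-limit, +limit].
double sketch_cutoff(double in, void *limit_in){
    double limit = *(double*)limit_in;
    return in > limit ? limit : in < -limit ? -limit : in;
}
```

Calling sketch_map(data, n, sketch_cutoff, &limit, 'n') leaves \c data alone and returns a fresh array, while the same call with 'y' returns \c data itself after clamping it.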
\see apop_map_sum \ingroup all_public */ #ifdef APOP_NO_VARIADIC apop_data* apop_map(apop_data *in, apop_fn_d *fn_d, apop_fn_v *fn_v, apop_fn_r *fn_r, apop_fn_dp *fn_dp, apop_fn_vp *fn_vp, apop_fn_rp *fn_rp, apop_fn_dpi *fn_dpi, apop_fn_vpi *fn_vpi, apop_fn_rpi *fn_rpi, apop_fn_di *fn_di, apop_fn_vi *fn_vi, apop_fn_ri *fn_ri, void *param, int inplace, char part, int all_pages){ #else apop_varad_head(apop_data*, apop_map){ apop_data * apop_varad_var(in, NULL) if (!in) return NULL; apop_fn_v * apop_varad_var(fn_v, NULL) apop_fn_d * apop_varad_var(fn_d, NULL) apop_fn_r * apop_varad_var(fn_r, NULL) apop_fn_vp * apop_varad_var(fn_vp, NULL) apop_fn_dp * apop_varad_var(fn_dp, NULL) apop_fn_rp * apop_varad_var(fn_rp, NULL) apop_fn_vpi * apop_varad_var(fn_vpi, NULL) apop_fn_dpi * apop_varad_var(fn_dpi, NULL) apop_fn_rpi * apop_varad_var(fn_rpi, NULL) apop_fn_vi * apop_varad_var(fn_vi, NULL) apop_fn_di * apop_varad_var(fn_di, NULL) apop_fn_ri * apop_varad_var(fn_ri, NULL) int apop_varad_var(inplace, 'n') void * apop_varad_var(param, NULL) int by_vectors = fn_v || fn_vp || fn_vpi || fn_vi; char apop_varad_var(part, by_vectors ? 'r' : 'a') int apop_varad_var(all_pages, 'n') return apop_map_base(in, fn_d, fn_v, fn_r, fn_dp, fn_vp, fn_rp, fn_dpi, fn_vpi, fn_rpi, fn_di, fn_vi, fn_ri, param, inplace, part, all_pages); } apop_data* apop_map_base(apop_data *in, apop_fn_d *fn_d, apop_fn_v *fn_v, apop_fn_r *fn_r, apop_fn_dp *fn_dp, apop_fn_vp *fn_vp, apop_fn_rp *fn_rp, apop_fn_dpi *fn_dpi, apop_fn_vpi *fn_vpi, apop_fn_rpi *fn_rpi, apop_fn_di *fn_di, apop_fn_vi *fn_vi, apop_fn_ri *fn_ri, void *param, int inplace, char part, int all_pages){ #endif int use_param = (fn_vp || fn_dp || fn_rp || fn_vpi || fn_rpi || fn_dpi); int use_index = (fn_vi || fn_di || fn_ri || fn_vpi || fn_rpi|| fn_dpi); //Give me the first non-null input function. void *fn = fn_v ? (void *)fn_v : fn_d ? (void *)fn_d : fn_r ? (void *)fn_r : fn_vp ? (void *)fn_vp : fn_dp ? (void *)fn_dp :fn_rp ? 
(void *)fn_rp : fn_vpi ? (void *)fn_vpi : fn_rpi ? (void *)fn_rpi: fn_dpi ? (void *)fn_dpi : fn_vi ? (void *)fn_vi : fn_di ? (void *)fn_di : fn_ri ? (void *)fn_ri : NULL; int by_apop_rows = fn_r || fn_rp || fn_rpi || fn_ri; Apop_stopif((part=='c' || part=='r') && (fn_d || fn_dp || fn_dpi || fn_di), apop_return_data_error(p), 0, "You asked for a vector-oriented operation (.part='r' or .part='c'), but " "gave me a scalar-oriented function. Did you mean part=='a'?"); //Allocate output Get_vmsizes(in); //vsize, msize1, msize2, maxsize apop_data *out = (inplace=='y') ? in : (inplace=='v') ? NULL : by_apop_rows ? apop_data_alloc(GSL_MAX(in->textsize[0], maxsize)) : part == 'v' || (in->vector && ! in->matrix) ? apop_data_alloc(vsize) : part == 'm' ? apop_data_alloc(msize1, msize2) : part == 'a' ? apop_data_alloc(vsize, msize1, msize2) : part == 'r' ? apop_data_alloc(maxsize) : part == 'c' ? apop_data_alloc(msize2) : NULL; Apop_stopif(inplace=='y' && (part=='r'||part=='c') && !in->vector, in->vector=gsl_vector_alloc(maxsize), 2, "No vector in your input data set for me to write outputs to; " "allocating one for you of size %i", maxsize); if (in->names && out && !(inplace=='y')){ if (part == 'v' || (in->vector && ! in->matrix)) { apop_name_stack(out->names, in->names, 'v'); apop_name_stack(out->names, in->names, 'r'); } else if (part == 'm'){ apop_name_stack(out->names, in->names, 'r'); apop_name_stack(out->names, in->names, 'c'); } else if (!by_apop_rows && part == 'a'){ apop_name_free(out->names); out->names = apop_name_copy(in->names); } else if (by_apop_rows || part == 'r') apop_name_stack(out->names, in->names, 'r'); else if (part == 'c') apop_name_stack(in->names, out->names, 'r', 'c'); } if (by_apop_rows) mapply_core(in, NULL, NULL, fn, out ? out->vector : NULL, use_index, use_param, param, 'r', by_apop_rows); else { if (in->vector && (part == 'v' || part=='a')) mapply_core(NULL, NULL, in->vector, fn, out ? 
out->vector : NULL, use_index, use_param, param, 'r', by_apop_rows); if (in->matrix && (part == 'm' || part=='a')){ int smaller_dim = GSL_MIN(in->matrix->size1, in->matrix->size2); for (int i=0; i< smaller_dim; i++){ if (smaller_dim == in->matrix->size1){ gsl_vector *onevector = Apop_rv(in, i); if (inplace=='v') mapply_core(NULL, NULL, onevector, fn, NULL, use_index, use_param, param, 'r', by_apop_rows); else mapply_core(NULL, NULL, onevector, fn, Apop_rv(out, i), use_index, use_param, param, 'r', by_apop_rows); } else { gsl_vector *onevector = Apop_cv(in, i); if (inplace=='v') mapply_core(NULL, NULL, onevector, fn, NULL, use_index, use_param, param, 'c', by_apop_rows); else { gsl_vector *twovector = Apop_cv(out, i); mapply_core(NULL, NULL, onevector, fn, twovector, use_index, use_param, param, 'c', by_apop_rows); } } } } if (part == 'r' || part == 'c'){ Apop_stopif(!in->matrix, if (!out) out=apop_data_alloc(); out->error='p'; return out, 0, "You asked for me to operate on the %cs of the matrix, but the matrix is NULL.", part); mapply_core(NULL, in->matrix, NULL, fn, out ? out->vector : NULL, use_index, use_param, param, part, by_apop_rows); } } if ((all_pages=='y' || all_pages=='Y') && in->more){ out->more = apop_map_base(in->more, fn_d, fn_v, fn_r, fn_dp, fn_vp, fn_rp, fn_dpi, fn_vpi, fn_rpi, fn_di, fn_vi, fn_ri, param, inplace, part, all_pages); Apop_stopif(out->more->error, out->error=out->more->error, 1, "Error in subpage; marked parent page with same error code."); } return out; } /** \cond doxy_ignore */ typedef struct { void *fn; gsl_matrix *m; gsl_vector *v, *vin; apop_data *d; bool use_index, use_param; char rc; void *param; } threadpass; /** \endcond */ /* Mapply_core splits the database into an array of threadpass structs, then one of the following ...loop functions gets called, which does the actual for loop to step through the rows/columns/elements. 
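
The loop functions all pick the right call signature from the two flags use_param and use_index. A minimal sketch of that dispatch for the scalar case follows; every name beginning with sketch_ is hypothetical. One deliberate difference from the code in this file: the sketch carries the function in a generic function-pointer type, where this file uses a void pointer, because standard C only guarantees round trips between function-pointer types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef double sketch_d(double);
typedef double sketch_dp(double, void *);
typedef double sketch_di(double, int);
typedef double sketch_dpi(double, void *, int);
typedef void sketch_generic(void);   // carrier type for any of the above

typedef struct {
    sketch_generic *fn;
    bool use_param, use_index;
    void *param;
} sketch_pass;

// Call the stored function on one element, with the argument list
// implied by the two flags.
double sketch_call(const sketch_pass *tc, double inval, int i){
    assert(tc && tc->fn);
    return tc->use_param
        ? (tc->use_index ? ((sketch_dpi*)tc->fn)(inval, tc->param, i)
                         : ((sketch_dp*)tc->fn)(inval, tc->param))
        : (tc->use_index ? ((sketch_di*)tc->fn)(inval, i)
                         : ((sketch_d*)tc->fn)(inval));
}

// Two sample functions, one for the plain form and one for the index form.
double sketch_double_it(double in){ return 2*in; }
double sketch_add_index(double in, int i){ return in + i; }
```

The flags must agree with the true type of the stored function; calling through the wrong signature is undefined behavior, which is why the caller derives both flags from which named argument was non-NULL.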
*/ static void rowloop(threadpass *tc){ apop_fn_r *rtod=tc->fn; apop_fn_rp *fn_rp=tc->fn; apop_fn_rpi *fn_rpi=tc->fn; apop_fn_ri *fn_ri=tc->fn; Get_vmsizes(tc->d); //maxsize OMP_for (int i=0; i< maxsize; i++){ apop_data *onerow = Apop_r(tc->d, i); double val = tc->use_param ? (tc->use_index ? fn_rpi(onerow, tc->param, i) : fn_rp(onerow, tc->param) ) : (tc->use_index ? fn_ri(onerow, i) : rtod(onerow) ); if (tc->v) gsl_vector_set(tc->v, i, val); } } static void forloop(threadpass *tc){ apop_fn_v *vtod=tc->fn; apop_fn_vp *fn_vp=tc->fn; apop_fn_vpi *fn_vpi=tc->fn; apop_fn_vi *fn_vi=tc->fn; int max = tc->rc == 'r' ? tc->m->size1 : tc->m->size2; OMP_for (int i= 0; i< max; i++){ gsl_vector view = tc->rc == 'r' ? gsl_matrix_row(tc->m, i).vector : gsl_matrix_column(tc->m, i).vector; double val = tc->use_param ? (tc->use_index ? fn_vpi(&view, tc->param, i) : fn_vp(&view, tc->param) ) : (tc->use_index ? fn_vi(&view, i) : vtod(&view) ); if (tc->v) gsl_vector_set(tc->v, i, val); } } static void oldforloop(threadpass *tc){ apop_fn_vtov *vtov=tc->fn; if (tc->v){ tc->rc = 'r'; return forloop(tc); } OMP_for (int i=0; i< tc->m->size1; i++) vtov(Apop_mrv(tc->m, i)); } //if mapping to self, then set tc.v = in_v static void vectorloop(threadpass *tc){ apop_fn_d *dtod=tc->fn; apop_fn_dp *fn_dp=tc->fn; apop_fn_dpi *fn_dpi=tc->fn; apop_fn_di *fn_di=tc->fn; OMP_for (int i= 0; i< tc->vin->size; i++){ double inval = gsl_vector_get(tc->vin, i); double outval = tc->use_param ? (tc->use_index ? fn_dpi(inval, tc->param, i) : fn_dp(inval, tc->param)) : (tc->use_index ? 
fn_di(inval, i) : dtod(inval)); if (tc->v) gsl_vector_set(tc->v, i, outval); } } static void oldvectorloop(threadpass *tc){ apop_fn_dtov *dtov=tc->fn; if (tc->v) return vectorloop(tc); OMP_for (int i= 0; i< tc->vin->size; i++){ double *inval = gsl_vector_ptr(tc->vin, i); dtov(inval); } } static gsl_vector*mapply_core(apop_data *d, gsl_matrix *m, gsl_vector *vin, void *fn, gsl_vector *vout, bool use_index, bool use_param, void *param, char post_22, bool by_apop_rows){ Get_vmsizes(d); //maxsize threadpass tp = (threadpass) { .fn = fn, .m = m, .d = d, .vin = vin, .v = vout, .use_index = use_index, .use_param= use_param, .param = param, .rc = post_22 }; if (by_apop_rows) rowloop(&tp); else if (m) post_22 ? forloop(&tp) : oldforloop(&tp); else post_22 ? vectorloop(&tp) : oldvectorloop(&tp); return vout; } /** Map a function onto every row of a matrix. The function that you input takes in a \c gsl_vector and returns a \c double. This function will produce a sequence of vector views of each row of the input matrix, and send each to your function. It will output a \c gsl_vector holding your function's output for each row. \param m The matrix \param fn A function of the form double fn(gsl_vector* in) \return A \c gsl_vector with the corresponding value for each row. \li If you input a \c NULL matrix, I return \c NULL. \li See \ref mapply "the map/apply page" for details. \see \ref apop_map, \ref apop_map_sum */ gsl_vector *apop_matrix_map(const gsl_matrix *m, double (*fn)(gsl_vector*)){ if (!m) return NULL; gsl_vector *out = gsl_vector_alloc(m->size1); return mapply_core(NULL, (gsl_matrix*) m, NULL, fn, out, 0, 0, NULL, 0, false); } /** Apply a function to every row of a matrix. The function that you input takes in a \c gsl_vector and returns nothing. \c apop_matrix_apply will produce a vector view of each row, and send each row to your function. 
\param m The matrix
\param fn A function of the form void fn(gsl_vector* in) which may modify the data at the \c in pointer in place.

\li If the matrix is \c NULL, this is a no-op and returns immediately.
\li See \ref mapply "the map/apply page" for details.
\see \ref apop_map, \ref apop_map_sum */
void apop_matrix_apply(gsl_matrix *m, void (*fn)(gsl_vector*)){
    if (!m) return;
    mapply_core(NULL, m, NULL, fn, NULL, 0, 0, NULL, 0, false);
}

/** Map a function onto every element of a vector. This function will send each element to the function you provide, and will output a \c gsl_vector holding your function's output for each element.

\param v The input vector
\param fn A function of the form double fn(double in)
\return A \c gsl_vector (allocated by this function) with the corresponding value for each element.

\li If you input a \c NULL vector, I return \c NULL.
\li See \ref mapply "the map/apply page" for details.
\see \ref apop_map, \ref apop_map_sum */
gsl_vector *apop_vector_map(const gsl_vector *v, double (*fn)(double)){
    if (!v) return NULL;
    gsl_vector *out = gsl_vector_alloc(v->size);
    return mapply_core(NULL, NULL, (gsl_vector*) v, fn, out, 0, 0, NULL, 0, false);
}

/** Apply a function to every element of a vector. The function that you input takes in a \c double* and may modify the input value in place. This function will send a pointer to each element of your vector to your function.

\param v The input vector
\param fn A function of the form void fn(double *in)

\li If the vector is \c NULL, this is a no-op.
\li See \ref mapply "the map/apply page" for details.
\see \ref apop_map */ void apop_vector_apply(gsl_vector *v, void (*fn)(double*)){ if (!v) return; mapply_core(NULL, NULL, v, fn, NULL, 0, 0, NULL, 0, false); } static void apop_matrix_map_all_vector_subfn(const gsl_vector *in, gsl_vector *outv, double (*fn)(double)){ mapply_core(NULL, NULL, (gsl_vector *) in, fn, outv, 0, 0, NULL, 0, false); } /** Maps a function to every element in a matrix (as opposed to every row). \param in The matrix whose elements will be inputs to the function \param fn A function with a form like double f(double in). \return a matrix of the same size as the original, with the function applied. \li If you input a \c NULL matrix, I return \c NULL. \li See \ref mapply "the map/apply page" for details. \see \ref apop_map, \ref apop_map_sum */ gsl_matrix * apop_matrix_map_all(const gsl_matrix *in, double (*fn)(double)){ if (!in) return NULL; gsl_matrix *out = gsl_matrix_alloc(in->size1, in->size2); OMP_for (size_t i=0; i< in->size1; i++){ gsl_vector_const_view inv = gsl_matrix_const_row(in, i); apop_matrix_map_all_vector_subfn(&inv.vector, Apop_mrv(out, i), fn); } return out; } /** Applies a function to every element in a matrix (as opposed to every row) \param in The matrix whose elements will be inputs to the function \param fn A function with a form like void f(double *in) which may modify the data at the \c in pointer in place. \li If the matrix is \c NULL, this is a no-op and returns immediately. \li See \ref mapply "the map/apply page" for details. \see \ref apop_map, \ref apop_map_sum */ void apop_matrix_apply_all(gsl_matrix *in, void (*fn)(double *)){ if (!in) return; OMP_for (size_t i=0; i< in->size1; i++){ apop_vector_apply(Apop_mrv(in, i), fn); } } /** Returns the sum of the output of \c apop_vector_map. For example, apop_vector_map_sum(v, isnan) returns the count of elements of v that are \c NaN. \li If you input a \c NULL vector, I return the sum of zero items: zero. \li See \ref mapply "the map/apply page" for details. 
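
The map-then-sum pattern can be sketched with plain arrays as follows. The names \c sketch_map_sum and \c sketch_is_nan are hypothetical. The wrapper rests on an assumption about your platform: in standard C \c isnan is a macro, so where it cannot be passed directly as a function pointer, a small function with the double to double signature does the job.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

// Wrap the isnan macro in a real function that returns 1 for NaN, else 0.
double sketch_is_nan(double in){ return isnan(in) ? 1 : 0; }

// Hypothetical stand-in for map-then-sum over a plain array:
// apply fn to each of the n elements of in and add up the results.
double sketch_map_sum(const double *in, size_t n, double (*fn)(double)){
    assert(fn != NULL);
    if (!in) return 0;    // the sum of zero items is zero
    double total = 0;
    for (size_t i=0; i< n; i++) total += fn(in[i]);
    return total;
}
```

With an array holding 1.0, NaN, 3.0, NaN, the call sketch_map_sum(v, 4, sketch_is_nan) counts two NaN elements.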
\see \ref apop_map, \ref apop_map_sum */
double apop_vector_map_sum(const gsl_vector *in, double(*fn)(double)){
    if (!in) return 0;
    gsl_vector *m = apop_vector_map (in, fn);
    double out = apop_vector_sum(m);
    gsl_vector_free(m);
    return out;
}

/** Like \c apop_matrix_map_all, but returns the sum of the resulting mapped matrix. For example, apop_matrix_map_all_sum(m, isnan) returns the number of elements of m that are \c NaN.

\li If you input a \c NULL matrix, I return the sum of zero items: zero.
\li See \ref mapply "the map/apply page" for details.
\see \ref apop_map, \ref apop_map_sum */
double apop_matrix_map_all_sum(const gsl_matrix *in, double (*fn)(double)){
    if (!in) return 0;
    gsl_matrix *m = apop_matrix_map_all (in, fn);
    double out = apop_matrix_sum(m);
    gsl_matrix_free(m);
    return out;
}

/** Like \c apop_matrix_map, but returns the sum of the resulting mapped vector. For example, let \c log_like be a function that returns the log likelihood of an input vector; then apop_matrix_map_sum(m, log_like) returns the total log likelihood of the rows of \c m.

\li If you input a \c NULL matrix, I return the sum of zero items: zero.
\li See \ref mapply "the map/apply page" for details.
\see \ref apop_map, \ref apop_map_sum */
double apop_matrix_map_sum(const gsl_matrix *in, double (*fn)(gsl_vector*)){
    if (!in) return 0;
    gsl_vector *v = apop_matrix_map (in, fn);
    double out = apop_vector_sum(v);
    gsl_vector_free(v);
    return out;
}

/** A function that effectively calls \ref apop_map and returns the sum of the resulting elements. Thus, this function returns a \c double. See the \ref apop_map page for details of the inputs, which are the same here, except that \c inplace doesn't make sense---this function will always just add up the input function outputs.

\li I don't copy the input data to send to your input function. Therefore, if your function modifies its inputs as a side-effect, your data set will be modified as this function runs.
\li The sum of zero elements is zero, so that is what is returned if the input \ref apop_data set is \c NULL. If apop_opts.verbose >= 2 print a warning. \li See \ref mapply for many more examples and notes. \li This function uses the \ref designated syntax for inputs. \ingroup all_public */ #ifdef APOP_NO_VARIADIC double apop_map_sum(apop_data *in, apop_fn_d *fn_d, apop_fn_v *fn_v, apop_fn_r *fn_r, apop_fn_dp *fn_dp, apop_fn_vp *fn_vp, apop_fn_rp *fn_rp, apop_fn_dpi *fn_dpi, apop_fn_vpi *fn_vpi, apop_fn_rpi *fn_rpi, apop_fn_di *fn_di, apop_fn_vi *fn_vi, apop_fn_ri *fn_ri, void *param, char part, int all_pages){ #else apop_varad_head(double, apop_map_sum){ apop_data * apop_varad_var(in, NULL) Apop_stopif(!in, return 0, 2, "NULL input. Returning zero."); apop_fn_v * apop_varad_var(fn_v, NULL) apop_fn_d * apop_varad_var(fn_d, NULL) apop_fn_r * apop_varad_var(fn_r, NULL) apop_fn_vp * apop_varad_var(fn_vp, NULL) apop_fn_dp * apop_varad_var(fn_dp, NULL) apop_fn_rp * apop_varad_var(fn_rp, NULL) apop_fn_vpi * apop_varad_var(fn_vpi, NULL) apop_fn_dpi * apop_varad_var(fn_dpi, NULL) apop_fn_rpi * apop_varad_var(fn_rpi, NULL) apop_fn_vi * apop_varad_var(fn_vi, NULL) apop_fn_di * apop_varad_var(fn_di, NULL) apop_fn_ri * apop_varad_var(fn_ri, NULL) void * apop_varad_var(param, NULL) char apop_varad_var(part, ((fn_v||fn_vp||fn_vpi||fn_vi) ? 
'r' : 'a')); int apop_varad_var(all_pages, 'n') return apop_map_sum_base(in, fn_d, fn_v, fn_r, fn_dp, fn_vp, fn_rp, fn_dpi, fn_vpi, fn_rpi, fn_di, fn_vi, fn_ri, param, part, all_pages); } double apop_map_sum_base(apop_data *in, apop_fn_d *fn_d, apop_fn_v *fn_v, apop_fn_r *fn_r, apop_fn_dp *fn_dp, apop_fn_vp *fn_vp, apop_fn_rp *fn_rp, apop_fn_dpi *fn_dpi, apop_fn_vpi *fn_vpi, apop_fn_rpi *fn_rpi, apop_fn_di *fn_di, apop_fn_vi *fn_vi, apop_fn_ri *fn_ri, void *param, char part, int all_pages){ #endif apop_data *mapped = apop_map(in, .fn_d=fn_d, .fn_v=fn_v, .fn_r=fn_r, .fn_dp=fn_dp, .fn_vp=fn_vp, .fn_rp=fn_rp, .fn_dpi=fn_dpi, .fn_vpi=fn_vpi, .fn_rpi=fn_rpi, .fn_di=fn_di, .fn_vi=fn_vi, .fn_ri=fn_ri, .param=param, .part=part, .inplace='n', .all_pages='n'); double outsum = (mapped->vector ? apop_sum(mapped->vector) : 0) + (mapped->matrix ? apop_matrix_sum(mapped->matrix) : 0); apop_data_free(mapped); return outsum + (((all_pages=='y' || all_pages=='Y') && in->more) ? apop_map_sum_base(in->more, fn_d, fn_v, fn_r, fn_dp, fn_vp, fn_rp, fn_dpi, fn_vpi, fn_rpi, fn_di, fn_vi, fn_ri, param, part, all_pages) : 0); } /** \} */ apophenia-1.0+ds/apop_mcmc.c000066400000000000000000000405501262736346100160400ustar00rootroot00000000000000 /** \file Markov Chain Monte Carlo. */ /* Copyright (c) 2014 by Ben Klemens. Licensed under the GNU GPL v2; see COPYING. */ #include "apop_internal.h" #include ///default step and adapt fns. static void step_to_vector(double const *d, apop_mcmc_proposal_s *ps, apop_mcmc_settings *ms){ apop_model *m = ps->proposal; memcpy(m->parameters->vector->data, d, sizeof(double)*m->parameters->vector->size); ps->adapt_fn(ps, ms); } int sigma_adapt(apop_mcmc_proposal_s *ps, apop_mcmc_settings *ms){ apop_model *m = ps->proposal; //accept rate. 
Add 1% * target to numerator; 1% to denominator, to slow early jumps double ar = (ps->accept_count + .01*ms->periods *ms->target_accept_rate) /(ps->accept_count + ps->reject_count + .01*ms->periods); /* double std_dev_scale= (ar > ms->target_accept_rate) ? (2 - (1.-ar)/(1.-ms->target_accept_rate)) : (1/(2-((ar+0.0)/ms->target_accept_rate))); */ double scale = ar/ms->target_accept_rate; scale = 1+ (scale-1)/100.; //gsl_matrix_scale(m->parameters->matrix, scale > .1? ( scale < 10 ? scale : 10) : .1); gsl_matrix_scale(m->parameters->matrix, scale); return 0; } /////// apop_mcmc_settings Apop_settings_init(apop_mcmc, Apop_varad_set(periods, 6e3); Apop_varad_set(burnin, 0.05); Apop_varad_set(target_accept_rate, 0.35); Apop_varad_set(gibbs_chunks, 'b'); Apop_varad_set(start_at, '1'); Apop_varad_set(base_step_fn, step_to_vector); Apop_varad_set(base_adapt_fn, sigma_adapt); //all else defaults to zero/NULL ) Apop_settings_copy(apop_mcmc, if (in->block_count){ out->proposals = calloc(in->block_count, sizeof(apop_mcmc_proposal_s)); for (int i=0; i< in->block_count; i++){ out->proposals[i] = in->proposals[i]; out->proposals[i].proposal = apop_model_copy(in->proposals[i].proposal); } out->proposal_is_cp=1; } ) Apop_settings_free(apop_mcmc, if (in->proposal_is_cp) { for (int i=0; i< in->block_count; i++) apop_model_free(in->proposals[i].proposal); free(in->proposals); } ) static void setup_normal_proposals(apop_mcmc_proposal_s *s, int tsize, apop_mcmc_settings *settings){ apop_model *mvn = apop_model_copy(apop_multivariate_normal); mvn->parameters = apop_data_alloc(tsize, tsize, tsize); gsl_vector_set_all(mvn->parameters->vector, 1); gsl_matrix_set_identity(mvn->parameters->matrix); s->proposal = mvn; s->step_fn = settings->base_step_fn; s->adapt_fn = settings->base_adapt_fn; } static void set_block_count_and_block_starts(apop_data *in, apop_mcmc_settings *s, size_t total_len){ if (s->gibbs_chunks =='a') { s->block_count = 1; s->block_starts = calloc(2, sizeof(size_t)); 
        s->block_starts[1] = total_len;
    } else if (s->gibbs_chunks =='b') {
        s->block_count = 0;
        for (apop_data *d = in; d; d=d->more)
            s->block_count += !!d->vector + !!d->matrix + !!d->weights;
        s->block_starts = calloc(s->block_count+1, sizeof(size_t));
        int this=1, ctr=0;
        for (apop_data *d = in; d; d=d->more){
            #define markit(test, value) if (test) \
                s->block_starts[this++] = ctr += value;
            markit(d->vector, d->vector->size);
            markit(d->matrix, d->matrix->size1*d->matrix->size2);
            markit(d->weights, d->weights->size);
        }
    } else { // item-by-item
        s->block_count = total_len;
        s->block_starts = calloc(total_len+1, sizeof(size_t));
        for (int i=1; i< total_len+1; i++) s->block_starts[i] = i;
    }
}

static void one_step(apop_data *d, gsl_vector *draw, apop_model *m, apop_mcmc_settings *s, gsl_rng *rng, int *constraint_fails, apop_data *out, size_t block, int out_row){
    gsl_vector *clean_copy = apop_vector_copy(draw);
    newdraw:
    apop_draw(draw->data + s->block_starts[block], rng, s->proposals[block].proposal);
    apop_data_unpack(draw, m->parameters);
    if (m->constraint && m->constraint(d, m)){
        (*constraint_fails)++;
        goto newdraw;
    }
    double ll = apop_log_likelihood(d, m);
    Apop_notify(3, "ll=%g for parameters:\t", ll);
    if (apop_opts.verbose >=3) apop_data_print(m->parameters, .output_pipe=apop_opts.log_file);
    Apop_stopif(gsl_isnan(ll) || !isfinite(ll), goto newdraw, 1, "Trouble evaluating the m function at vector beginning with %g. 
" "Throwing it out and trying again.\n" , m->parameters->vector->data[0]); double ratio = ll - s->last_ll; if (ratio >= 0 || log(gsl_rng_uniform(rng)) < ratio){//success if (s->proposals[block].step_fn) s->proposals[block].step_fn(draw->data + s->block_starts[block], s->proposals+block, s); s->last_ll = ll; s->proposals[block].accept_count++; s->accept_count++; } else { s->proposals[block].reject_count++; s->reject_count++; Apop_notify(3, "reject, with exp(ll_now-ll_proposal) = exp(%g-%g) = %g.", ll, s->last_ll, exp(ratio)); gsl_vector_memcpy(draw, clean_copy); apop_data_unpack(draw, m->parameters); //keep the last success in m->parameters. } if (out_row>=0) gsl_vector_memcpy(Apop_rv(out, out_row), draw); } /** The draw method for models estimated via \ref apop_model_metropolis. That method produces an \ref apop_pmf, typically with a few thousand draws from the model in a batch. If you want to get a single next step from the Markov chain, use this. A Markov chain works by making a new draw and then accepting or rejecting the draw. If the draw is rejected, the last value is reported as the next step in the chain. Users sometimes mitigate this repetition by making a batch of draws (say, ten at a time) and using only the last. If you run this without first running \ref apop_model_metropolis, I will run it for you, meaning that there will be an initial burn-in period before the first draw that can be reported to you. That run is done using \c model->data as input. \param out An array of \c doubles, which will hold the draw, in the style of \ref apop_draw. \param rng A \c gsl_rng, already initialized, probably via \ref apop_rng_alloc. \param model A model which was probably already run through \ref apop_model_metropolis. \return On return, \c out is filled with the next step in the Markov chain. The ->data element of the PMF model is extended to include the additional steps in the chain. If a proposal failed the model constraints, then return 1; else return 0. 
See the notes in the documentation for \ref apop_model_metropolis. \li After pulling the attached settings group, the parent model is ignored. One expects that \c base_model in the mcmc settings group == the parent model. \li If your settings break the model parameters into several chunks, this function returns after stepping through all chunks. \ingroup all_public */ int apop_model_metropolis_draw(double *out, gsl_rng* rng, apop_model *model){ apop_mcmc_settings *s = apop_settings_get_group(model, apop_mcmc); if (!s || !s->pmf) { apop_model_metropolis(model->data, rng, model); s = apop_settings_get_group(model, apop_mcmc); } int constraint_fails = 0; OMP_critical (metro_draw) { apop_model *m = s->base_model; gsl_vector_view vv = gsl_vector_view_array(out, s->block_starts[s->block_count]); apop_data_pack(m->parameters, &(vv.vector)); apop_data *earlier_draws = s->pmf->data; int block = 0, done = 0; while (!done){ s->proposal_count++; earlier_draws->matrix = apop_matrix_realloc(earlier_draws->matrix, earlier_draws->matrix->size1+1, earlier_draws->matrix->size2); one_step(s->base_model->data, &(vv.vector), m, s, rng, &constraint_fails, earlier_draws, block, earlier_draws->matrix->size1-1); block = (block+1) % s->block_count; done = !block; //have looped back to the start. 
s->proposals[block].adapt_fn(s->proposals+block, s); } Apop_stopif(constraint_fails, , 2, "%i proposals failed to meet your model's parameter constraints", constraint_fails); } return !!constraint_fails; } void main_mcmc_loop(apop_data *d, apop_model *m, apop_data *out, gsl_vector *draw, apop_mcmc_settings *s, gsl_rng *rng, int *constraint_fails){ double integerpart_periods_burnin = GSL_NAN; modf((double)(s->periods)*s->burnin,&integerpart_periods_burnin); s->accept_count = 0; int out_row = -lround(integerpart_periods_burnin); int block = 0; for (s->proposal_count=1; s->proposal_count< s->periods+1; s->proposal_count++, out_row++){ one_step(d, draw, m, s, rng, constraint_fails, out, block, out_row); block = (block+1) % s->block_count; s->proposals[block].adapt_fn(s->proposals+block, s); //if (constraint_fails>10000) break; } } /** Use Metropolis-Hastings Markov chain Monte Carlo to make draws from the given model. The basic storyline is that draws are made from a proposal distribution, and the likelihood of your model given your data and the drawn parameters evaluated. At each step, a new set of proposal parameters are drawn, and if they are more likely than the previous set the new proposal is accepted as the next step, else with probability (prob of new params)/(prob of old params), they are accepted as the next step anyway. Otherwise the last accepted proposal is repeated. The output is an \ref apop_pmf model with a data set listing the draws that were accepted, including those repetitions. The output model is modified so that subsequent draws are one more step from the Markov chain, via \ref apop_model_metropolis_draw. \param d The \ref apop_data set used for evaluating the likelihood of a proposed parameter set. \param rng A \c gsl_rng, probably allocated via \ref apop_rng_alloc. (Default: an RNG from \ref apop_rng_get_thread) \param m The \ref apop_model from which parameters are being drawn. 
(No default; must not be \c NULL) \return A modified \ref apop_pmf model representing the results of the search. It has a specialized \c draw method that returns another step from the Markov chain with each draw. \exception out->error='c' Proposal was outside of a constraint; see below. \li If a proposal fails to meet the \c constraint element of the model you input, then the proposal is thrown out and a new one selected. By the default proposal distribution, this is not mathematically correct (it breaks detailed balance), and values near the constraint will be oversampled. The output model will have outmodel->error=='c'. It is up to you to decide whether the resulting distribution is good enough for your purposes or whether to take the time to write a custom proposal and step function to accommodate the constraint. Attach an \ref apop_mcmc_settings group to your model to specify the proposal distribution, burnin, and other details of the search. See the \ref apop_mcmc_settings documentation for details. \li The default proposal includes an adaptive step: you specify a target accept rate (default: .35), and if the accept rate is currently higher the variance of the proposals is widened to explore more of the space; if the accept rate is currently lower the variance is narrowed to stay closer to the last accepted proposal. Technically, this breaks ergodicity of the Markov chain, but the consensus seems to be that this is not a serious problem. If it does concern you, you can set the \c base_adapt_fn in the \ref apop_mcmc_settings group to a do-nothing function, or one that damps its adaptation as \f$n\to\infty\f$. \li If you have a univariate model, \ref apop_arms_draw may be a suitable simpler alternative. \li Note the \c gibbs_chunks element of the \ref apop_mcmc_settings group. If you set \c gibbs_chunks='a', all parameters are drawn as a set, and accepted/rejected as a set. The variances are adapted at an identical rate. 
If you set \c gibbs_chunks='i', then each scalar parameter is assigned its own proposal distribution, which is adapted at its own pace. With \c gibbs_chunks='b' (the default), then each of the vector, matrix, and weights of your model's parameters are drawn/accepted/adapted as a block (and so on to additional chunks if your model has ->more pages). This works well for complex models which naturally break down into subsets of parameters. \li Each chunk counts as a step in the Markov chain. Therefore, if there are several chunks, you can expect chunks to repeat from step to step. If you want a draw after cycling through all chunks, try using \ref apop_model_metropolis_draw, which has that behavior. \li If the likelihood model has \c NULL parameters, I will allocate them. That means you can use one of the stock models that ship with Apophenia. If I need to run the model's prep routine to get the size of the parameters, then I will make a copy of the likelihood model, run prep, and then allocate parameters for that copy of a model. \li On exit, the \c parameters element of your likelihood model has the last accepted parameter proposal. \li If you set apop_opts.verbose=2 or greater, I will report the accept rate of the M-H sampler. It is a common rule of thumb to select a proposal so that this is between 20% and 50%. Set apop_opts.verbose=3 to see the stream of proposal points, their likelihoods, and the acceptance odds. You may want to set apop_opts.log_file=fopen("yourlog", "w") first. \li This function uses the \ref designated syntax for inputs. 
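
A minimal, self-contained sketch of the accept/reject rule described above, assuming the two likelihoods are already in hand. The helper name \c mh_accept and its arguments are hypothetical and not part of Apophenia; \c u stands in for a Uniform(0,1) draw such as the one \c gsl_rng_uniform returns. The library's own step makes the same comparison on log likelihoods rather than on raw probabilities.

```c
// Hypothetical helper, not part of Apophenia.
// p_new and p_old are the (positive) likelihoods of the proposed and the
// last-accepted parameters; u is a draw from Uniform(0,1).
// Returns 1 to accept the proposal, 0 to repeat the last accepted point.
int mh_accept(double p_new, double p_old, double u){
    if (p_new >= p_old) return 1;   // more likely than the last point: always accept
    return u < p_new/p_old;         // otherwise accept with probability p_new/p_old
}
```

A rejected proposal still counts as a step of the chain: the last accepted point is written out again, which is why the output includes repetitions.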
*/ #ifdef APOP_NO_VARIADIC apop_model * apop_model_metropolis(apop_data *d, gsl_rng *rng, apop_model *m){ #else apop_varad_head(apop_model *, apop_model_metropolis){ apop_data *apop_varad_var(d, NULL); apop_model *apop_varad_var(m, NULL); Apop_stopif(!m, return NULL, 0, "NULL model input."); gsl_rng *apop_varad_var(rng, apop_rng_get_thread(-1)); return apop_model_metropolis_base(d, rng, m); } apop_model * apop_model_metropolis_base(apop_data *d, gsl_rng *rng, apop_model *m){ #endif apop_model *outp; OMP_critical(metropolis) { apop_mcmc_settings *s = apop_settings_get_group(m, apop_mcmc); if (!s) s = Apop_model_add_group(m, apop_mcmc); apop_prep(d, m); //typically a no-op s->last_ll = GSL_NEGINF; gsl_vector * drawv = apop_data_pack(m->parameters); const double double_periods = (double)(s->periods); Apop_stopif(s->burnin > 1, s->burnin/=double_periods, 1, "Burn-in should be a fraction of the number of periods, " "not a whole number of periods. Rescaling to burnin=%g." , s->burnin/double_periods); double integerpart_periods_cburnin = GSL_NAN; modf(double_periods*(1.0-s->burnin),&integerpart_periods_cburnin); const size_t data_size1 = llround(integerpart_periods_cburnin); apop_data *out = apop_data_alloc(data_size1, drawv->size); if (!s->proposals){ set_block_count_and_block_starts(m->parameters, s, drawv->size); s->proposals = calloc(s->block_count, sizeof(apop_mcmc_proposal_s)); s->proposal_is_cp = 1; for (int i=0; i< s->block_count; i++){ apop_mcmc_proposal_s *p = s->proposals+i; setup_normal_proposals(p, s->block_starts[i+1]-s->block_starts[i], s); if (!p->proposal->parameters) { apop_prep(NULL, p->proposal+i); if(p->proposal->parameters->matrix) gsl_matrix_set_all(p->proposal->parameters->matrix, 1); if(p->proposal->parameters->vector) gsl_vector_set_all(p->proposal->parameters->vector, 1); } } } //if s->start_at =='p', we already have m->parameters in drawv. 
if (s->start_at == '1') gsl_vector_set_all(drawv, 1); int constraint_fails = 0; main_mcmc_loop(d, m, out, drawv, s, rng, &constraint_fails); Apop_notify(2, "M-H sampling accept percent = %3.3f%%", 100*(0.0+s->accept_count)/s->periods); Apop_stopif(constraint_fails, out->error='c', 2, "%i proposals failed to meet your model's parameter constraints", constraint_fails); out->weights = gsl_vector_alloc(s->periods*(1-s->burnin)); gsl_vector_set_all(out->weights, 1); outp = apop_estimate(out, apop_pmf); s->pmf = outp; s->base_model = m; outp->draw = apop_model_metropolis_draw; apop_settings_copy_group(outp, m, "apop_mcmc"); gsl_vector_free(drawv); } return outp; } apophenia-1.0+ds/apop_missing_data.c000066400000000000000000000151751262736346100175700ustar00rootroot00000000000000 /** \file apop_missing_data.c Some missing data handlers. */ /* Copyright (c) 2007, 2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include /** If there is an NaN anywhere in the row of data (including the matrix, the vector, the weights, and the text) then delete the row from the data set. \li If every row has a NaN, then this returns \c NULL. \li If \c apop_opts.nan_string is not \c NULL, then I will make case-insensitive comparisons to the text elements to check for bad data as well. \li If \c inplace = 'y', then I'll free each element of the input data set and refill it with the pruned elements. I'll still take up (up to) twice the size of the data set in memory during the function. If every row has a NaN, then your \c apop_data set will end up with \c NULL vector, matrix, .... if \c inplace = 'n', then the original data set is left where it was, though internal elements may be moved. \li I only look at the first page of data (i.e. the \c more element is ignored). \li Listwise deletion is often not a statistically valid means of dealing with missing data. It is typically better to impute the data (preferably multiple times). 
See \ref apop_ml_impute for a less-invalid means, or Tea for survey imputation for heavy-duty survey editing and imputation.

\li This function uses the \ref designated syntax for inputs.

\param d       The data, with NaNs
\param inplace If \c 'y', clear out the pointer-to-\ref apop_data that you sent in and refill with the pruned data. If \c 'n', leave the set alone and return a new data set. Default=\c 'n'.
\return        A (potentially shorter) copy of the data set, without NaNs. If inplace=='y', a pointer to the input, which was shortened in place. If the entire data set is cleared out, then this will be \c NULL.
\see apop_data_rm_rows
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_data_listwise_delete(apop_data *d, char inplace){
#else
apop_varad_head(apop_data *, apop_data_listwise_delete){
    apop_data * apop_varad_var(d, NULL);
    if (!d) return NULL;
    char apop_varad_var(inplace, 'n');
    return apop_data_listwise_delete_base(d, inplace);
}

 apop_data * apop_data_listwise_delete_base(apop_data *d, char inplace){
#endif
    Get_vmsizes(d) //defines firstcol, vsize, wsize, msize1, msize2.
    Apop_stopif(!msize1 && !vsize && !*d->textsize, return NULL, 0,
            "You sent to apop_data_listwise_delete a data set with NULL matrix, NULL vector, and no text. "
            "Confused, it is returning NULL.");
    //find out where the NaNs are
    int len = GSL_MAX(vsize ? vsize : msize1, d->textsize[0]); //still some size assumptions here.
    int not_empty = 0;
    int *marked = calloc(len, sizeof(int));
    for (int i=0; i< (vsize ? vsize: msize1); i++)
        for (int j=firstcol; j< msize2; j++)
            if (gsl_isnan(apop_data_get(d, i, j))){
                marked[i] = 1;
                break;
            }
    for (int i=0; i< wsize; i++)
        if (gsl_isnan(gsl_vector_get(d->weights, i)))
            marked[i] = 1;
    if (d->textsize[0] && apop_opts.nan_string){
        for(int i=0; i< d->textsize[0]; i++)
            if (!marked[i])
                for(int j=0; j< d->textsize[1]; j++)
                    if (!strcasecmp(apop_opts.nan_string, d->text[i][j])){
                        marked[i] ++;
                        break;
                    }
    }
    //check that at least something isn't NULL.
    for (int i=0; i< len; i++)
        if (!marked[i]){
            not_empty ++;
            break;
        }
    if (!not_empty){
        free(marked);
        return NULL;
    }
    apop_data *out = (inplace=='y'|| inplace=='Y') ? 
d : apop_data_copy(d); apop_data_rm_rows(out, marked); free(marked); return out; } //ML imputation /** \hideinitializer */ #define Switch_back \ apop_data *real_data = ml_model->parameters; \ apop_model *actual_base = ml_model->more; \ actual_base->parameters = d; static void i_est(apop_data *d, apop_model *ml_model){ Switch_back actual_base = apop_estimate(real_data, actual_base); } static long double i_ll(apop_data *d, apop_model *ml_model){ Switch_back return apop_log_likelihood(real_data, actual_base); } static long double i_p(apop_data *d, apop_model *ml_model){ Switch_back return apop_p(real_data, actual_base); } //doesn't actually move the parameters static long double i_constraint(apop_data *d, apop_model *ml_model){ Switch_back if (!actual_base->constraint) return 0; apop_data *original_params = apop_data_copy(actual_base->parameters); long double out = actual_base->constraint(real_data, actual_base); if (out) apop_data_memcpy(actual_base->parameters, original_params); apop_data_free(original_params); return out; } apop_model *apop_swap_model = &(apop_model){"Model with data and params swapped", .estimate=i_est, .p = i_p, .log_likelihood=i_ll, .constraint = i_constraint}; /** Impute the most likely data points to replace NaNs in the data, and insert them into the given data. That is, the data set is modified in place. How it works: this uses the machinery for \ref apop_model_fix_params. The only difference is that this searches over the data space and takes the parameter space as fixed, while basic fix params model searches parameters and takes data as fixed. So this function just does the necessary data-parameter switching to make that happen. \param d The data set. It comes in with NaNs and leaves entirely filled in. \param mvn A parametrized \ref apop_model from which you expect the data was derived. if \c NULL, then I'll use the Multivariate Normal that best fits the data after listwise deletion. \return An estimated \ref apop_model. 
Also, the data input will be filled in and ready to use. */ apop_model * apop_ml_impute(apop_data *d, apop_model* mvn){ if (!mvn){ apop_data *list_d = apop_data_listwise_delete(d); Apop_stopif(!list_d, return NULL, 0, "Listwise deletion returned no whole rows, " "so I couldn't fit a Multivariate Normal to your data. " "Please provide a pre-estimated initial model."); mvn = apop_estimate(list_d, apop_multivariate_normal); apop_data_free(list_d); } apop_model *impute_me = apop_model_copy(apop_swap_model); impute_me->parameters = d; impute_me->more = mvn; apop_model *fixed = apop_model_fix_params(impute_me); Apop_model_add_group(fixed, apop_parts_wanted); apop_model *m = apop_estimate(mvn->parameters, fixed); apop_model_free(fixed); return m; } apophenia-1.0+ds/apop_mle.c000066400000000000000000001224141262736346100156760ustar00rootroot00000000000000 /** \file apop_mle.c */ /*Copyright (c) 2006--2010 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include #include #include #include #include #include #include typedef long double (*apop_fn_with_params) (apop_data *, apop_model *); typedef void (*apop_df_with_void)(const gsl_vector *beta, void *d, gsl_vector *gradient); typedef void (*apop_fdf_with_void)(const gsl_vector *beta, void *d, double *f, gsl_vector *df); /** \cond doxy_ignore */ typedef struct { gsl_vector *beta; int dimension; } grad_params; typedef struct { apop_model *model; apop_data *data; apop_fn_with_params *f; grad_params *gp; //Used only by apop_internal_numerical_gradient. gsl_vector *beta, *starting_pt; int use_constraint; double best_ll; char want_cov, want_predicted, want_tests, want_info; jmp_buf bad_eval_jump; apop_data** path; } infostruct; /** \endcond */ //End of Doxygen ignore. static apop_model * find_roots (infostruct p); //see end of file. //as a macro, we can put it in documentation /* Generate support fns (esp. initializers) for apop_mle_settings and apop_parts_wanted structs. 
*/ Apop_settings_copy(apop_parts_wanted, ) Apop_settings_free(apop_parts_wanted, ) Apop_settings_init(apop_parts_wanted, Apop_varad_set(covariance, 'n'); Apop_varad_set(predicted, 'n'); Apop_varad_set(tests, 'n'); Apop_varad_set(info, 'n'); ) Apop_settings_copy(apop_mle, ) Apop_settings_free(apop_mle, ) Apop_settings_init(apop_mle, Apop_varad_set(starting_pt, NULL); Apop_varad_set(tolerance, 1e-5); Apop_varad_set(max_iterations, 5000); Apop_varad_set(method, "");//default picked in apop_maximum_likelihood Apop_varad_set(verbose, 0); Apop_varad_set(step_size, 0.05); Apop_varad_set(delta, 1e-3); Apop_varad_set(dim_cycle_tolerance, 0); //siman: //siman also uses step_size = 1.; Apop_varad_set(n_tries, 5); //The number of points to try for each step. Apop_varad_set(iters_fixed_T, 5); //The number of iterations at each temperature. Apop_varad_set(k, 1.0); //The maximum step size in the random walk. Apop_varad_set(t_initial, 50); //cooling schedule data Apop_varad_set(mu_t, 1.002); Apop_varad_set(t_min, 5.0e-1); Apop_varad_set(rng, NULL); ) // MLE support functions //Including numerical differentiation and a couple of functions to //negate the likelihood fns without bothering the user. static void apop_annealing(infostruct*); //below. static double one_d(double b, void *in){ infostruct *i = in; long double penalty = 0; gsl_vector_set(i->gp->beta, i->gp->dimension, b); apop_data_unpack(i->gp->beta, i->model->parameters); if (i->model->constraint) penalty = i->model->constraint(i->data, i->model); return (*(i->f))(i->data, i->model) + penalty; } //Numeric first and second derivatives. /* For each element of the parameter set, jiggle it to find its gradient. Return a vector as long as the parameter list. 
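
A rough, self-contained sketch of the jiggling idea, using a plain two-point central difference. This is a simplification for illustration: the function below hands each dimension to the GSL's gsl_deriv_central, which is more careful about step size and error. The names objective_fn, central_gradient, and example_objective are hypothetical and not part of Apophenia.

```c
#include <stddef.h>

// Hypothetical objective type: a function of an n-dimensional point.
typedef double (*objective_fn)(double *x, size_t n);

// For each dimension j, set out[j] to the difference between f at x[j]+delta
// and f at x[j]-delta, divided by twice delta.
// x is perturbed in place and restored before moving on.
void central_gradient(objective_fn f, double *x, size_t n, double delta, double *out){
    for (size_t j = 0; j < n; j++){
        double keep = x[j];
        x[j] = keep + delta;
        double up = f(x, n);
        x[j] = keep - delta;
        double down = f(x, n);
        x[j] = keep;                      // restore this coordinate
        out[j] = (up - down) / (2 * delta);
    }
}

// Example objective: f(x) = x0^2 + 3 x1, whose gradient is (2 x0, 3).
double example_objective(double *x, size_t n){
    (void) n;
    return x[0] * x[0] + 3 * x[1];
}
```

The output has one entry per parameter, matching the "vector as long as the parameter list" described here.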
*/
static void apop_internal_numerical_gradient(apop_fn_with_params ll, infostruct* info, gsl_vector *out, double delta){
    double result, err;
    gsl_vector *beta = apop_data_pack(info->model->parameters);
    infostruct i = *info;
    i.f = &ll;
    i.gp = &(grad_params){ .beta = gsl_vector_alloc(beta->size)};
    gsl_function F = { .function= one_d, .params = &i };
    for (size_t j=0; j< beta->size; j++){
        i.gp->dimension = j;
        gsl_vector_memcpy(i.gp->beta, beta);
        gsl_deriv_central(&F, gsl_vector_get(beta,j), delta, &result, &err);
        gsl_vector_set(out, j, result);
    }
    gsl_vector_free(beta);
}

/** A wrapper around the GSL's one-dimensional \c gsl_deriv_central to find a numeric differential for each dimension of the input \ref apop_model's log likelihood (or \c p if \c log_likelihood is \c NULL).

\param data The \ref apop_data set to use for all evaluations.
\param model The \ref apop_model, expressing the function whose derivative is sought. The gradient is taken via small changes along the model parameters.
\param delta The size of the differential. (default: 1e-3, but see below)

\code
gsl_vector *gradient = apop_numerical_gradient(data, your_parametrized_model);
\endcode

\li If you do not set \ref delta as an input, I first look for an \ref apop_mle_settings group attached to the input model, and check that for a \c delta element. If that is also missing, use the default of 1e-3.
\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
gsl_vector * apop_numerical_gradient(apop_data *data, apop_model *model, double delta){
#else
apop_varad_head(gsl_vector *, apop_numerical_gradient){
    apop_data * apop_varad_var(data, NULL);
    apop_model * apop_varad_var(model, NULL);
    Nullcheck(model, NULL)
    Nullcheck_p(model, NULL)
    double apop_varad_var(delta, 0);
    if (!delta){
        apop_mle_settings *mp = apop_settings_get_group(model, apop_mle);
        delta = mp ? 
mp->delta : 1e-3; } return apop_numerical_gradient_base(data, model, delta); } gsl_vector * apop_numerical_gradient_base(apop_data *data, apop_model *model, double delta){ #endif Get_vmsizes(model->parameters); //tsize apop_fn_with_params ll = model->log_likelihood ? model->log_likelihood : model->p; Apop_stopif(!ll, return NULL, 0, "Input model has neither p nor log_likelihood method. Returning NULL."); gsl_vector *out = gsl_vector_calloc(tsize); infostruct i = (infostruct) {.model = model, .data = data}; apop_internal_numerical_gradient(ll, &i, out, delta); return out; } /** \cond doxy_ignore */ typedef struct { apop_model *base_model; int *current_index; } apop_model_for_infomatrix_struct; /** \endcond */ static long double apop_fn_for_infomatrix(apop_data *d, apop_model *m){ static threadlocal gsl_vector *v = NULL; apop_model_for_infomatrix_struct *settings = m->more; apop_model *mm = settings->base_model; apop_score_type ms = apop_score_vtable_get(mm); if (ms){ Get_vmsizes(mm->parameters); //tsize if (!v || v->size != tsize){ if (v) gsl_vector_free(v); v = gsl_vector_alloc(tsize); } ms(d, v, mm); return gsl_vector_get(v, *settings->current_index); } //else: gsl_vector *vv = apop_numerical_gradient(d, mm); double out = gsl_vector_get(vv, *settings->current_index); gsl_vector_free(vv); return out; } apop_model *apop_model_for_infomatrix = &(apop_model){"Ad hoc model for working out the information matrix.", .log_likelihood = apop_fn_for_infomatrix}; /** Numerically estimate the matrix of second derivatives of the parameter values, via a series of re-evaluations at small differential steps. [Therefore, it may be expensive to do this for a very computationally-intensive model.] \param data The \ref apop_data at which the model was estimated (default: \c NULL) \param model The \ref apop_model, with parameters already estimated (no default, must not be \c NULL) \param delta the step size for the differentials. 
(default: 1e-3, but see below) \return The matrix of estimated second derivatives at the given data and parameter values. \li If you do not set \ref delta as an input, I first look for an \ref apop_mle_settings group attached to the input model, and check that for a \c delta element. If that is also missing, use the default of 1e-3. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC apop_data * apop_model_hessian(apop_data * data, apop_model *model, double delta){ #else apop_varad_head(apop_data *, apop_model_hessian){ apop_data * apop_varad_var(data, NULL); apop_model * apop_varad_var(model, NULL); Nullcheck(model, NULL) double apop_varad_var(delta, 0); if (!delta){ apop_mle_settings *mp = apop_settings_get_group(model, apop_mle); delta = mp ? mp->delta : 1e-3; } return apop_model_hessian_base(data, model, delta); } apop_data * apop_model_hessian_base(apop_data * data, apop_model *model, double delta){ #endif int k; Get_vmsizes(model->parameters) //tsize size_t betasize = tsize; apop_data *out = apop_data_calloc(0, betasize, betasize); gsl_vector *dscore = gsl_vector_alloc(betasize); apop_model_for_infomatrix_struct ms = { .base_model = model, .current_index = &k, }; apop_model *m = apop_model_copy(apop_model_for_infomatrix); m->parameters = model->parameters; m->more = &ms; if (apop_settings_get_group(model, apop_mle)) apop_settings_copy_group(m, model, "apop_mle"); for (k=0; k< betasize; k++){ dscore = apop_numerical_gradient(data, m, delta); //We get two estimates of the (k,j)th element, which are often very close, //and take the mean. 
        for (size_t j=0; j< betasize; j++){
            *gsl_matrix_ptr(out->matrix, k, j) += gsl_vector_get(dscore, j)/2;
            *gsl_matrix_ptr(out->matrix, j, k) += gsl_vector_get(dscore, j)/2;
        }
        gsl_vector_free(dscore);
    }
    if (model->parameters->names->row){
        apop_name_stack(out->names, model->parameters->names, 'r');
        apop_name_stack(out->names, model->parameters->names, 'c', 'r');
    }
    return out;
}

/** Produce the covariance matrix for the parameters of an estimated model via the derivative of the score function at the parameter. I.e., I find the second derivative via \ref apop_model_hessian, and take the negation of the inverse.

I follow Efron and Hinkley in using the estimated information matrix---the value of the information matrix at the estimated value of the score---not the expected information matrix that is the integral over all possible data. See Pawitan 2001 (who cribbed a little off of Efron and Hinkley) or Klemens 2008 (who directly cribbed off of both) for further details.

\param data The data by which your model was estimated
\param model A model whose parameters have been estimated.
\param delta The differential by which to step for sampling changes. (default: 1e-3, but see below)
\return A covariance matrix for the parameters. Also, if the model's \c parameters do not have a "<Covariance>" page, I'll add the result as that page [i.e., I won't overwrite an existing covariance page].

\li If you do not set \ref delta as an input, I first look for an \ref apop_mle_settings group attached to the input model, and check that for a \c delta element. If that is also missing, use the default of 1e-3.
\li This function uses the \ref designated syntax for inputs.
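
A small worked sketch of the "negation of the inverse" step for a two-parameter model, assuming the Hessian of the log likelihood is already known. The helper \c neg_inverse_2x2 is hypothetical and not part of Apophenia, which inverts matrices of any size via \ref apop_matrix_inverse. At a maximum the Hessian is negative definite, so the negated inverse has positive variances on its diagonal.

```c
// Hypothetical helper, not part of Apophenia: set out to the negated inverse
// of the 2x2 matrix h. Returns 0 on success, -1 if h is singular.
int neg_inverse_2x2(double h[2][2], double out[2][2]){
    double det = h[0][0] * h[1][1] - h[0][1] * h[1][0];
    if (det == 0) return -1;
    // The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det; flip every sign.
    out[0][0] = -h[1][1] / det;
    out[0][1] =  h[0][1] / det;
    out[1][0] =  h[1][0] / det;
    out[1][1] = -h[0][0] / det;
    return 0;
}
```

For example, a diagonal Hessian with entries -4 and -2 gives variances 0.25 and 0.5.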
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_model_numerical_covariance(apop_data * data, apop_model *model, double delta){
#else
apop_varad_head(apop_data *, apop_model_numerical_covariance){
    apop_data * apop_varad_var(data, NULL);
    apop_model * apop_varad_var(model, NULL);
    Nullcheck(model, NULL)
    double apop_varad_var(delta, 0);
    if (!delta){
        apop_mle_settings *mp = apop_settings_get_group(model, apop_mle);
        delta = mp ? mp->delta : 1e-3;
    }
    return apop_model_numerical_covariance_base(data, model, delta);
}

 apop_data * apop_model_numerical_covariance_base(apop_data * data, apop_model *model, double delta){
#endif
    apop_data *hessian = apop_model_hessian(data, model, delta);
    if (apop_opts.verbose > 1){
        printf("The estimated Hessian:\n");
        apop_data_show(hessian);
    }
    apop_data *out = apop_data_alloc();
    out->matrix = apop_matrix_inverse(hessian->matrix);
    gsl_matrix_scale(out->matrix, -1);
    if (hessian->names->row){
        apop_name_stack(out->names, hessian->names, 'r');
        apop_name_stack(out->names, hessian->names, 'c');
    }
    apop_data_free(hessian);
    if (!apop_data_get_page(model->parameters, "<Covariance>"))
        apop_data_add_page(model->parameters, out, "<Covariance>");
    return out;
}

///On to the interfaces between the models and the methods

static void tracepath(const gsl_vector *beta, double value, apop_data **path){
    size_t msize1 = (*path && (*path)->matrix) ? 
(*path)->matrix->size1: 0; if (!*path) { *path = apop_data_alloc(); (*path)->names->title = strdup("Path of ML search"); apop_name_add((*path)->names, "f(x)", 'v'); apop_name_add((*path)->names, "x", 'm'); } (*path)->matrix = apop_matrix_realloc((*path)->matrix, msize1+1, beta->size); gsl_vector_memcpy(Apop_rv(*path, msize1), beta); (*path)->vector = apop_vector_realloc((*path)->vector, msize1+1); gsl_vector_set((*path)->vector, msize1, value); } /* Every actual evaluation of the function go through the negshell and dnegshell fns, because there are several things that have to be done beyond just getting model.log_likelihood: --Negate, because statisticians and social scientists like to maximize; physicists like to minimize. --Work out if the model provides log_likelihood or p. --Call \ref trace_path if needed. --Go from a single vector to a full apop_data set and back (via apop_data_pack/unpack) --Check the derivative function if available. --Check constraints. */ static double negshell (const gsl_vector *beta, void * in){ infostruct *i = in; double penalty = 0, out = 0; long double (*f)(apop_data *, apop_model *); f = i->model->log_likelihood? i->model->log_likelihood : i->model->p; Apop_stopif(!f, longjmp(i->bad_eval_jump, -1), 0, "The model you sent to the MLE function has neither log_likelihood element nor p element."); apop_data_unpack(beta, i->model->parameters); if (i->use_constraint && i->model->constraint) penalty = i->model->constraint(i->data, i->model); if (penalty) apop_data_pack(i->model->parameters, (gsl_vector*) beta); double f_val = f(i->data, i->model); out = penalty - f_val; //negative llikelihood Apop_stopif(gsl_isnan(out), longjmp(i->bad_eval_jump, -1), 0, "I got a NaN in evaluating the objective function.%s", !i->model->constraint ? " Maybe add a constraint to your model?" 
: ""); if (i->path) tracepath(i->model->parameters->vector, -out, i->path); if (i->want_info =='y'){ //I report the log likelihood under the assumption that the final param set //matches the best ll evaluated. long double this_ll = i->model->log_likelihood? -out : log(-out); //negative negative llikelihood. if(gsl_isnan(this_ll)){ Apop_stopif(!i->model->log_likelihood && penalty > f_val, i->want_info='n', 1, "Model's p=%g, penalty=%g, for a negative adjusted p=%g. " "Continuing, but can not report covariance or other " "log likelihood-based statistics.", f_val, penalty, f_val-penalty); Apop_stopif(1, apop_data_show(i->model->parameters); i->want_info='n', 1, "NaN resulted from the following value tried by the maximum likelihood system."); } i->best_ll = GSL_MAX(i->best_ll, this_ll); } return out; } static int dnegshell (const gsl_vector *beta, void * in, gsl_vector * g){ /* The derivative-calculating routine. If the constraint binds then: take the numerical derivative of negshell, which will be the numerical derivative of the penalty. else: just find dlog_likelihood. If the model doesn't have a dlog likelihood or the user asked to ignore it, then the main maximum likelihood fn replaced model.score with apop_numerical_gradient anyway. Finally, reverse the sign, since the GSL is trying to minimize instead of maximize. */ infostruct *i = in; apop_mle_settings *mp = apop_settings_get_group(i->model, apop_mle); apop_data_unpack(beta, i->model->parameters); /* In all cases, negshell gets called first, so the constraint is already checked and beta nudged accordingly. if(i->model->constraint && i->model->constraint(i->data, i->model)) apop_data_pack(i->model->parameters, (gsl_vector *) beta); */ apop_score_type ms = apop_score_vtable_get(i->model); if (ms) ms(i->data, g, i->model); else { apop_fn_with_params ll = i->model->log_likelihood ? 
i->model->log_likelihood : i->model->p; apop_internal_numerical_gradient(ll, i, g, mp->delta); } if (mp->path) negshell (beta, in); gsl_vector_scale(g, -1); return GSL_SUCCESS; } //This is just to satisfy the GSL's format. static void fdf_shell(const gsl_vector *beta, void *i, double *f, gsl_vector *df){ *f = negshell(beta, i); dnegshell(beta, i, df); } static int ctrl_c; static void mle_sigint(){ ctrl_c ++; } static int setup_starting_point(apop_mle_settings *mp, gsl_vector *x){ Apop_stopif(!x, return -1, 0, "The vector I'm trying to optimize over is NULL."); if (!mp->starting_pt) gsl_vector_set_all (x, 1); else for (int i=0; i< x->size; i++) x->data[i] = mp->starting_pt[i]; return 0; } void add_info_criteria(apop_data *d, apop_model *m, apop_model *est, double ll, int param_ct){ //Did the sending function save last value of f()? if (!ll) ll = apop_log_likelihood(d, m); if (!est->info) est->info = apop_data_alloc(); apop_data_add_named_elmt(est->info, "log likelihood", ll); double AIC = 2*param_ct - 2 *ll; apop_data_add_named_elmt(est->info, "AIC", AIC); if (d){//some models have NULL data. int n; {Get_vmsizes(d); n = maxsize;} apop_data_add_named_elmt(est->info, "AIC_c", AIC + 2*param_ct *(param_ct + 1.0)/(n - param_ct - 1.0)); Get_vmsizes(d); //vsize, msize1, tsize apop_data_add_named_elmt(est->info, "BIC", param_ct * log(n) - 2 *ll); } } static void auxinfo(apop_data *params, infostruct *i, int status, double ll){ apop_model *est = i->model; //just an alias. 
    /* This catches too many near-misses
    if(est->constraint)
        apop_assert(!est->constraint(i->data, est), "the maximum likelihood search ended "
                    "at a point that doesn't satisfy the model's constraints.");*/
    if (i->want_cov=='y' && est->parameters){
        apop_model_numerical_covariance(i->data, est, Apop_settings_get(est,apop_mle,delta));
        if (i->want_tests=='y')
            apop_estimate_parameter_tests (est);
    }
    if (!est->info) est->info = apop_data_alloc();
    apop_data_add_named_elmt(est->info, "status", status);
    if (i->want_info=='y') add_info_criteria(i->data, i->model, est, ll, i->beta->size);
}

static void apop_maximum_likelihood_w_d(apop_data * data, infostruct *i){
/* The maximum likelihood calculations, given a derivative of the log likelihood.

If no derivative exists, will calculate a numerical gradient.

Inside the infostruct, you'll find these elements:

\param data the data matrix
\param dist the \ref apop_model object: probit, zipf, &c.
\param starting_pt an array of doubles suggesting a starting point. If NULL, use a vector whose elements are all 1 (zero has too many pathological cases).
\param step_size the initial step size.
\param tolerance the precision the minimizer uses. Only vaguely related to the precision of the actual var.
\return an \ref apop_model with the parameter estimates, &c. If returned_estimate->status == 0, then optimum parameters were found; if status != 0, then there were problems.
*/
    gsl_multimin_fdfminimizer *s;
    apop_model *est = i->model; //just an alias. 
apop_mle_settings *mp = apop_settings_get_group(est, apop_mle); int iter = 0, status = 0, apopstatus = 0, betasize= i->beta->size; if (!strcasecmp(mp->method, "BFGS cg")) s = gsl_multimin_fdfminimizer_alloc(gsl_multimin_fdfminimizer_vector_bfgs2, betasize); else if (!strcasecmp(mp->method, "PR cg")) s = gsl_multimin_fdfminimizer_alloc(gsl_multimin_fdfminimizer_conjugate_pr, betasize); else //Default: "FR CG" conjugate gradient (Fletcher-Reeves) s = gsl_multimin_fdfminimizer_alloc(gsl_multimin_fdfminimizer_conjugate_fr, betasize); gsl_multimin_function_fdf minme = { .f = negshell, .df = (apop_df_with_void) dnegshell, .fdf = (apop_fdf_with_void) fdf_shell, .n = betasize, .params = i}; ctrl_c = 0; if (setjmp(i->bad_eval_jump)) Apop_stopif(1, return, 0, "Failure evaluating likelihood at the starting point. Add a starting point?"); gsl_multimin_fdfminimizer_set (s, &minme, i->beta, mp->step_size, mp->tolerance); signal(SIGINT, mle_sigint); do { iter++; if (setjmp(i->bad_eval_jump)) { apopstatus = -1; break; } status = gsl_multimin_fdfminimizer_iterate(s); if(status && status!=GSL_CONTINUE) break; //commented out error msg because too many GSL_ENOPROG false positives. //Apop_stopif(status && status!=GSL_CONTINUE, break, 0, "GSL error: %s", gsl_strerror(status)); status = gsl_multimin_test_gradient(s->gradient, mp->tolerance); if(status && status!=GSL_CONTINUE) break; //commented out error msg because too many GSL_ENOPROG false positives. 
        //Apop_stopif(status && status!=GSL_CONTINUE, break, 0, "GSL error: %s", gsl_strerror(status));
        if (mp->verbose)
            printf ("%5i %.5f f()=%10.5f gradient=%.3f\n", iter, gsl_vector_get (s->x, 0), s->f, gsl_vector_get(s->gradient,0));
        Apop_stopif(status == GSL_SUCCESS, apopstatus=0, 2, "Optimum found.");
    } while (status == GSL_CONTINUE && iter < mp->max_iterations && !ctrl_c);
    signal(SIGINT, NULL);
    Apop_stopif(iter==mp->max_iterations, apopstatus = -1, 1, "Max iterations reached, implying that I did not find an optimum.");
    //Clean up, copy results to output estimate.
    apop_data_unpack(s->x, est->parameters);
    gsl_multimin_fdfminimizer_free(s);
    gsl_vector_free(i->beta);
    auxinfo(est->parameters, i, apopstatus, i->best_ll);
}

/* See apop_maximum_likelihood_w_d for notes. */
static void apop_maximum_likelihood_no_d(apop_data * data, infostruct * i){
    apop_model *est = i->model;
    apop_mle_settings *mp = apop_settings_get_group(est, apop_mle);
    int status=0, apopstatus = 0, iter = 0, betasize= i->beta->size;
    gsl_multimin_fminimizer *s;
    gsl_vector *ss;
    double size;
    s = gsl_multimin_fminimizer_alloc(gsl_multimin_fminimizer_nmsimplex, betasize);
    ss = gsl_vector_alloc(betasize);
    ctrl_c = 0;
    apopstatus = -1; //assume failure until we score a success.
    gsl_vector_set_all (ss, mp->step_size);
    gsl_multimin_function minme = {.f = negshell, .n= betasize, .params = i};
    if (setjmp(i->bad_eval_jump))
        Apop_stopif(1, return, 0, "Failure evaluating likelihood at the starting point.
Add a starting point?"); gsl_multimin_fminimizer_set (s, &minme, i->beta, ss); //i->beta = s->x; signal(SIGINT, mle_sigint); do { iter++; if (setjmp(i->bad_eval_jump)) { apopstatus = -1; break; } status = gsl_multimin_fminimizer_iterate(s); if (status) break; size = gsl_multimin_fminimizer_size(s); status = gsl_multimin_test_size (size, mp->tolerance); if(mp->verbose){ printf ("%5d ", iter); for (size_t j = 0; j < betasize; j++) printf ("%8.3e ", gsl_vector_get (s->x, j)); printf ("f()=%7.3f size=%.3f\n", s->fval, size); if (status == GSL_SUCCESS) { printf ("Optimum found at:\n"); printf ("%5d ", iter); for (size_t j = 0; j < betasize; j++) printf ("%8.3e ", gsl_vector_get (s->x, j)); printf ("f()=%7.3f size=%.3f\n", s->fval, size); } } } while (status == GSL_CONTINUE && iter < mp->max_iterations && !ctrl_c); signal(SIGINT, NULL); Apop_stopif(iter == mp->max_iterations && mp->verbose, /*continue*/, 1, "Optimization reached maximum number of iterations."); if (status == GSL_SUCCESS) apopstatus = 0; apop_data_unpack(s->x, est->parameters); gsl_multimin_fminimizer_free(s); auxinfo(est->parameters, i, apopstatus, i->best_ll); } /*There is a basically standard location for the log likelihood. Search there, and if you don't find it, then recalculate it.*/ static double get_ll(apop_data *d, apop_model *est){ int index = est->info ? apop_name_find(est->info->names, "log likelihood", 'r') : -2; if (index>-2) return apop_data_get(est->info, index); //last resort: recalculate return apop_log_likelihood(d, est); } static void dim_cycle(apop_data *d, apop_model *est, infostruct info){ double last_ll, this_ll = GSL_NEGINF; int iteration = 0; apop_mle_settings *mp = Apop_settings_get_group(est, apop_mle); double tol = mp->dim_cycle_tolerance; int betasize = info.beta->size; Apop_settings_set(est, apop_mle, dim_cycle_tolerance, 0);//so sub-estimations won't use this function. 
gsl_vector *paramv = apop_data_pack(est->parameters); apop_model *full_est = NULL; //an alias do { if (mp->verbose){ if (!(iteration++)) printf("Cycling toward an optimum. Listing (dim):log likelihood.\n"); printf("Iteration %i:\n", iteration); } last_ll = this_ll; for (int i=0; i< betasize; i++){ gsl_vector_set(info.beta, i, GSL_NAN); apop_data_unpack(info.beta, est->parameters); apop_model *m_onedim = apop_model_fix_params(est); apop_prep(d, m_onedim); apop_maximum_likelihood(d, m_onedim); gsl_vector_set(info.beta, i, m_onedim->parameters->vector->data[0]); full_est = apop_model_fix_params_get_base(m_onedim);//points to est, but filled. this_ll = get_ll(d, full_est);//only used on the last iteration. if (mp->verbose) printf("(%i):%g\t", i, this_ll), fflush(NULL); apop_model_free(m_onedim); } if (mp->verbose) printf("\n"); apop_data_pack(full_est->parameters, paramv); Apop_settings_add(est, apop_mle, starting_pt, paramv->data); } while (fabs(this_ll - last_ll) > tol); Apop_settings_set(est, apop_mle, dim_cycle_tolerance, tol); gsl_vector_free(paramv); } void get_desires(apop_model *m, infostruct *info){ apop_parts_wanted_settings *want = apop_settings_get_group(m, apop_parts_wanted); info->want_tests = (want && want->tests =='y') ? 'y' : 'n'; info->want_cov = (info->want_tests=='y' || (want && want->covariance =='y')) ? 'y' : 'n'; info->want_info = want ? (want->info =='y' ? 'y' : 'n') : 'y'; //doesn't do anything at the moment. info->want_predicted = (want && want->predicted =='y') ? 'y' : 'n'; } int check_method (char *m){ #define Onecheck(str) if (!strcasecmp(m, #str)) return 0; if(!m || strlen(m)==0) return 0; Onecheck(NM simplex) Onecheck(FR cg) Onecheck(BFGS cg) Onecheck(PR cg) Onecheck(Annealing) Onecheck(Newton) Onecheck(Newton hybrid) Onecheck(Newton hybrid no scale) return 1; } /** Find the likelihood-maximizing parameters of a model given data. \li I assume that \ref apop_prep has been called on your model. 
The easiest way to guarantee this is to use \ref apop_estimate, which calls this function if the input model has no \c estimate method. \li All of the settings are specified by adding a \ref apop_mle_settings struct to your model, so see the many notes there. Notably, the default method is the Fletcher-Reeves conjugate gradient method, and if your model does not have a dlog likelihood function, then a numeric gradient will be calculated via \ref apop_numerical_gradient. Add an \ref apop_mle_settings group to your model to set tuning parameters or select other methods, including the Nelder-Mead simplex, simulated annealing, and root-finding. \param data An \ref apop_data set. \param dist The \ref apop_model object: \ref apop_gamma, \ref apop_probit, \ref apop_zipf, &c. You can add an \c apop_mle_settings struct to it (Apop_model_add_group(your_model, apop_mle, .verbose=1, .method="PR cg", and_so_on)). \return None, but the input model is modified to include the parameter estimates, &c. \li There is auxiliary info in the ->info element of the post-estimation struct. Get elements via, e.g.: \code apop_model *est = apop_estimate(your_data, apop_probit); int status = apop_data_get(est->info, .rowname="status"); if (status) //trouble else //optimum found apop_data_print(est->parameters); //Here are the estimated parameters \endcode \li During the search for an optimum, ctrl-C (SIGINT) will halt the search, and the function will return whatever parameters the search was on at the time. */ void apop_maximum_likelihood(apop_data * data, apop_model *dist){ apop_mle_settings *mp = apop_settings_get_group(dist, apop_mle); if (!mp) mp = Apop_model_add_group(dist, apop_mle); apop_score_type ms = apop_score_vtable_get(dist); Apop_stopif(check_method(mp->method), return, 0, "You set the method='%s', " "which is not on my list of allowable methods. 
See the apop_mle_settings "
            "documentation for the list of options", mp->method);
    if (!mp->method || !strlen(mp->method)) mp->method = ms ? "FR cg" : "NM simplex";
    Apop_stopif(!dist->parameters, dist->error='p'; return, 0, "Not enough information to allocate parameters over which to optimize. If this was not called from apop_estimate, did you call apop_prep first?");
    infostruct info = {.data = data, .use_constraint = 1, .path = mp->path, .model = dist};
    get_desires(dist, &info);
    info.beta = apop_data_pack(dist->parameters);
    if (setup_starting_point(mp, info.beta)) return;
    info.model->data = data;
    if (mp->dim_cycle_tolerance)
        dim_cycle(data, dist, info);
    else if (!strcasecmp(mp->method, "annealing"))
        apop_annealing(&info); //below.
    else if (!strcasecmp(mp->method, "NM simplex"))
        apop_maximum_likelihood_no_d(data, &info);
    else if (!strcasecmp(mp->method, "Newton") || !strcasecmp(mp->method, "Newton hybrid")|| !strcasecmp(mp->method, "Newton hybrid no scale"))
        find_roots (info);
    else /* Conjugate Gradient*/
        apop_maximum_likelihood_w_d(data, &info);
}

/** Maximum likelihood searches are not guaranteed to find a global optimum, and it can be difficult to tune a search such that it covers a wide space but also accurately homes in on the optimum. In both cases, one could restart the search using a different starting point or different parameters.

The simplest use of this function is to restart a model at the latest parameter estimates.

\code
apop_model *m = apop_estimate(data, model_using_an_MLE_search);
for (int i=0; i< 10; i++)
    m = apop_estimate_restart(m);
apop_model_print(m);
\endcode

By adding a line to reduce the tolerance each round [e.g., Apop_settings_set(m, apop_mle, tolerance, pow(10,-i))], you can start broad and home in on a precise optimum.

You may have a new estimation method, such as first doing a coarse simulated annealing search, then a fine conjugate gradient search.
When reading this example, recall that the form for adding a new settings group differs from the form for modifying existing settings:

\code
Apop_model_add_group(your_base_model, apop_mle, .method="Annealing");
apop_model *m = apop_estimate(data, your_base_model);
Apop_settings_set(m, apop_mle, method, "PR cg");
m = apop_estimate_restart(m);
apop_model_print(m);
\endcode

Only one estimate is returned, either the one you sent in or a new one. The loser (which may be the one you sent in) is freed, to prevent memory leaks.

\param e An \ref apop_model that is the output from a prior MLE estimation. (No default, must not be \c NULL.)
\param copy Another not-yet-parametrized model that will be re-estimated with (1) the same data and (2) a starting_pt as per the next setting (by default, the parameters of \c e). If this is \c NULL, then copy \c e. (Default = \c NULL)
\param starting_pt "ep"=last estimate of the first model (i.e., its current parameter estimates)
"es"= starting point originally used by the first model
"np"=current parameters of the new (second) model
"ns"=starting point specified by the new model's MLE settings. (default = "ep")
\param boundary I test whether the starting point you give me has magnitude greater than this bound, so I can warn you if there's divergence in your sequence of re-estimations. (default: 1e8)
\return If the new estimated parameters include any NaNs/Infs, then the old estimate is returned (even if the old estimate included NaNs/Infs). Otherwise, the estimate with the largest log likelihood is returned.

\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
apop_model * apop_estimate_restart(apop_model *e, apop_model *copy, char * starting_pt, double boundary){
#else
apop_varad_head(apop_model *, apop_estimate_restart){
    apop_model * apop_varad_var(e, NULL);
    Nullcheck_m(e, NULL);
    apop_model * apop_varad_var(copy, NULL);
    char * apop_varad_var(starting_pt, "ep");
    double apop_varad_var(boundary, 1e8);
    return apop_estimate_restart_base(e, copy, starting_pt, boundary);
}

 apop_model * apop_estimate_restart_base(apop_model *e, apop_model *copy, char * starting_pt, double boundary){
#endif
    gsl_vector *v = NULL;
    if (!copy) copy = apop_model_copy(e);
    apop_mle_settings* prm0 = apop_settings_get_group(e, apop_mle);
    apop_mle_settings* prm = apop_settings_get_group(copy, apop_mle);
    //sizeof(a pointer)/sizeof(double) is not an array length, so take the
    //length of the starting-point arrays from the packed parameters of e.
    size_t start_size = 0;
    gsl_vector *packed = e->parameters ? apop_data_pack(e->parameters) : NULL;
    if (packed){
        start_size = packed->size;
        gsl_vector_free(packed);
    }
    //copy off the old params; modify the starting pt, method, and scale
    if (!strcmp(starting_pt, "es") && start_size)
        v = apop_array_to_vector(prm0->starting_pt, start_size);
    else if (!strcmp(starting_pt, "ns") && start_size){
        v = apop_array_to_vector(prm->starting_pt, start_size);
        prm0->starting_pt = malloc(sizeof(double)*start_size);
        memcpy(prm0->starting_pt, prm->starting_pt, sizeof(double)*start_size);
    } else if (!strcmp(starting_pt, "np")){
        v = apop_data_pack(copy->parameters);
        prm->starting_pt = malloc(sizeof(double)*v->size);
        memcpy(prm->starting_pt, v->data, sizeof(double)*v->size);
    } else if (e->parameters){//"ep" or default.
v = apop_data_pack(e->parameters); prm->starting_pt = malloc(sizeof(double)*v->size); memcpy(prm->starting_pt, v->data, sizeof(double)*v->size); } Apop_stopif(!apop_vector_bounded(v, boundary), return e, 0, "Your model has diverged (element(s) > %g);" " returning your original model without restarting.", boundary); gsl_vector_free(v); apop_model *newcopy = apop_estimate(e->data, copy); apop_model_free(copy); //Now check whether the new output is better than the old if (apop_vector_bounded(newcopy->parameters->vector, boundary) && get_ll(e->data, newcopy) > get_ll(e->data, e)){ apop_model_free(e); return newcopy; } //else: apop_model_free(newcopy); return e; } // Simulated Annealing. static double annealing_energy(void *in) { infostruct *i = in; return negshell(i->beta, i); } static double annealing_distance(void *xin, void *yin) { /** We use the Manhattan metric to correspond to the annealing_step fn below. */ gsl_vector *from = apop_vector_copy(((infostruct*)xin)->beta); gsl_vector *to = apop_vector_copy(((infostruct*)yin)->beta); gsl_vector_div(from, ((infostruct*)xin)->starting_pt); gsl_vector_div(to, ((infostruct*)xin)->starting_pt);//starting pts are the same. return apop_vector_distance(from, to, .metric='m'); } static void annealing_check_constraint(infostruct *i){ apop_data_unpack(i->beta, i->model->parameters); if (i->model->constraint && i->model->constraint(i->data, i->model)) apop_data_pack(i->model->parameters, i->beta); } static void annealing_step(const gsl_rng * r, void *in, double step_size){ /** The algorithm: --randomly pick dimension --shift by some amount of remaining step size --repeat for all dims This will give a move \f$\leq\f$ step_size on the Manhattan metric. 
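A minimal, self-contained sketch of this idea, not the code used in this function: if the cutpoints are sorted before differencing, every piece is nonnegative and the pieces sum to one, so the move has Manhattan length step_size. The function below draws its cutpoints unsorted, so individual pieces there can be negative. The names manhattan_step and cmp_double are hypothetical, rand() stands in for the GSL generator, and the per-dimension scale is taken to be one.

```c
#include <assert.h>
#include <stdlib.h>

//Comparison function for qsort over doubles.
static int cmp_double(const void *a, const void *b){
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

//Hypothetical helper: shift each of the n coordinates of x by a random share
//of step_size, with a random sign. Returns the Manhattan length of the move.
double manhattan_step(double *x, size_t n, double step_size){
    assert(x && n > 0 && step_size >= 0);
    double *cut = malloc(sizeof(double)*(n+1));
    assert(cut);
    cut[0] = 0;
    cut[n] = 1;
    for (size_t j=1; j< n; j++) cut[j] = rand()/(RAND_MAX + 1.0);
    qsort(cut, n+1, sizeof(double), cmp_double); //0 and 1 stay at the ends.
    double moved = 0;
    for (size_t j=0; j< n; j++){
        int sign = (rand() % 2) ? 1 : -1;
        double amt = cut[j+1] - cut[j];  //nonnegative once sorted
        x[j] += sign * amt * step_size;
        moved += amt * step_size;
    }
    free(cut);
    return moved;
}
```

Sorting costs O(n log n) per step, which is small next to a likelihood evaluation; whether the bound matters in practice depends on how the annealing step size is tuned.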
*/ infostruct *i = in; int sign; double amt, scale; double cutpoints[i->beta->size+1]; cutpoints[0] = 0; cutpoints[i->beta->size] = 1; for (size_t j=1; j< i->beta->size; j++) cutpoints[j] = gsl_rng_uniform(r); for (size_t j=0; j< i->beta->size; j++){ sign = (gsl_rng_uniform(r) > 0.5) ? 1 : -1; scale = gsl_vector_get(i->starting_pt, j); amt = cutpoints[j+1]- cutpoints[j]; *gsl_vector_ptr(i->beta, j) += amt * sign * scale * step_size; } annealing_check_constraint(i); } static void annealing_print(void *xp) { apop_vector_show(((infostruct*)xp)->beta); } static void annealing_print2(void *xp) { return; } static void annealing_memcpy(void *xp, void *yp){ infostruct *yi = yp; infostruct *xi = xp; *yi = *xi; yi->beta = apop_vector_copy(xi->beta); } static void *annealing_copy(void *xp){ infostruct *out = malloc(sizeof(infostruct)); annealing_memcpy(xp, out); return out; } static void annealing_free(void *xp){ gsl_vector_free(((infostruct*)xp)->beta); free(xp); } //I abuse the starting point element to hold the list of scaling factors. They can't be zero. static double set_start(double in){ return in ? in : 1; } jmp_buf anneal_jump; static void anneal_sigint(){ longjmp(anneal_jump,1); } static void apop_annealing(infostruct *i){ apop_model *ep = i->model; apop_mle_settings *mp = apop_settings_get_group(ep, apop_mle); Apop_stopif(!mp, ep->error='l'; return, 0, "The model you sent to the MLE function has neither log_likelihood element nor p element."); gsl_siman_params_t simparams = (gsl_siman_params_t) { .n_tries = mp->n_tries, .iters_fixed_T = mp->iters_fixed_T, .step_size = mp->step_size, .k = mp->k, .t_initial = mp->t_initial, .mu_t = mp->mu_t, .t_min = mp->t_min}; gsl_rng *r = mp->rng ? 
mp->rng : apop_rng_get_thread(); //these two are done at apop_maximum_likelihood: //i->beta = apop_data_pack(ep->parameters); //setup_starting_point(mp, i->beta); int betasize = i->beta->size; int apopstatus = -1; i->starting_pt = apop_vector_map(i->beta, set_start); i->use_constraint = 0; //negshell doesn't check it; annealing_step does. gsl_siman_print_t printing_fn = NULL; if (mp && mp->verbose>1) printing_fn = annealing_print; else if (mp && mp->verbose) printing_fn = annealing_print2; annealing_check_constraint(i); //shift starting point if needed. if (setjmp(i->bad_eval_jump)) { apopstatus = -1; goto done; } if (!setjmp(anneal_jump)){ signal(SIGINT, anneal_sigint); gsl_siman_solve(r, // const gsl_rng * r i, // void * x0_p annealing_energy, // gsl_siman_Efunc_t Ef annealing_step, // gsl_siman_step_t take_step annealing_distance, // gsl_siman_metric_t distance printing_fn, // gsl_siman_print_t print_position annealing_memcpy, // gsl_siman_copy_t copyfunc annealing_copy, // gsl_siman_copy_construct_t copy_constructor annealing_free, // gsl_siman_destroy_t destructor betasize, // size_t element_size simparams); // gsl_siman_params_t params } signal(SIGINT, NULL); apop_data_unpack(i->beta, i->model->parameters); apop_estimate_parameter_tests(i->model); apopstatus = 0; done: if (mp->rng) r = NULL; auxinfo(i->model->parameters, i, apopstatus, i->best_ll); } /* This function calls the various GSL root-finding algorithms to find the zero of the score. Cut/pasted/modified from the GSL documentation. */ static apop_model * find_roots (infostruct p) { const gsl_multiroot_fsolver_type *T; gsl_multiroot_fsolver *s; apop_model *dist = p.model; apop_mle_settings *mlep = apop_settings_get_group(dist, apop_mle); int status=0, betasize = p.beta->size, apopstatus = -1; //assume failure until we score a success. size_t iter = 0; gsl_multiroot_function f = {dnegshell, betasize, &p}; T = !strcasecmp(mlep->method, "Newton") ? 
        gsl_multiroot_fsolver_dnewton :
        //In the GSL, hybrids is the internally scaled solver and hybrid is the unscaled one.
        !strcasecmp(mlep->method, "Newton hybrid no scale") ? gsl_multiroot_fsolver_hybrid : gsl_multiroot_fsolver_hybrids;
    s = gsl_multiroot_fsolver_alloc (T, betasize);
    gsl_multiroot_fsolver_set (s, &f, p.beta);
    do {
        iter++;
        if (setjmp(p.bad_eval_jump)) {
            status = GSL_FAILURE; //a failed evaluation is never a success.
            break;
        }
        status = gsl_multiroot_fsolver_iterate (s);
        if (!mlep || mlep->verbose)
            printf ("iter = %3zu x = % .3f f(x) = % .3e\n", iter, gsl_vector_get (s->x, 0), gsl_vector_get (s->f, 0));
        if (status) /* check if solver is stuck */
            break;
        status = gsl_multiroot_test_residual (s->f, mlep->tolerance);
    } while (status == GSL_CONTINUE && iter < mlep->max_iterations);
    if (status == GSL_SUCCESS) apopstatus = 0;
    Apop_notify(2, "status = %s\n", gsl_strerror(status));
    apop_data_unpack(s->x, dist->parameters);
    gsl_multiroot_fsolver_free (s);
    gsl_vector_free (p.beta);
    auxinfo(dist->parameters, &p, apopstatus, 0); //root-finders don't store best val.
    return dist;
}
apophenia-1.0+ds/apop_model.c000066400000000000000000000642011262736346100162200ustar00rootroot00000000000000
/** \file apop_model.c Sets up the estimate structure that the various regressions and MLEs output.*/
/* Copyright (c) 2006--2011 by Ben Klemens. Licensed under the GPLv2; see COPYING. */

#define Declare_type_checking_fns
#include "apop_internal.h"

/** Set up the \c parameters and \c info elements of the \c apop_model. At close, the input model has parameters of the correct size.

\li This is the default action for \ref apop_prep, and many models with a custom prep routine call \ref apop_model_clear at the end. Also, \ref apop_estimate calls this function internally, which means that you probably never have to call this function directly.
\li If the model has already been prepped, this function should be a no-op.

\param data If your params vary with the size of the data set, then the function needs a data set to calibrate against. Otherwise, it's OK to set this to \c NULL.
\param model The model whose output elements will be modified.
\return A pointer to the same model, should you need it. \exception outmodel->error=='d' dimension error. */ apop_model * apop_model_clear(apop_data * data, apop_model *model){ Get_vmsizes(data) int width = msize2 ? msize2 : -firstcol;//use the vector only if there's no matrix. Apop_stopif(model->dsize==-1 && !width, model->error='d', 0, "The model's dsize==-1, meaning size=data width, but the input data has NULL vector and matrix."); Apop_stopif(model->vsize==-1 && !width, model->error='d', 0, "The model's vsize==-1, meaning size=data width, but the input data has NULL vector and matrix."); Apop_stopif(model->msize1==-1 && !width, model->error='d', 0, "The model's msize1==-1, meaning size=data width, but the input data has NULL vector and matrix."); Apop_stopif(model->msize2==-1 && !width, model->error='d', 0, "The model's msize2==-1, meaning size=data width, but the input data has NULL vector and matrix."); model->dsize = (model->dsize == -1 ? width : model->dsize); vsize = model->vsize == -1 ? width : model->vsize; msize1 = model->msize1 == -1 ? width : model->msize1 ; msize2 = model->msize2 == -1 ? width : model->msize2 ; if (!model->parameters && (vsize || msize1*msize2)) model->parameters = apop_data_alloc(vsize, msize1, msize2); if (!model->info) model->info = apop_data_alloc(); if (model->info->names->title && !strlen(model->info->names->title)) free(model->info->names->title); Asprintf(&model->info->names->title, ""); if (!model->data) model->data = data; return model; } /** Free an \ref apop_model structure. \li The \c parameters and \c settings are freed. These are the elements that are copied by \c apop_model_copy. \li The \c data element is not freed, because the odds are you still need it. \li If free_me->more_size is positive, the function runs free(free_me->more). But it has no idea what the \c more element contains; if it points to other structures (like an \ref apop_data set), you need to free them before calling this function. 
\li If \c free_me is \c NULL, this does nothing. \param free_me A pointer to the model to be freed. */ void apop_model_free (apop_model * free_me){ if (!free_me) return; apop_data_free(free_me->parameters); if (free_me->settings){ int i=0; while (free_me->settings[i].name[0]){ if (free_me->settings[i].free) ((void (*)(void*))(free_me->settings[i].free))(free_me->settings[i].setting_group); i++; } free(free_me->settings); } if (free_me->more_size) free(free_me->more); if (free_me->info) apop_data_free(free_me->info); free(free_me); } /** Print the results of an estimation for a human to look over. \param model The model whose information should be displayed (No default. If \c NULL, print NULL) \param output_pipe The output stream. Default: \c stdout. If you'd like something else, use \c fopen. E.g.: \code FILE *out =fopen("outfile.txt", "w"); //or "a" to append. apop_model_print(the_model, out); fclose(out); //optional in many cases. \endcode \li The default prints the name, parameters, info, &c. but I check a vtable for alternate methods you define; see \ref vtables for details. The typedef new functions must conform to and the hash used for lookups are: \code typedef void (*apop_model_print_type)(apop_model *params, FILE *out); #define apop_model_print_hash(m1) ((m1)->log_likelihood ? (size_t)(m1)->log_likelihood : \ (m1)->p ? (size_t)(m1)->p*33 : \ (m1)->estimate ? (size_t)(m1)->estimate*33*33 : \ (m1)->draw ? (size_t)(m1)->draw*33*27 : \ (m1)->cdf ? (size_t)(m1)->cdf*27*27 : 27) \endcode When building a special print method, all output should \c fprintf to the input \c FILE* handle. Apophenia's output routines also accept a file handle; e.g., if the file handle is named \c out, then if the \c thismodel print method uses \c apop_data_print to print the parameters, it must do so via a form like apop_data_print(thismodel->parameters, .output_pipe=ap). 
Your \c print method can use both the default output and its own additions by masking itself for a few lines:

\code
void print_method(apop_model *in, FILE* ap){
    void *temp = in->estimate;
    in->estimate = NULL;
    apop_model_print(in, ap);
    in->estimate = temp;

    fprintf(ap, "Additional info:\n");
    ...
}
\endcode

\li Print methods are intended for human consumption and are subject to change.
\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
void apop_model_print(apop_model * model, FILE *output_pipe){
#else
apop_varad_head(void, apop_model_print){
    FILE * apop_varad_var(output_pipe, stdout);
    apop_model* apop_varad_var(model, NULL);
    if (!model) {fprintf(output_pipe, "NULL\n"); return;}
    apop_model_print_base(model, output_pipe);
}

 void apop_model_print_base(apop_model * model, FILE *output_pipe){
#endif
    apop_model_print_type mpf = apop_model_print_vtable_get(model);
    if (mpf){
        mpf(model, output_pipe);
        return;
    }
    if (strlen(model->name)) fprintf (output_pipe, "%s", model->name);
    fprintf(output_pipe, "\n\n");
    if (model->parameters) apop_data_print(model->parameters, .output_pipe=output_pipe);
    Get_vmsizes(model->info); //maxsize
    if (model->info && maxsize) apop_data_print(model->info, .output_pipe=output_pipe);
}

/* Alias for \ref apop_model_print. Use that one. */
void apop_model_show (apop_model * print_me){
    apop_model_print(print_me, NULL);
}

/** Outputs a copy of the \ref apop_model input.
\param in The model to be copied
\return A copy of the original. Includes copies of all settings groups, and the \c parameters (if not \c NULL, copied via \ref apop_data_copy).

\li If <tt>in->more_size > 0</tt>, I allocate a new \c more element and memcpy the contents of the original model's \c more element into it.
\li The data set at \c in->data is not copied; the copy points to the same data set.

\exception out->error=='a' Allocation error. In extreme cases, where there aren't even a few hundred bytes available, I will return \c NULL.
\exception out->error=='s' Error copying settings groups.
\exception out->error=='p' Error copying parameters or info page; the given \ref apop_data struct may be \c NULL or may have its own ->error element. */ apop_model * apop_model_copy(apop_model *in){ Apop_stopif(!in, return NULL, 1, "Copying a NULL input; returning NULL."); apop_model * out = malloc(sizeof(apop_model)); Apop_stopif(!out, return NULL, 0, "Serious allocation error; returning NULL."); memcpy(out, in, sizeof(apop_model)); if (in->more_size){ out->more = malloc(in->more_size); Apop_stopif(!out->more, out->error='a'; return out, 0, "Allocation error setting up the ->more pointer."); memcpy(out->more, in->more, in->more_size); } int i=0; out->settings = NULL; if (in->settings) do apop_settings_copy_group(out, in, in->settings[i].name); while (strlen(in->settings[i++].name)); out->parameters = apop_data_copy(in->parameters); Apop_stopif(in->parameters && (!out->parameters || out->parameters->error), out->error='p'; return out, 0, "Error copying the model parameters."); out->info = apop_data_copy(in->info); Apop_stopif(in->info && (!out->info || out->info->error), out->error='p'; return out, 0, "Error copying the info segment."); return out; } /** \def apop_model_set_parameters(in, ...) Take in an unparameterized \c apop_model and return a new \c apop_model with the given parameters. For example, if you need a N(0,1) quickly: \code apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1); \endcode This doesn't take in data, so it won't work with models that take the number of parameters from the data, and it will only set the vector of the model's parameter \ref apop_data set. This is most standard models. 
If you have a situation where these options are out, you could \li manually set Set \c .vsize and/or \c .msize1 and \c .msize2 first, then call this function, or \li prep the model via something like apop_model *new = apop_model_copy(in); apop_prep(your_data, new); (because \ref apop_prep is required to correctly allocate \c new->parameters to conform to your data). \param in An unparameterized model, like \ref apop_normal or \ref apop_poisson. \param ... The list of parameters. \return A copy of the input model, with parameters set. \exception out->error=='d' dimension error: you gave me a model with an indeterminate number of parameters. See notes above. Set \c .vsize or \c .msize1 and \c .msize2 first, then call this function, or use apop_model *new = apop_model_copy(in); apop_prep(your_data, new); and then call this . \see apop_data_fill \hideinitializer */ apop_model *apop_model_set_parameters_base(apop_model *in, double ap[]){ apop_model *out = apop_model_copy(in); apop_prep(NULL, out); Apop_stopif((in->vsize == -1) || (in->msize1 == -1) || (in->msize2 == -1), out->error='d', 0, "This function only works with models whose number of params does not " "depend on data size. You'll have to use apop_model *new = apop_model_copy(in); " " apop_model_clear(your_data, in); and then set in->parameters using your data."); apop_data_fill_base(out->parameters, ap); return out; } /** Estimate the parameters of a model given data. This function copies the input model, preps it (see \ref apop_prep), and calls \c m.estimate(d, m) (which users are encouraged to never call directly). If your model has no \c estimate method, then call \c apop_maximum_likelihood(d, m), with the default MLE settings. \param d The data \param m The model \return A pointer to an output model, which typically matches the input model but has its \c parameters element filled in. 
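A minimal, self-contained sketch of the copy-then-dispatch flow described above. It uses a hypothetical toy_model struct rather than the real \c apop_model, and toy_estimate, toy_mean, and toy_default_search are illustrative names that are not part of Apophenia:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

//Hypothetical stand-in for apop_model: one parameter, optional estimate method.
typedef struct toy_model {
    double parameter;
    int used_default;   //set to 1 when the fallback search ran
    void (*estimate)(const double *data, size_t n, struct toy_model *m);
} toy_model;

//A model-specific estimate method: the sample mean.
static void toy_mean(const double *data, size_t n, toy_model *m){
    double sum = 0;
    for (size_t i=0; i< n; i++) sum += data[i];
    m->parameter = n ? sum/n : 0;
}

//Stand-in for the generic fallback (the role apop_maximum_likelihood plays).
static void toy_default_search(const double *data, size_t n, toy_model *m){
    toy_mean(data, n, m);
    m->used_default = 1;
}

//Copy the input model, then use its own estimate method if it has one.
//The caller frees the returned copy; the input model is left untouched.
toy_model *toy_estimate(const double *data, size_t n, const toy_model *in){
    toy_model *out = malloc(sizeof(toy_model));
    assert(out);
    memcpy(out, in, sizeof(toy_model));
    if (out->estimate) out->estimate(data, n, out);
    else               toy_default_search(data, n, out);
    return out;
}
```

Because the input is copied first, one unparameterized model can be estimated against several data sets without being modified.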
*/ apop_model *apop_estimate(apop_data *d, apop_model *m){ apop_model *out = apop_model_copy(m); apop_prep(d, out); if (out->estimate) out->estimate(d, out); else apop_maximum_likelihood(d, out); return out; } /** Find the probability of a data/parametrized model pair. \param d The data \param m The parametrized model, which must have either a \c log_likelihood or a \c p method. */ double apop_p(apop_data *d, apop_model *m){ Nullcheck_m(m, GSL_NAN); if (m->p) return m->p(d, m); else if (m->log_likelihood) return exp(m->log_likelihood(d, m)); Apop_stopif(0, , 0, "You asked for the probability of a model that has neither p nor log_likelihood methods."); return GSL_NAN; } /** Find the log likelihood of a data/parametrized model pair. \param d The data \param m The parametrized model, which must have either a \c log_likelihood or a \c p method. */ double apop_log_likelihood(apop_data *d, apop_model *m){ Nullcheck_m(m, GSL_NAN); //Nullcheck_p(m); //Too many models don't use the params. if (m->log_likelihood) return m->log_likelihood(d, m); else if (m->p) return log(m->p(d, m)); Apop_stopif(0, , 0, "You asked for the log likelihood of a model that has neither p nor log_likelihood methods."); return GSL_NAN; } /** Find the vector of first derivatives (aka the gradient) of the log likelihood of a data/parametrized model pair. On input, the model \c m must already be sufficiently prepped that the log likelihood can be evaluated; see \ref psubsection for details. On output, the \c gsl_vector input to the function will be filled with the gradients (or NaNs on errors). If the model parameters have a more complex shape than a simple vector, then the vector will be in \c apop_data_pack order; use \c apop_data_unpack to reformat to the preferred shape. \param d The \ref apop_data set at which the score is being evaluated. \param out The score to be returned. I expect you to have allocated this already. 
\param m The parametrized model, which must have either a \c log_likelihood or a \c p method.

\li The default is to use \ref apop_numerical_gradient, but special-case calculations for certain models are held in a vtable; see \ref vtables for details. The typedef that new functions must conform to, and the hash used for lookups, are:

\code
typedef void (*apop_score_type)(apop_data *d, gsl_vector *gradient, apop_model *m);
#define apop_score_hash(m1) ((size_t)((m1).log_likelihood ? (m1).log_likelihood : (m1).p))
\endcode
*/
void apop_score(apop_data *d, gsl_vector *out, apop_model *m){
    Nullcheck_m(m, );
    Apop_stopif(!out, return, 0, "out vector is NULL. It must be pre-allocated to the correct size. E.g., gsl_vector *out = gsl_vector_alloc(m->vsize + m->msize1*m->msize2).");
    apop_score_type ms = apop_score_vtable_get(m);
    if (ms){
        ms(d, out, m);
        return;
    }
    gsl_vector * numeric_default = apop_numerical_gradient(d, m);
    gsl_vector_memcpy(out, numeric_default);
    gsl_vector_free(numeric_default);
}

Apop_settings_init(apop_pm,
    //defaults include base=NULL, index=0, own_rng=0
    Apop_varad_set(rng, NULL);
    Apop_varad_set(draws, 1e4);
)
Apop_settings_copy(apop_pm,)
Apop_settings_free(apop_pm, )

void distract_doxygen(){/*Doxygen gets thrown by the settings macros. This decoy function is a workaround. */}

/** Get a model describing the distribution of the given parameter estimates.

For many models, the distribution of the parameter estimates is well known, such as the \f$t\f$-distribution of the parameters for OLS.

For models where the distribution of \f$\hat{p}\f$ is not known, if you give me data, I will return an \ref apop_normal or \ref apop_multivariate_normal model, using the parameter estimates as mean and \ref apop_bootstrap_cov for the variances.

If you don't give me data, then I will assume that this is a stochastic model where re-running the model will produce different parameter estimates each time.
In this case, I will run the model 1e4 times (the default; see \c draws below) and return a \ref apop_pmf model with the resulting parameter distributions.

Before calling this, I expect that you have already run \ref apop_estimate to produce \f$\hat{p}\f$.

The \ref apop_pm_settings structure dictates details of how the model is generated. For example, if you want only the distribution of the parameter at index 3, and you know the distribution will be a PMF generated via random draws, then set settings and call the model via:

\code
Apop_settings_add_group(your_model, apop_pm, .index =3, .draws=3e5);
apop_model *dist = apop_parameter_model(your_data, your_model);
\endcode

Some useful parts of \ref apop_pm_settings:
\li \c index gives the position of the parameter (in \ref apop_data_pack order) in which you are interested. Thus, if this is zero or more, then you will get a univariate output distribution describing a single parameter. If index == -1, then I will give you the multivariate distribution across all parameters. The default is zero (i.e. the univariate distribution of the zeroth parameter).
\li \c draws If there is no closed-form solution and bootstrap is inappropriate, then the last resort is a large number of random draws of the model, summarized into a PMF. Default: 10,000 draws (1e4).
\li \c rng If the method requires random draws, then use this. If you provide \c NULL and one is needed, I provide one for you via \ref apop_rng_get_thread.

The default is via resampling as above, but special-case calculations for certain models are held in a vtable; see \ref vtables for details. The typedef that new functions must conform to, and the hash used for lookups, are:

\code
typedef apop_model* (*apop_parameter_model_type)(apop_data *, apop_model *);
#define apop_parameter_model_hash(m1) ((size_t)((m1).log_likelihood ? (m1).log_likelihood : (m1).p)*33 + (m1).estimate ?
(size_t)(m1).estimate: 27) \endcode */ apop_model *apop_parameter_model(apop_data *d, apop_model *m){ apop_pm_settings *settings = apop_settings_get_group(m, apop_pm); if (!settings) settings = Apop_settings_add_group(m, apop_pm, .base= m); apop_parameter_model_type pm = apop_parameter_model_vtable_get(m); if (pm) return pm(d, m); else if (d){ Get_vmsizes(m->parameters);//vsize, msize1, msize2 apop_model *out = apop_model_copy(apop_multivariate_normal); out->msize1 = out->vsize = out->msize2 = out->dsize = vsize+msize1+msize2; out->parameters = apop_bootstrap_cov(d, m, settings->rng, settings->draws); out->parameters->vector = apop_data_pack(m->parameters); if (settings->index == -1) return out; else { apop_model *out2 = apop_model_set_parameters(apop_normal, apop_data_get(out->parameters, settings->index, -1), //mean apop_data_get(out->parameters, settings->index, settings->index)//var ); apop_model_free(out); return out2; } } //else Get_vmsizes(m->parameters);//vsize, msize1, msize2 apop_data *param_draws = apop_data_alloc(0, settings->draws, vsize+msize1+msize2); for (int i=0; i < settings->draws; i++){ apop_model *mm = apop_estimate (NULL, m);//If you're here, d==NULL. apop_data_pack(mm->parameters, Apop_rv(param_draws, i)); apop_model_free(mm); } if (settings->index == -1) return apop_estimate(param_draws, apop_pmf); else { apop_data *param_draws1 = apop_data_alloc(settings->draws, 0,0); gsl_vector *the_draws = Apop_cv(param_draws, settings->index); gsl_vector_memcpy(param_draws1->vector, the_draws); apop_data_free(param_draws); return apop_estimate(param_draws1, apop_pmf); } } extern apop_model *apop_swap_model; //apop_missing_data.c int apop_model_metropolis_draw(double *out, gsl_rng* rng, apop_model *params);//apop_update.c /** Draw from a model. \param out An already-allocated array of doubles to be filled by the draw method. It must have size m->dsize. \param r A \c gsl_rng, probably allocated via \ref apop_rng_alloc. 
Optional; if \c NULL, then I will call \ref apop_rng_get_thread for an RNG.
\param m The model from which to make draws.

\li If the model has its own \c draw method, then this function will call it.
\li Else, if the model is univariate, use \ref apop_arms_draw to generate random draws.
\li Else, if the model is multivariate, use \ref apop_model_metropolis to generate random draws.
\li This makes a single draw of the given size. See \ref apop_model_draws to fill a matrix with draws.

\return Zero on success; nonzero on failure. out[0] is probably \c NAN on failure.
*/
int apop_draw(double *out, gsl_rng *r, apop_model *m){
    if (!r) r = apop_rng_get_thread(-1);
    if (m->draw)
        return m->draw(out, r, m);
    else if (m->dsize == 1)
        return apop_arms_draw(out, r, m);
    //Else, MCMC, possibly setting it up first.
        //generate a model with data/params reversed
        //estimate mcmc. Swapped model will be stored as settings->base_model.
    OMP_critical (apop_draw)
    if (!Apop_settings_get_group(m, apop_mcmc)){
        apop_model *swapped = apop_model_copy(apop_swap_model);
        swapped->more = m;
        swapped->msize1 = 1;
        swapped->msize2 = m->dsize;
        swapped->data = m->parameters;
        Apop_settings_add_group(swapped, apop_mcmc, .burnin=0.999, .periods=1000);
        apop_model *est = apop_model_metropolis(m->parameters, r, swapped); //leak.
        m->draw = apop_model_metropolis_draw;
        apop_settings_copy_group(m, est, "apop_mcmc");
    }
    return apop_draw(out, r, m);
}

/** Allocate and initialize the \c parameters, \c info, and other requisite parts of a \ref apop_model.

Some models have associated prep routines that also attach settings groups to the model, and set up additional special-case functions in vtables.

\li The input model is modified in place.
\li If called repeatedly, subsequent calls to \ref apop_prep are no-ops. Thus, a model cannot be re-prepped using a new data set or other conditions.
\li The default prep is to simply call \ref apop_model_clear. If the input \ref apop_model has a prep method, then that gets called instead.
*/ void apop_prep(apop_data *d, apop_model *m){ if (m->prep) m->prep(d, m); else apop_model_clear(d, m); } static double disnan(double in) {return gsl_isnan(in);} /** A prediction supplies E(a missing value | original data, already-estimated parameters, and other supplied data elements ). For a regression, one would first estimate the parameters of the model, then supply a row of predictors X. The value of the dependent variable \f$y\f$ is unknown, so the system would predict that value. For a univariate model (i.e. a model in one-dimensional data space), there is only one variable to omit and fill in, so the prediction problem reduces to the expected value: E(a missing value | original data, already-estimated parameters). [In some models, this may not be the expected value, but is a best value for the missing item using some other meaning of `best'.] In other cases, prediction is the missing data problem: for three-dimensional data, you may supply the input (34, \c NaN, 12), and the parameterized model provides the most likely value of the middle parameter given the parameters and known data. \li If you give me a \c NULL data set, I will assume you want all values filled in, for most models with the expected value. \li If you give me data with \c NaNs, I will take those as the points to be predicted given the provided data. If the model has no \c predict method, the default is to use the \ref apop_ml_impute function to do the work. That function does a maximum-likelihood search for the best parameters. \return If you gave me a non-\c NULL data set, I will return that, with the \c NaNs filled in. If \c NULL input, I will allocate an \ref apop_data set and fill it with the expected values. There may be a second page (i.e., a \ref apop_data set attached to the ->more pointer of the main) listing confidence and standard error information. See your specific model documentation for details. 
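
The convention that \c NaNs mark the cells to be predicted can be illustrated without the library. The following is only a sketch under stated assumptions: \c fill_missing is a hypothetical helper, and the caller-supplied \c prediction stands in for whatever value the model's \c predict method or the imputation search would actually produce.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

// Replace every NaN in row[0..n-1] with the given prediction, in place,
// and return the number of cells that were filled. Known values are kept.
size_t fill_missing(double *row, size_t n, double prediction){
    assert(row || n == 0);
    size_t filled = 0;
    for (size_t i = 0; i < n; i++)
        if (isnan(row[i])){
            row[i] = prediction;
            filled++;
        }
    return filled;
}
```

For the three-dimensional input (34, \c NaN, 12) discussed above, only the middle cell would be overwritten, which matches the in-place behavior of the return value described here.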
\li Special-case calculations for certain models are held in a vtable; see \ref vtables for details. The typedef new functions must conform to and the hash used for lookups are: \code typedef apop_data * (*apop_predict_type)(apop_data *d, apop_model *params); #define apop_predict_hash(m1) ((size_t)((m1).log_likelihood ? (m1).log_likelihood : (m1).p)*33 + (m1).estimate ? (size_t)(m1).estimate: 27) \endcode */ apop_data *apop_predict(apop_data *d, apop_model *m){ apop_data *prediction = NULL; apop_data *out = d ? d : apop_data_alloc(0, 1, m->dsize); if (!d) gsl_matrix_set_all(out->matrix, GSL_NAN); apop_predict_type mp = apop_predict_vtable_get(m); if (mp) prediction = mp(out, m); if (prediction) return prediction; if (!apop_map_sum(out, disnan)) return out; //default: apop_model *f = apop_ml_impute(out, m); apop_model_free(f); return out; } /* Are all the elements of v less than or equal to the corresponding elements of the reference vector? */ static int lte(gsl_vector *v, gsl_vector *ref){ for (int i=0; i< v->size; i++) if(v->data[i] > gsl_vector_get(ref, i)) return 0; return 1; } /** Input a one-row data point/vector and a model; returns the area of the model's PDF beneath the given point. By default, make random draws from the PDF and return the percentage of those draws beneath or equal to the given point. Many models have closed-form solutions that make no use of random draws. See also \ref apop_cdf_settings, which is the structure used to store draws already made (which means the second, third, ... calls to this function will take much less time than the first), the \c gsl_rng, and the number of draws to be made. 
These are handled without your involvement, but if you would like to change the number of draws from the default, add this group before calling \ref apop_cdf : \code Apop_model_add_group(your_model, apop_cdf, .draws=1e5, .rng=my_rng); double cdf_value = apop_cdf(your_data_point, your_model); \endcode \li Only the first row of the input \ref apop_data set is used. Note that if you need to view row 20 of a data set as a one-row data set, use \ref Apop_r. Here are many examples using common, mostly symmetric distributions. \include some_cdfs.c */ double apop_cdf(apop_data *d, apop_model *m){ if (m->cdf) return m->cdf(d, m); apop_cdf_settings *cs = Apop_settings_get_group(m, apop_cdf); if (!cs) cs = Apop_model_add_group(m, apop_cdf); long int tally = 0; gsl_vector *ref = apop_data_pack(Apop_r(d, 0)); if (!cs->draws_made){ if (m->dsize == -1) apop_prep(d, m); Apop_stopif(m->dsize==0, return GSL_NAN, 0, "I need to make random draws from your model, but it has dsize==0. Returning NaN"); cs->draws_made = gsl_matrix_alloc(cs->draws, m->dsize); for (int i=0; i< cs->draws; i++) apop_draw((Apop_mrv(cs->draws_made, i))->data, cs->rng, m); } for (int i=0; i< cs->draws_made->size1; i++) tally += lte(Apop_mrv(cs->draws_made, i), ref); gsl_vector_free(ref); return tally/(double)cs->draws_made->size1; } Apop_settings_init(apop_cdf, Apop_varad_set(draws, 1e4); Apop_varad_set(rng, NULL); out->draws_refcount = malloc(sizeof(int)); *out->draws_refcount = 1; ) Apop_settings_free(apop_cdf, if (in->draws_made && !--*in->draws_refcount) gsl_matrix_free(in->draws_made); ) Apop_settings_copy(apop_cdf, ++*out->draws_refcount; ) apophenia-1.0+ds/apop_name.c000066400000000000000000000176031262736346100160440ustar00rootroot00000000000000 /** \file apop_name.c */ /* Copyright (c) 2006--2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include #include /** Allocates a name structure \return An allocated, empty name structure. 
In the very unlikely event that \c malloc fails, return \c NULL. Because \ref apop_data_alloc uses this to set up its output, you will rarely if ever need to call this function explicitly. You may want to use it if wrapping a \c gsl_matrix into an \ref apop_data set. For example, to put a title on a vector: \code apop_data *d = &(apop_data){.vector=your_vector, .names=apop_name_alloc()}; apop_name_add(d->names, "A column of numbers", 'v'); apop_data_print(d); ... apop_name_free(d->names); //but d itself is auto-allocated; no need to free it. \endcode */ apop_name * apop_name_alloc(void){ apop_name * init_me = malloc(sizeof(apop_name)); Apop_stopif(!init_me, return NULL, 0, "malloc failed. Probably out of memory."); *init_me = (apop_name){ }; return init_me; } /** Adds a name to the \ref apop_name structure. Puts it at the end of the given list. \param n An existing, allocated \ref apop_name structure. \param add_me A string. If \c NULL, do nothing; return -1. \param type 'r': add a row name
'c': add a matrix column name
't': add a text column name
'h': add a title (i.e., a header).
'v': add (or overwrite) the vector name
\return Returns the number of rows/cols/depvars after you have added the new one. But if \c add_me is \c NULL, return -1. */ int apop_name_add(apop_name * n, char const *add_me, char type){ if (!add_me) return -1; if (type == 'h'){ free(n->title); Asprintf(&n->title, "%s", add_me); return 1; } if (type == 'v'){ n->vector = realloc(n->vector, strlen(add_me) + 1); strcpy(n->vector, add_me); return 1; } if (type == 'r'){ n->rowct++; n->row = realloc(n->row, sizeof(char*) * n->rowct); n->row[n->rowct -1] = malloc(strlen(add_me) + 1); strcpy(n->row[n->rowct -1], add_me); return n->rowct; } if (type == 't'){ n->textct++; n->text = realloc(n->text, sizeof(char*) * n->textct); n->text[n->textct -1] = malloc(strlen(add_me) + 1); strcpy(n->text[n->textct -1], add_me); return n->textct; } //else assume (type == 'c') Apop_stopif(type != 'c', /*keep going.*/, 2,"You gave me >%c<, I'm assuming you meant c; " " copying column names.", type); n->colct++; n->col = realloc(n->col, sizeof(char*) * n->colct); n->col[n->colct -1] = malloc(strlen(add_me) + 1); strcpy(n->col[n->colct -1], add_me); return n->colct; } /** Prints the given list of names to stdout. Useful for debugging. \param n The \ref apop_name structure */ void apop_name_print(apop_name * n){ if (!n) { printf("NULL"); return; } if (n->title) printf("title: %s\n", n->title); if (n->vector){ printf("vector:"); printf("\t%s\n", n->vector); } if (n->colct > 0){ printf("column:"); for (int i=0; i < n->colct; i++) printf("\t%s", n->col[i]); printf("\n"); } if (n->textct > 0){ printf("text:"); for (int i=0; i < n->textct; i++) printf("\t%s", n->text[i]); printf("\n"); } if (n->rowct > 0){ printf("row:"); for (int i=0; i < n->rowct; i++) printf("\t%s", n->row[i]); printf("\n"); } } /** Free the memory used by an \ref apop_name structure. 
*/ void apop_name_free(apop_name * free_me){ if (!free_me) return; //only needed if users are doing tricky things like newdata = (apop_data){.matrix=...}; for (size_t i=0; i < free_me->colct; i++) free(free_me->col[i]); for (size_t i=0; i < free_me->textct; i++) free(free_me->text[i]); for (size_t i=0; i < free_me->rowct; i++) free(free_me->row[i]); if (free_me->vector) free(free_me->vector); free(free_me->col); free(free_me->text); free(free_me->row); free(free_me); } /** Append one list of names to another. If the first list is empty, then this is a copy function. \param n1 The first set of names (no default, must not be \c NULL) \param nadd The second set of names, which will be appended after the first. (no default. If \c NULL, a no-op.) \param type1 Either 'c', 'r', 't', or 'v' stating whether you are merging the columns, rows, text, or vector. If 'v', then ignore \c typeadd and just overwrite the target vector name with the source name. (default: 'r') \param typeadd Either 'c', 'r', 't', or 'v' stating whether you are merging the columns, rows, or text. If 'v', then overwrite the target with the source vector name. 
(default: type1)
*/
#ifdef APOP_NO_VARIADIC
void apop_name_stack(apop_name * n1, apop_name *nadd, char type1, char typeadd){
#else
apop_varad_head(void, apop_name_stack){
    apop_name * apop_varad_var(nadd, NULL);
    if (!nadd) return;
    apop_name * apop_varad_var(n1, NULL);
    Apop_stopif(!n1, return, 0, "Can't stack onto a NULL set of names (which n1 is).");
    char apop_varad_var(type1, 'r');
    char apop_varad_var(typeadd, type1);
    apop_name_stack_base(n1, nadd, type1, typeadd);
}

void apop_name_stack_base(apop_name * n1, apop_name *nadd, char type1, char typeadd){
#endif
    int i;
    apop_name counts = (apop_name){.rowct=nadd->rowct, .textct = nadd->textct, .colct = nadd->colct};//Necessary when stacking onto self.;
    if (typeadd == 'v')
        apop_name_add(n1, nadd->vector, 'v');
    else if (typeadd == 'r')
        for (i=0; i< counts.rowct; i++)
            apop_name_add(n1, nadd->row[i], type1);
    else if (typeadd == 't')
        for (i=0; i< counts.textct; i++)
            apop_name_add(n1, nadd->text[i], type1);
    else if (typeadd == 'c')
        for (i=0; i< counts.colct; i++)
            apop_name_add(n1, nadd->col[i], type1);
    else Apop_notify(1, "'%c' sent to apop_name_stack, but the only "
                        "valid options are r t c v. Doing nothing.", typeadd);
}

/** Copy one \ref apop_name structure to another. That is, all data is duplicated.

Used internally by \ref apop_data_copy, but sometimes useful by itself. For example, say that we have an \ref apop_data struct named \c d and a \ref gsl_matrix of the same dimensions named \c m; we could give \c m the labels from \c d for printing:

\code
apop_data *wrapped = &(apop_data){.matrix=m, .names=apop_name_copy(d->names)};
apop_data_print(wrapped);
apop_name_free(wrapped->names); //wrapped itself is auto-allocated; do not free.
\endcode

\param in The input names
\return A \ref apop_name struct with copies of all input names.
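
The copy is built by stacking each list of names onto a freshly allocated, empty structure. A reduced, standalone sketch of that idea follows, using a single list of strings. The \c name_list type and its three functions are hypothetical miniatures written for this illustration, not the \ref apop_name API; the up-front count mirrors the "stacking onto self" precaution in \c apop_name_stack.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical miniature of a name list: an array of owned strings and a count.
typedef struct { char **items; int ct; } name_list;

// Append a private copy of add_me. Returns the new count, or -1 if
// add_me is NULL or an allocation fails (the list is then unchanged).
int name_list_add(name_list *n, char const *add_me){
    assert(n);
    if (!add_me) return -1;
    char *copy = malloc(strlen(add_me) + 1);
    if (!copy) return -1;
    strcpy(copy, add_me);
    char **grown = realloc(n->items, sizeof(char*) * (n->ct + 1));
    if (!grown) { free(copy); return -1; }
    n->items = grown;
    n->items[n->ct++] = copy;
    return n->ct;
}

// Append every name in src to dest. The count is read once, up front,
// so that stacking a list onto itself terminates.
void name_list_stack(name_list *dest, name_list const *src){
    int ct = src->ct;
    for (int i = 0; i < ct; i++)
        name_list_add(dest, src->items[i]);
}

// Free each string, then the array, and reset the list to empty.
void name_list_free(name_list *n){
    for (int i = 0; i < n->ct; i++) free(n->items[i]);
    free(n->items);
    *n = (name_list){0};
}
```

Stacking onto an empty list yields a deep copy: the two lists hold equal strings at distinct addresses, so each can be freed independently.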
*/ apop_name * apop_name_copy(apop_name *in){ apop_name *out = apop_name_alloc(); apop_name_stack(out, in, 'v'); apop_name_stack(out, in, 'c'); apop_name_stack(out, in, 'r'); apop_name_stack(out, in, 't'); Asprintf(&out->title, "%s", in->title); return out; } /** Finds the position of an element in a list of names. The function uses POSIX's \c strcasecmp, and so does case-insensitive search the way that function does. \param n the \ref apop_name object to search. \param name the name you seek; see above. \param type \c 'c' (=column), \c 'r' (=row), or \c 't' (=text). Default is \c 'c'. \return The position of \c findme. If \c 'c', then this may be -1, meaning the vector name. If not found, returns -2. On error, e.g. name==NULL, returns -2. */ int apop_name_find(const apop_name *n, const char *name, const char type){ Apop_stopif(!name, return -2, 0, "You asked me to search for NULL."); char **list; int listct; if (type == 'r' || type == 'R'){ list = n->row; listct = n->rowct; } else if (type == 't' || type == 'T'){ list = n->text; listct = n->textct; } else { // default type == 'c' list = n->col; listct = n->colct; } for (int i = 0; i < listct; i++) if (!strcasecmp(name, list[i])) return i; if ((type=='c' || type == 'C') && n->vector && !strcasecmp(name, n->vector)) return -1; return -2; } apophenia-1.0+ds/apop_output.c000066400000000000000000000364521262736346100164670ustar00rootroot00000000000000 /** \file Some printing and output interface functions. */ /* Copyright (c) 2006--2007, 2009 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ //The reader will find a few function headers for this file in asst.h #include "apop_internal.h" #define Output_vars output_name, output_pipe, output_type, output_append #define Output_declares char const * output_name, FILE * output_pipe, char output_type, char output_append /** If you're reading this, it is probably because you were referred by another function that uses this internally. 
You should never call this function directly, but do read this documentation.

There are four settings that affect how output happens, which can be set when you call the function that sent you to this documentation, e.g.:

\code
apop_data_print(your_data, .output_type ='f', .output_append = 'w');
\endcode

\param output_name The name of the output file, if any. For a database, the table to write.
\param output_pipe If you have already opened a file and have a \c FILE* on hand, use this instead of giving the file name.
\param output_type \c 'p' = pipe, \c 'f'= file, \c 'd' = database. If unset, the type is inferred: \c 'f' if only \c output_name is given, \c 'p' if only \c output_pipe is given, and \c 's' (the screen, i.e., stdout) if neither is given.
\param output_append \c 'a' = append, \c 'w' = write over (default).

At the end, \c output_name, \c output_pipe, and \c output_type are all set. Notably, the local \c output_pipe will have the correct location for the calling function to \c fprintf to.

\li See \ref legi for more discussion.
\li The default is output to stdout. For example,
\code
apop_data_print(your_data);
//is equivalent to
apop_data_print(your_data, .output_type='p', .output_pipe=stdout);
\endcode
\li Tip: if writing to the database, you can get a major speed boost by wrapping the call in a begin/commit wrapper:
\code
apop_query("begin;");
apop_data_print(your_data, .output_name="dbtab", .output_type='d');
apop_query("commit;");
\endcode

\ingroup all_public
*/
int apop_prep_output(char const *output_name, FILE ** output_pipe, char *output_type, char *output_append){
    *output_append = *output_append ? *output_append : 'w';

    if (!output_name && !*output_pipe && !*output_type)
        *output_type = 's';
    else if (output_name && !*output_pipe && !*output_type)
        *output_type = 'f';
    else if (!output_name && *output_pipe && !*output_type)
        *output_type = 'p';

    if (*output_type =='p')
        *output_pipe = *output_pipe ? *output_pipe: stdout;
    else if (*output_type =='d')
        *output_pipe = stdout;  //won't be used.
    else
        *output_pipe = output_name
                        ? fopen(output_name, *output_append == 'a' ? "a" : "w")
                        : stdout;
    //Check the FILE* that fopen returned, not the address of the caller's variable.
    Apop_stopif(!*output_pipe && output_name, return -1, 0, "Trouble opening file %s.", output_name);
    return 0;
}

#define Dispatch_output                                  \
    char const *apop_varad_var(output_name, NULL);       \
    FILE * apop_varad_var(output_pipe, NULL);            \
    char apop_varad_var(output_type, 0);                 \
    char apop_varad_var(output_append, 0);               \
    Apop_stopif(apop_prep_output(output_name, &output_pipe, &output_type, &output_append), \
            return, 0, "Trouble preparing to write output.");

/////The printing functions.

static void white_pad(int ct){
    for(size_t i=0; i < ct; i ++)
        printf(" ");
}

/* This function prettyprints the \c apop_data set to a screen.

It is currently not in the documentation. It'd be nice to merge this w/apop_data_print.

This takes a lot of machinery. I write every last element to a text array, then measure column widths, then print to screen with padding to guarantee that everything lines up. There's no way to have the first element of a column line up with the last unless you interrogate the width of every element in the column, so printing columns really can't be a one-pass process.

So, I produce an \ref apop_data set with no numeric elements and a text element to be filled with the input data set, and then print that. That means that I'll be using (more than) twice the memory to print this. If this is a problem, you can use \ref apop_data_print to dump your data to a text file, and view the text file, or print subsets.

For more machine-readable printing, see \ref apop_data_print.
*/
void apop_data_show(const apop_data *in){
    if (!in) {printf("NULL\n"); return;}
    Get_vmsizes(in) //vsize, msize1, msize2, tsize
    //Take inventory and get sizes
    size_t hasrownames = (in->names && in->names->rowct) ? 1 : 0;
    size_t hascolnames = in->names &&
                    (in->names->vector || in->names->colct || in->names->textct);
    size_t hasweights = (in->weights != NULL);

    size_t outsize_r = GSL_MAX(in->matrix ? in->matrix->size1 : 0, in->vector ?
in->vector->size: 0); outsize_r = GSL_MAX(outsize_r, in->textsize[0]); outsize_r = GSL_MAX(outsize_r, wsize); if (in->names) outsize_r = GSL_MAX(outsize_r, in->names->rowct); outsize_r += hascolnames; size_t outsize_c = msize2; outsize_c += in->textsize[1]; outsize_c += (vsize>0); outsize_c += (wsize>0); outsize_c += hasrownames + hasweights; //Write to the printout data set. apop_data *printout = apop_text_alloc(NULL , outsize_r, outsize_c); if (hasrownames) for (size_t i=0; i < in->names->rowct; i ++) apop_text_set(printout, i + hascolnames, 0, "%s", in->names->row[i]); for (size_t i=0; i < vsize; i ++) //vsize may be zero. apop_text_set(printout, i + hascolnames, hasrownames, "%g", gsl_vector_get(in->vector, i)); for (size_t i=0; i < msize1; i ++) //msize1 may be zero. for (size_t j=0; j < msize2; j ++) apop_text_set(printout, i + hascolnames, hasrownames + (vsize >0)+ j, "%g", gsl_matrix_get(in->matrix, i, j)); if (in->textsize[0]) for (size_t i=0; i < in->textsize[0]; i ++) for (size_t j=0; j < in->textsize[1]; j ++) apop_text_set(printout, i + hascolnames, hasrownames + (vsize>0)+ msize2 + j, "%s", in->text[i][j]); if (hasweights) for (size_t i=0; i < in->weights->size; i ++) apop_text_set(printout, i + hascolnames, outsize_c-1, "%g", gsl_vector_get(in->weights, i)); //column names if (hascolnames){ if (vsize && in->names->vector) apop_text_set(printout, 0 , hasrownames, "%s", in->names->vector); if (msize2 && in->names) for (size_t i=0; i < in->names->colct; i ++) apop_text_set(printout, 0 , hasrownames + (vsize>0) + i, "%s", in->names->col[i]); if (in->textsize[1] && in->names) for (size_t i=0; i < in->names->textct; i ++) apop_text_set(printout, 0 , hasrownames + (vsize>0) + msize2 + i, "%s", in->names->text[i]); if (hasweights) apop_text_set(printout, 0 , outsize_c-1, "Weights"); } //get column sizes int colsizes[outsize_c]; for (size_t i=0; i < outsize_c; i ++){ colsizes[i] = strlen(printout->text[0][i]); for (size_t j=1; j < outsize_r; j ++) colsizes[i] 
            = GSL_MAX(colsizes[i], strlen(printout->text[j][i]));
    }

    //Finally, print
    if (in->names && in->names->title && strlen(in->names->title))
        printf("\t%s\n\n", in->names->title);
    for (size_t j=0; j < outsize_r; j ++){
        for (size_t i=0; i < outsize_c; i ++){
            white_pad(colsizes[i] - strlen(printout->text[j][i]) + 1);//one spare space.
            printf("%s", printout->text[j][i]);
            if (i > 0 && i< outsize_c-1)
                printf(" %s ", apop_opts.output_delimiter);
        }
        printf("\n");
    }

    if (in->more) {
        printf("\n");
        apop_data_show(in->more);
    }
    apop_data_free(printout);
}

void p_fn(FILE * f, double data){
    if (data == (int) data)
        fprintf(f, "% 5i", (int) data);
    else
        fprintf(f, "% 5f", data);
}

static void print_core_v(const gsl_vector *data, char *separator, Output_declares){
    FILE *f = output_pipe;
    if (!data) fprintf(f, "NULL\n");
    else {
        for (size_t i=0; i<data->size; i++){
            p_fn(f, gsl_vector_get(data, i));
            if (i< data->size -1) fprintf(f, "%s", separator);
        }
        fprintf(f,"\n");
    }
    if (output_name) fclose(f);
}

/** Print a vector to the screen, a file, a pipe, or the database.

\li See \ref apop_prep_output for more on how printing settings are set.
\li For example, the default for \ref apop_opts_type "apop_opts.output_delimiter" is a tab, which puts the vector on one line, but setting apop_opts.output_delimiter to "\n" would print the vector vertically.
\li See also \ref Legi for more details and examples.
\li This function uses the \ref designated syntax for inputs.
\ingroup all_public
*/
#ifdef APOP_NO_VARIADIC
void apop_vector_print(gsl_vector *data, Output_declares){
#else
apop_varad_head(void, apop_vector_print){
    gsl_vector *apop_varad_var(data, NULL);
    Dispatch_output
    apop_vector_print_base(data, Output_vars);
}

void apop_vector_print_base(gsl_vector *data, Output_declares){
#endif
    print_core_v(data, apop_opts.output_delimiter, Output_vars);
}

/* currently removed from the documentation.
Dump a gsl_vector to the screen.

You may want to set \ref apop_opts_type "apop_opts.output_delimiter".
\li See \ref apop_prep_output for more on how printing settings are set. \li See also \ref Legi for more details and examples. \li This function uses the \ref designated syntax for inputs. */ void apop_vector_show(const gsl_vector *data){ print_core_v(data, apop_opts.output_delimiter, NULL, stdout, 's', 0); } static int get_max_strlen(char **names, size_t len){ int max = 0; for (int i=0; i< len; i++) max = GSL_MAX(max, strlen(names[i])); return max; } //On screen, display a pipe, else use the usual output delimiter. static void a_pipe(FILE *f, char displaytype){ if (displaytype == 's') fprintf(f, " | "); else fprintf(f, "%s", apop_opts.output_delimiter); } static void apop_data_print_core(const apop_data *data, FILE *f, char displaytype){ if (!data){ fprintf(f, "NULL\n"); return; } int i, j, L = 0, start = (data->vector)? -1 : 0, end = (data->matrix)? data->matrix->size2 : 0, rowend = (data->matrix)? data->matrix->size1 : (data->vector) ? data->vector->size : data->text ? data->textsize[0] : -1; if (data->names && data->names->title && strlen(data->names->title)) fprintf(f, "\t%s\n\n", data->names->title); if (data->names && data->names->rowct) L = get_max_strlen(data->names->row, data->names->rowct); if (data->names && data->names->rowct && (data->names->vector || data->names->colct || data->names->textct)){ if ((apop_opts.db_name_column || *apop_opts.db_name_column=='\0') || !strcmp(apop_opts.db_name_column, "row_names")) fprintf(f, "%*s ", L+2, " "); else { fprintf(f, "%s", apop_opts.db_name_column); a_pipe(f, displaytype); } } if (data->vector && data->names && data->names->vector){ fprintf(f, "%s", data->names->vector); } if (data->matrix){ if (data->vector && data->names && data->names->colct){ fprintf(f, "%c ", data->names->vector ? 
' ' : '\t' ); a_pipe(f, displaytype); } if (data->names) for(i=0; i< data->names->colct; i++){ if (i < data->names->colct -1) fprintf(f, "%s%s", data->names->col[i], apop_opts.output_delimiter); else fprintf(f, "%s", data->names->col[i]); } } if (data->textsize[1] && data->names && data->names->textct){ if ((data->vector && data->names && data->names->vector) || (data->matrix && data->names->colct)) a_pipe(f, displaytype); if (data->names) for(i=0; i< data->names->textct; i++){ if (i < data->names->textct -1) fprintf(f, "%s%s", data->names->text[i], apop_opts.output_delimiter); else fprintf(f, "%s", data->names->text[i]); } } if(data->names && (data->names->vector || data->names->colct || data->names->textct)) fprintf(f, "\n"); for(j=0; j< rowend; j++){ if (data->names && data->names->rowct > j) fprintf(f, "%*s%s", L+2, data->names->row[j], apop_opts.output_delimiter); for(i=start; i< end; i++){ if ((i < 0 && j < data->vector->size) || (i>= 0 && j < data->matrix->size1 && i < data->matrix->size2)) p_fn(f, apop_data_get(data, j, i)); else fprintf(f, " "); if (i==-1 && data->matrix) a_pipe(f, displaytype); if (i < end-1) fprintf(f, "%s", apop_opts.output_delimiter); } if (data->text){ if (data->vector || data->matrix) a_pipe(f, displaytype); if (j < data->textsize[0]) for(i=0; i< data->textsize[1]; i++){ fprintf(f, "%s", data->text[j][i]); if (i < data->textsize[1]-1) fprintf(f, "%s", apop_opts.output_delimiter); } } if (data->weights && j < data->weights->size){ a_pipe(f, displaytype); p_fn(f, data->weights->data[j]); } fprintf(f, "\n"); } } /** Print an \ref apop_data set to a file, the database, or the screen, as determined by the \c .output_type. \li See \ref apop_prep_output for more on how printing settings are set. \li See \ref Legi for more details and examples. \li See \ref sqlsec for notes on writing an \ref apop_data set to the database. \li This function uses the \ref designated syntax for inputs. 
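
When \c .output_type is left unset, \ref apop_prep_output infers it from which of \c .output_name and \c .output_pipe were given. The following standalone sketch restates only that decision, as read from the body of \c apop_prep_output; \c infer_output_type is a hypothetical name introduced for this illustration and is not part of the library.

```c
#include <stdio.h>

// Return the output type that would be in effect for the given inputs:
// 's' = screen, 'f' = file, 'p' = pipe; an explicit type is passed through.
char infer_output_type(char const *output_name, FILE *output_pipe, char output_type){
    if (output_type) return output_type;          // an explicit choice always wins
    if (!output_name && !output_pipe) return 's'; // nothing given: the screen
    if (output_name && !output_pipe)  return 'f'; // only a name: write to that file
    if (!output_name && output_pipe)  return 'p'; // only a FILE*: use it as a pipe
    return 0; // both given and no type: the source leaves the type unset
}
```

So a call that supplies only \c .output_name is treated as file output without also having to set \c .output_type='f'.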
\ingroup all_public */ #ifdef APOP_NO_VARIADIC void apop_data_print(const apop_data *data, Output_declares){ #else apop_varad_head(void, apop_data_print){ const apop_data * apop_varad_var(data, NULL); Dispatch_output apop_data_print_base(data, Output_vars); } void apop_data_print_base(const apop_data *data, Output_declares){ #endif if (output_type == 'd'){ if (output_append == 'w') apop_table_exists(output_name, 'd'); apop_data_to_db(data, output_name, output_append); return; } apop_data_print_core(data, output_pipe, output_type); if (data && data->more) { output_append='a'; apop_data_print(data->more, Output_vars); } if (output_name) fclose(output_pipe); } /** Print a \c gsl_matrix to the screen, a file, a pipe, or a database table. \li See \ref apop_prep_output for more on how printing settings are set. \li See also \ref Legi for more details and examples. \li This function uses the \ref designated syntax for inputs. \ingroup all_public */ #ifdef APOP_NO_VARIADIC void apop_matrix_print(const gsl_matrix *data, Output_declares){ #else apop_varad_head(void, apop_matrix_print){ const gsl_matrix *apop_varad_var(data, NULL); Dispatch_output apop_matrix_print_base(data, Output_vars); } void apop_matrix_print_base(const gsl_matrix *data, Output_declares){ #endif if (output_type == 'd'){ Apop_assert_c(data, , 1, "You sent me a NULL matrix. No database table will be created."); } else if (!data){ fprintf(output_pipe, "NULL\n"); return; } apop_data_print(&(apop_data){.matrix=(gsl_matrix*)data}, Output_vars); //cheating on the const qualifier } //leaving this undocumented for now. 
void apop_matrix_show(const gsl_matrix *data){ apop_data_print_core(&(apop_data){.matrix=(gsl_matrix*)data}, stdout, 's'); } apophenia-1.0+ds/apop_rake.c000066400000000000000000000726121262736346100160470ustar00rootroot00000000000000 //#define __USE_POSIX //for strtok_r #include "apop_internal.h" #include #include void xprintf(char **q, char *format, ...); //in apop_conversions.c /* This is the internal documentation for apop_rake(). I assume you've read the usage documentation already (if you haven't, it's below, just above apop_rake). I started with: Algorithm AS 51 Appl. Statist. (1972), vol. 21, p. 218 original (C) Royal Statistical Society 1972 But at this point, I'm not sure if any of the original code remains at all, because the sparse method and the full-matrix method differ so much. There are two phases to the process: the SQL part and the in-memory table part. The file here begins with the indexing of the in-memory table. The index on the in-memory table is somewhat its own structure, with get/set functions and so forth, so it has its own section, listed first in this file. After that, we get to the raking function (c_loglin) and its supporting functions, and then the apop_rake function itself. The SQL work itself is inside apop_rake, and c_loglin is called at the end to do the raking. */ /* This section indexes a PMF-type apop_data struct. The index is held in a 2-d grid of nodes. The index's first index is the dimension, and will hold one column for each column in the original data set. The index's second index goes over all of the values in the given column. PS: this was intended to one day become a general-purpose index for PMFs; making that happen remains on the to-do list. mnode[i] = a dimension row mnode[i][j] = a value in a given dimension mnode[i][j].margin_ptrs = a list of all of the rows in the data set with the given value. mnode[i][j].margin_ptrs[k] = the kth item for the value. 
*/ /** \cond doxy_ignore */ typedef struct { double val; bool *margin_ptrs, *fit_ptrs; } mnode_t; typedef void(*index_apply_f)(mnode_t * const * const, int, void*); /** \endcond */ static int find_val(double findme, mnode_t *nodecol){ for (int i=0; nodecol[i].val <= findme || gsl_isnan(findme); i++) if (nodecol[i].val == findme || (gsl_isnan(findme) && gsl_isnan(nodecol[i].val))) return i; return -1; } //returns -1 if the given row/col doesn't exist. int index_add_node(mnode_t **mnodes, size_t dim, size_t row, double val, bool is_margin){ int index = find_val(val, mnodes[dim]); if (index == -1) return -1; if (is_margin) mnodes[dim][index].margin_ptrs[row] = true; else mnodes[dim][index].fit_ptrs[row] = true; return 0; } mnode_t **index_generate(apop_data const *in, apop_data const *in2){ size_t margin_ct = in->matrix->size2; size_t margin_rows = in->matrix->size1; size_t fit_rows = in2->matrix->size1; mnode_t **mnodes = malloc(sizeof(mnode_t*)*(margin_ct+1)); //allocate every node for(size_t i=0; i < margin_ct; i ++){ gsl_vector *vals = apop_vector_unique_elements(Apop_cv(in, i)); mnodes[i] = malloc(sizeof(mnode_t)*(vals->size+1)); for(size_t j=0; j < vals->size; j ++) mnodes[i][j] = (mnode_t) {.val = gsl_vector_get(vals, j), .margin_ptrs = calloc(margin_rows, sizeof(bool)), .fit_ptrs = calloc(fit_rows, sizeof(bool)) }; mnodes[i][vals->size] = (mnode_t) {.val = GSL_POSINF}; //end-of-array sentinel gsl_vector_free(vals); } mnodes[margin_ct] = NULL; //end-of-array sentinel //put data from the matrix into the right pigeonhole for(size_t i=0; i < margin_ct; i++){ for(size_t j=0; j < in->matrix->size1; j++) Apop_stopif(index_add_node(mnodes, i, j, apop_data_get(in, j, i), true) == -1, return mnodes, 0, "I can't find a value, %g, that should've already been inserted.", apop_data_get(in, j, i)); for(size_t j=0; j < in2->matrix->size1; j++) //these values may not be present, in which case ignore them. 
index_add_node(mnodes, i, j, apop_data_get(in2, j, i), false); } return mnodes; } void index_free(mnode_t **in){ for (int i=0; in[i]; i++){ for (int j=0; !isinf(in[i][j].val); j++){ free(in[i][j].margin_ptrs); free(in[i][j].fit_ptrs); } free(in[i]); } free(in); } /* The next two functions are a recursive iteration over all combinations of values for a given index (which may be the whole thing or a margin). index_foreach does the initialization of some state variables; value_loop does the odometer-like recursion. At each step, value_loop will either increment the current dimension's index, or if the current index is at its limit, will loop back to zero on this index and then set the next dimension as active. */ static void value_loop(mnode_t *icon[], int *indices, mnode_t **values, int this_dim, index_apply_f f, int *ctr, void *args){ while (!isinf(icon[this_dim][indices[this_dim]].val)){ values[this_dim] = &icon[this_dim][indices[this_dim]]; if (icon[this_dim+1]){ indices[this_dim+1]=0; value_loop(icon, indices, values, this_dim+1, f, ctr, args); } else{ f(values, *ctr, args); (*ctr)++; } indices[this_dim]++; } } /* Inputs: the index to be iterated over, the function to apply to each combination, and a void* with other sundry arguments to the function. I'll apply the function to an mnode_t* with a single mnode_t for each dimension. */ void index_foreach(mnode_t *index[], index_apply_f f, void *args){ int j, ctr=0; for (j=0; index[j]; ) j++; mnode_t *values[j+1]; for (j=0; index[j]; j++) values[j] = &index[j][0]; values[j] = malloc(sizeof(mnode_t)); values[j]->val = GSL_POSINF; int indices[j]; memset(indices, 0, sizeof(int)*j); value_loop(index, indices, values, 0, f, &ctr, args); free(values[j]); } /* Check whether an observation falls into all of the given margins. This is for a contrast. We've already calculated the in/out vector for each value on each margin, and now need to && each dimension into one vector. 
\param index Is actually a partial index: for each dimension, there should be only one value. Useful for the center of an index_foreach loop. \param d Should already be allocated to the right size, may be filled with garbage. \return d will be zero or one to indicate which rows of the indexed data set meet all criteria. */ void index_get_element_list(mnode_t *const * index, bool *d, size_t len, bool is_margin){ memcpy(d, is_margin ? index[0]->margin_ptrs: index[0]->fit_ptrs, len * sizeof(bool)); for(size_t i=1; !isinf(index[i]->val); i++){ bool *ptr_list = is_margin ? index[i]->margin_ptrs: index[i]->fit_ptrs; for(size_t j=0; j < len; j++) d[j] &= ptr_list[j]; } } ////End index.c /** \cond doxy_ignore */ typedef struct { const apop_data *indata; apop_data *fit; size_t **elmtlist; size_t *elmtlist_sizes; gsl_vector *indata_values; mnode_t **index; size_t ct, al; double *maxdev; } rake_t; /** \endcond */ static void rakeinfo_grow(rake_t *r){ r->al = (r->al+1)*2; r->elmtlist = realloc(r->elmtlist , sizeof(size_t*) * r->al); r->elmtlist_sizes = realloc(r->elmtlist_sizes, sizeof(size_t) * r->al); r->indata_values = apop_vector_realloc(r->indata_values, r->al); } static void rakeinfo_free(rake_t r){ #define free_and_clear(in) free(in), (in) = NULL for (int i=0; i < r.ct; i++) free_and_clear(r.elmtlist[i]); free_and_clear(r.index); //these are just pointers to the main index. free_and_clear(r.elmtlist_sizes); free_and_clear(r.elmtlist); gsl_vector_free(r.indata_values); r.indata_values= NULL; } static void scaling(size_t const *elmts, size_t const n, gsl_vector *weights, double const in_sum, double *maxdev){ double fit_sum = 0, out_sum=0; for(size_t i=0; i < n; i ++) fit_sum += weights->data[elmts[i]]; if (!fit_sum) return; //can happen if init table is very different from margins. 
for(size_t i=0; i < n; i ++){ out_sum += weights->data[elmts[i]] *= in_sum/fit_sum; } *maxdev = GSL_MAX(fabs(fit_sum - out_sum), *maxdev); } /* Given one set of values from one margin, do the actual scaling. On the first pass, this function takes notes on each margin's element list and total in the original data. Later passes just read the notes and call the scaling() function above. */ static void one_set_of_values(mnode_t *const * const margincons, int ctr, void *in){ rake_t *r = in; size_t marginsize = r->indata->matrix->size1; size_t fitsize = r->fit->matrix->size1; bool melmts[marginsize]; bool fitelmts[fitsize]; bool first_pass = false; double in_sum; if (ctr < r->ct) in_sum = gsl_vector_get(r->indata_values, ctr); else { r->ct++; if (ctr >= r->al || r->al==0) rakeinfo_grow(r); index_get_element_list(margincons, melmts, marginsize, true); index_get_element_list(margincons, fitelmts, fitsize, false); in_sum = 0; int n=0, al=0; r->elmtlist[ctr] = NULL; //use margin index to get total for this margin. for(int m=0; m < marginsize; m++) if (melmts[m]) in_sum += r->indata->weights->data[m]; //use fit index to get elements involved in this margin for(int m=0; m < fitsize; m++) if (fitelmts[m]){ if (n >= al) { al = (al+1)*2; r->elmtlist[ctr] = realloc(r->elmtlist[ctr], al*sizeof(size_t)); } r->elmtlist[ctr][n++] = m; } r->elmtlist_sizes[ctr] = n; r->indata_values->data[ctr] = in_sum; first_pass = true; } if (!r->elmtlist_sizes[ctr]) return; if (!first_pass && !in_sum) return; scaling(r->elmtlist[ctr], r->elmtlist_sizes[ctr], r->fit->weights, in_sum, r->maxdev); } /* For each configuration margin, for each combination for that margin, call the above one_set_of_values() function. 
*/ static void main_loop(int config_ct, rake_t *rakeinfo, int k){ for (size_t i=0; i < config_ct; i ++) if (k==1) index_foreach(rakeinfo[i].index, one_set_of_values, rakeinfo+i); else for(int m=0; m < rakeinfo[i].ct; m++) one_set_of_values(NULL, m, rakeinfo+i); } /* Following the FORTRAN, 1 contrast ==> icon. Here, icon will be a subset of the main index including only the columns pertaining to a given margin. */ void generate_margin_index(mnode_t **icon, const apop_data *margin, mnode_t **mainindex, size_t col){ gsl_vector *iconv = Apop_cv(margin, col); int ct = 0; for (int j=0; mainindex[j]; j++) if (gsl_vector_get(iconv, j)) icon[ct++] = mainindex[j]; icon[ct] = NULL; } void cleanup(mnode_t **index, rake_t rakeinfos[], int contrast_ct){ for(size_t i=0; i < contrast_ct; i++) rakeinfo_free(rakeinfos[i]); index_free(index); } /* \param config An nvar x ncon matrix; see below. [as in the original, but not squashed into 1-D.] \param indata the actual table. I use a PMF format. \param fit the starting table. Same size as table. \param maxdev maximum deviation; stop when this is met. \param maxit maximum iterations; stop when this is met. Re: the contrast array: Each _column_ is a contrast. Put a one in each col involved in the contrast. E.g., for (1,2), (2,3): 1 0
1 1
0 1 */
static void c_loglin(const apop_data *config, const apop_data *indata, apop_data *fit, double tolerance, int maxit) {
    mnode_t ** index = index_generate(indata, fit);

    /* Make a preliminary adjustment to obtain the fit to an empty configuration list */
    //fit->weights is either all 1 (for synthetic data) or the initial counts from the db.
    double x = apop_sum(indata->weights);
    double y = apop_sum(fit->weights);
    gsl_vector_scale(fit->weights, x/y);

    int contrast_ct =config && config->matrix ? config->matrix->size2 : 0;
    rake_t rakeinfos[contrast_ct];
    double maxdev=0;
    for(size_t i=0; i < contrast_ct; i ++){
        gsl_vector *iconv = Apop_cv(config, i);
        rakeinfos[i] = (rake_t) {
            .indata = indata,
            .fit = fit,
            .maxdev = &maxdev, //one value shared across dimensions
            //the margin index is a NULL-terminated list of pointers into the main index.
            .index = malloc(sizeof(mnode_t*) *(apop_sum(iconv)+1)),
            //others are NULL, to be filled in as we go.
        };
        generate_margin_index(rakeinfos[i].index, config, index, i);
    }

    int k;
    gsl_vector *previous = apop_vector_copy(fit->weights);
    for (k = 1; k <= maxit; ++k) {
        maxdev = 0;
        main_loop(contrast_ct, rakeinfos, k);
        Apop_notify(3, "Data set after round %i of raking.\n", k);
        if (apop_opts.verbose >=3) apop_data_print(fit, .output_pipe=apop_opts.log_file);
        if (maxdev < tolerance) break;// Normal termination
        gsl_vector_memcpy(previous, fit->weights);
    }
    cleanup(index, rakeinfos,contrast_ct);
    gsl_vector_free(previous);
    //If the loop ran to completion without a break, k is now maxit+1.
    Apop_stopif(k > maxit, fit->error='c', 0, "Maximum number of iterations reached.");
}

char *pipe_parse = "[ \n\t]*([^| \n\t]+)[ \n\t]*([|]|$)";

apop_data **generate_list_of_contrasts(char *const *contras_in, int contrast_ct){
    apop_data** out = malloc(sizeof(apop_data*)* contrast_ct);
    for (int i=0; i< contrast_ct; i++) {
        apop_regex(contras_in[i], pipe_parse, out+i);
        apop_text_alloc(out[i], *out[i]->textsize, 1);
    }
    return out;
}

apop_data *get_var_list(char const *margin_table, char const *count_col, char const *init_count_col, char * const *varlist, int *var_ct){
    apop_data *all_vars_d=NULL;
    if (!varlist){
Apop_stopif(apop_opts.db_engine=='m', apop_return_data_error(y), 0, "I need a list of the full set of variable " "names sent as .varlist= (char *[]){\"var1\", \"var2\",...}"); //use SQLite's table_info, then shift the second col to the first. all_vars_d = apop_query_to_text("PRAGMA table_info(%s)", margin_table); int ctr=0; for (int i=0; i< all_vars_d->textsize[0]; i++) if (all_vars_d->text[i][1] && (count_col ? strcmp(all_vars_d->text[i][1], count_col) : 1) && (init_count_col ? strcmp(all_vars_d->text[i][1], init_count_col): 1)) apop_text_set(all_vars_d, 0, ctr++, all_vars_d->text[i][1]); apop_text_alloc(all_vars_d, 1, ctr); *var_ct = ctr; } else { all_vars_d = apop_text_alloc(NULL, 1, *var_ct); for (int i=0; i<*var_ct; i++) apop_text_set(all_vars_d, 0, i, varlist[i]); } Apop_stopif(!all_vars_d, apop_return_data_error(y), 0, "Trouble getting/parsing the list of variables."); return all_vars_d; } static int get_var_index(char *const *all_vars, int len, char *findme){ for (int i=0; i< len; i++) if (all_vars[i] && !strcmp(all_vars[i], findme)) return i; Apop_notify(0, "I couldn't find %s in the full list of variables. Returning -1.", findme); return -1; } void nan_to_zero(double *in){ if (gsl_isnan(*in)) *in=0;} double nudge_zeros(apop_data *in, void *nudge){ if (!in->weights->data[0]) in->weights->data[0] = *(double*)nudge; return 0; } int find_in_allvars(char const *in, apop_data const *allvars){ for (int i=0; i< allvars->textsize[1]; i++) if (!strcmp(allvars->text[0][i], in)) return i; Apop_stopif(1, return -1, 0, "Variable in your contrast list [%s] not in " "your list of all variables.", in); } /* If you are fully synthesizing or nudging zero cells, then calculate the set of cells that could be nonzero (given the margin information). * If using only an init_table, then that's your list of nonzero cells right there. 
* If you have an init_table but want to nudge the zero cells up a bit, then you need this, and have to merge with the init_table * If you have no init_table, then this list is all the cells. */ static int setup_nonzero_contrast(char const *margin_table, apop_data const * allvars, int run_number, char const *list_of_fields, apop_data *const* contras, int contrast_ct, double nudge, char const * structural_zeros, bool have_init_table){ char *q; bool used[allvars->textsize[1]]; memset(used, 0, sizeof(bool)*allvars->textsize[1]); Asprintf(&q, "create table apop_zerocontrasts_%i as select %s, %g from\n", run_number, apop_text_paste(allvars, .between=","), nudge); for (int i=0; i < contrast_ct; i++){ xprintf(&q, "%s%s (select distinct %s from %s) \n", q, i>0 ? "natural join" : "", apop_text_paste(contras[i], .between=","), margin_table); for (int j=0; j<*contras[i]->textsize; j++){ int val=find_in_allvars(*contras[i]->text[j], allvars); Apop_stopif(val==-1, return -1, 0, "Error setting up contrasts"); used[val]=true; } } //make sure all variables are joined in. for (int i=0; i < allvars->textsize[1]; i++) if (!used[i]) xprintf(&q, "%s%s (select distinct %s from %s)\n", q, (!contrast_ct && i==0) ? "" : "natural join", allvars->text[0][i], margin_table); if (structural_zeros) xprintf(&q, "%s where not (%s)\n", q, structural_zeros); if (have_init_table){ //Keep out margins with values for now; join them in below. xprintf(&q, "%s except\nselect %s, %g from %s", q, list_of_fields, nudge, margin_table); } Apop_stopif(apop_query("%s", q), return 1, 0, "query failed."); free(q); return 0; } /** Fit a log-linear model via iterative proportional fitting, aka raking. Raking has many uses. The Modeling with Data blog presents a series of discussions of uses of raking, including some worked examples. Or see Wikipedia for an overview of Log linear models, aka Poisson regressions. 
One approach toward log-linear modeling is a regression form; let there be four categories, A, B, C, and D, from which we can produce a model positing, for example, that cell count is a function of a form like \f$g_1(A) + g_2(BC) + g_3(CD)\f$. In this case, we would assign a separate coefficient to every possible value of A, every possible value of (B, C), and every value of (C, D). Raking is the technique that searches for that large set of parameters.

The combinations of categories that are considered to be relevant are called \em contrasts, after ANOVA terminology of the 1940s.

The other constraint on the search is the set of structural zeros, which are cells that you know can never be non-zero, due to field-specific facts about the variables. For example, U.S. Social Security payments are available only to those age 65 or older, so "age <65 and gets_soc_security=1" is a structural zero.

Because there is one parameter for every combination, there may be millions of parameters to estimate, so the search to find the most likely value requires some attention to technique. For over half a century, the consensus method for searching has been raking, which iteratively draws each category closer to the mean in a somewhat simple manner (this was first developed circa 1940 and had to be feasible by hand), but which is guaranteed to eventually arrive at the maximum likelihood estimate for all cells.

Another complication is that the table is invariably sparse. One can easily construct tables with millions of cells, but the corresponding data set may have only a few thousand observations.

This function uses the database to resolve the sparseness problem. It constructs a query requesting all combinations of categories that could possibly be non-zero after raking, given all of the above constraints. Then, raking is done using only that subset. This means that the work is done on a number of cells proportional to the number of data points, not to the full cross of all categories.
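As a rough illustration of the scaling loop described above, here is a minimal dense sketch for a 2x2 table. It is not this function's implementation: \c rake_2x2 and its arguments are hypothetical names, and the real code works on a sparse list of cells drawn from the database rather than on a full grid.

```c
#include <assert.h>

// Hypothetical helper, for illustration only.
// Rake a 2x2 table in place until its row and column sums match the targets.
// Returns the number of rounds used, or -1 if max_rounds was reached first.
int rake_2x2(double cell[2][2], const double row_target[2],
             const double col_target[2], double tolerance, int max_rounds){
    assert(tolerance > 0 && max_rounds > 0);
    for (int round = 1; round <= max_rounds; round++){
        double maxdev = 0;
        // Scale each row so that it sums to its target.
        for (int i = 0; i < 2; i++){
            double sum = cell[i][0] + cell[i][1];
            if (sum == 0) continue;   // a zero row always scales to zero
            double dev = sum > row_target[i] ? sum - row_target[i] : row_target[i] - sum;
            if (dev > maxdev) maxdev = dev;
            for (int j = 0; j < 2; j++) cell[i][j] *= row_target[i]/sum;
        }
        // Scale each column likewise.
        for (int j = 0; j < 2; j++){
            double sum = cell[0][j] + cell[1][j];
            if (sum == 0) continue;
            double dev = sum > col_target[j] ? sum - col_target[j] : col_target[j] - sum;
            if (dev > maxdev) maxdev = dev;
            for (int i = 0; i < 2; i++) cell[i][j] *= col_target[j]/sum;
        }
        // Stop once no margin needed more than a negligible adjustment.
        if (maxdev < tolerance) return round;
    }
    return -1;
}
```

Starting from an all-ones table with row targets {30, 70} and column targets {40, 60}, the cells settle at 12, 18, 28, and 42, which is the table closest to all-ones that still matches both margins.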
Set apop_opts.verbose to 2 or greater to show the query on \c stderr.

\li One could use raking to generate `fully synthetic' data: start with observation-level data in a margin table. Begin the raking with a starting data set of all-ones. Then rake until the all-ones set transforms into something that conforms to the margins and (if any) structural zeros. You now have a data set which matches the marginal totals but does not use any other information from the observation-level data. If you do not specify an .init_table, then an all-ones default table will be used.

\param margin_table The name of the table in the database to use for calculating the margins. The table should have one observation per row. (No default)
\param var_list The full list of variables to search. A list of strings, e.g., (char *[]){"var1", "var2", ..., "var15"}
\param var_ct The count of the full list of variables to search.
\param contrasts The contrasts describing your model. Like the \c var_list input, a list of strings like (char *[]){"var1", "var7", "var13"}. Each contrast is a pipe-delimited list of variable names. (No default)
\param contrast_ct The number of contrasts in the list of contrasts. (No default)
\param structural_zeros A SQL clause indicating combinations that can never take a nonzero value. This will go into a \c where clause, so anything you could put there is OK, e.g. "age <65 and gets_soc_security=1 or age <15 and married=1". Your margin data is not checked for structural zeros. (default: no structural zeros)
\param max_iterations Number of rounds of raking at which the algorithm halts. (default: 1000)
\param tolerance I calculate the change for each cell from round to round; if the largest cell change is smaller than this, I stop. (default: 1e-5)
\param count_col This column gives the count of how many observations are represented by each row. If \c NULL, each row represents one person.
(default: \c NULL)
\param init_table The default is to initially set all table elements to one and then rake from there. This is effectively the `fully synthetic' approach, which uses only the information in the margins and derives the data set closest to the all-ones data set that is consistent with the margins. Care is taken to maintain sparsity in this case. If you specify an \c init_table, then I will get the initial cell counts from it. (default: the fully-synthetic approach, using a starting point of an all-ones grid.)
\param init_count_col The column in \c init_table with the cell counts.
\param nudge There is a common hack of adding a small value to every zero entry, because a zero entry will always scale to zero, while a small value could eventually scale to anything. Recall that this function works on sparse sets, so I first select those cells that could possibly have a nonzero value given the observations, then I add \c nudge to any zero cells within that subset.

\return An \ref apop_data set where every row is a single combination of variable values and the \c weights vector gives the most likely value for each cell.
\exception out->error='i' Input was somehow wrong.
\exception out->error='c' Raking did not converge, reached max. iteration count.

\li Set apop_opts.verbose=3 to see the intermediate tables at the end of each round of raking.
\li If you want all cells to have nonzero value, then you can do that via pre-processing:
\code
apop_query("update data_table set count_col = 1e-3 where count_col = 0");
\endcode
\li This function is thread-safe. To make this happen, temp database tables are named using a number built with \c omp_get_thread_num.
\li This function uses the \ref designated syntax for inputs.
*/ #ifdef APOP_NO_VARIADIC apop_data * apop_rake(char const *margin_table, char * const*var_list, int var_ct, char * const *contrasts, int contrast_ct, char const *structural_zeros, int max_iterations, double tolerance, char const *count_col, char const *init_table, char const *init_count_col, double nudge){ #else apop_varad_head(apop_data *, apop_rake){ char const * apop_varad_var(margin_table, NULL); Apop_stopif(!margin_table, apop_return_data_error(i), 0, "I need the name of a table in the database that will be the data source."); Apop_stopif(!apop_table_exists(margin_table), apop_return_data_error(i), 0, "your margin_table, %s, doesn't exist in the database.", margin_table); char *const* apop_varad_var(var_list, NULL); int apop_varad_var(var_ct, 0); char *const * apop_varad_var(contrasts, NULL); //default to all vars? int apop_varad_var(contrast_ct, 0); Apop_stopif(contrasts&&!contrast_ct, apop_return_data_error(i), 0, "you gave me a list of contrasts but not the count. " "This is C--I can't count them myself. 
Please provide the count and re-run."); char const * apop_varad_var(structural_zeros, NULL); char const * apop_varad_var(count_col, NULL); int apop_varad_var(max_iterations, 1e3); double apop_varad_var(tolerance, 1e-5); char const * apop_varad_var(init_count_col, NULL); char const * apop_varad_var(init_table, NULL); Apop_stopif(init_table && !apop_table_exists(init_table), apop_return_data_error(i), 0, "your init_table, %s, doesn't exist in the database.", init_table); if (init_count_col && !init_table) init_table = margin_table; double apop_varad_var(nudge, 0); return apop_rake_base(margin_table, var_list, var_ct, contrasts, contrast_ct, structural_zeros, max_iterations, tolerance, count_col, init_table, init_count_col, nudge); } apop_data * apop_rake_base(char const *margin_table, char * const*var_list, int var_ct, char * const *contrasts, int contrast_ct, char const *structural_zeros, int max_iterations, double tolerance, char const *count_col, char const *init_table, char const *init_count_col, double nudge){ #endif #ifdef OpenMP int run_number = omp_get_thread_num(); #else int run_number = 0; #endif apop_data **contras = generate_list_of_contrasts(contrasts, contrast_ct); apop_data *all_vars_d = get_var_list(margin_table, count_col, init_count_col, var_list, &var_ct); Apop_stopif(all_vars_d->error, return all_vars_d, 0, "Trouble setting up the list of variables."); int tt = all_vars_d->textsize[0]; all_vars_d->textsize[0] = 1; //mask all but the first row char *list_of_fields = apop_text_paste(all_vars_d, .between=", "); if (nudge || !init_table){ char *tab; Asprintf(&tab, "apop_zerocontrasts_%i", run_number); apop_table_exists(tab, 'd'); free(tab); Apop_stopif(setup_nonzero_contrast(margin_table, all_vars_d, run_number, list_of_fields, contras, contrast_ct, (nudge ? 
nudge : 1), structural_zeros, init_table), apop_return_data_error(q), 0, "Couldn't calculate the set of nonzero cells."); } char *initt=NULL; //handle structural zeros via subquery //note that margin data may have invalid rows. if (init_table){ if (structural_zeros) Asprintf(&initt, "(select * from %s where not (%s))", init_table, structural_zeros); else Asprintf(&initt, "%s", init_table); } char *init_q, *pre_init_q = NULL; if (init_table){ char *countstr; if (init_count_col) Asprintf(&countstr, "sum(%s) as %s", init_count_col, init_count_col); else Asprintf(&countstr, "count(%s)", **all_vars_d->text); Asprintf(&init_q, "select %s, %s from %s group by %s", list_of_fields, countstr, initt, list_of_fields); free(countstr); } char *marginq, *cc; if (count_col) Asprintf(&cc, "sum(%s)", count_col); else cc = strdup("count(*)"); Asprintf(&marginq, "select %s, %s from %s\ngroup by %s", list_of_fields, cc, margin_table, list_of_fields); free(cc); free(initt); char *format=strdup("w"); for (int i =0 ; i< var_ct; i++) xprintf(&format, "m%s", format); apop_data *d, *contrast_grid; d = apop_query_to_mixed_data(format, "%s", marginq); Apop_stopif(!d || d->error, apop_return_data_error(q), 0, "This query:\n%s\ngenerated a blank or broken table.", marginq); free(marginq); if (pre_init_q) Apop_stopif(apop_query("%s", pre_init_q), apop_return_data_error(q), 0, "This query:\n%s\ngenerated a blank or broken table.", pre_init_q); apop_data *fit; if (init_table) { fit = (nudge) ? apop_query_to_mixed_data(format, "%s\nunion\nselect * from apop_zerocontrasts_%i ", init_q, run_number) : apop_query_to_mixed_data(format, "%s", init_q); Apop_stopif(!fit, apop_return_data_error(q), 0, "Query returned a blank table."); Apop_stopif(fit->error, apop_return_data_error(q), 0, "Query error."); } else { fit = apop_query_to_mixed_data(format, "select * from apop_zerocontrasts_%i ", run_number); gsl_vector_set_all(fit->weights, nudge ? 
nudge : 1); } free(format); apop_vector_apply(fit->weights, nan_to_zero); if (nudge) apop_map(fit, .fn_rp=nudge_zeros, .param=&nudge); contrast_grid = apop_data_calloc(var_ct, contrast_ct); for (int i=0; i< contrast_ct; i++) for (int j=0; j< contras[i]->textsize[0]; j++) apop_data_set(contrast_grid, get_var_index(*all_vars_d->text, all_vars_d->textsize[1], contras[i]->text[j][0]), i, 1); if (!init_table || nudge) for (int i=0; i< contrast_ct; i++) apop_data_free(contras[i]); c_loglin(contrast_grid, d, fit, tolerance, max_iterations); apop_data_free(d); all_vars_d->textsize[0] = tt; apop_data_free(all_vars_d); if (!init_table || nudge) apop_query("drop table apop_zerocontrasts_%i", run_number); apop_data_free(contrast_grid); return fit; } apophenia-1.0+ds/apop_regression.c000066400000000000000000000654471262736346100173150ustar00rootroot00000000000000 /** \file apop_regression.c Generally, if it assumes something is Normally distributed, it's here.*/ /* Copyright (c) 2006--2007 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include //lsearch; bsearch is in stdlib. /* For use by MLE, OLS, et al. Available for public use, but undocumented. */ void apop_estimate_parameter_tests (apop_model *est){ Nullcheck_p(est, ) if (!est->data) return; apop_data *ep = apop_data_add_page(est->info, apop_data_alloc(est->parameters->vector->size, 2), ""); apop_name_add(ep->names, "p value", 'c'); apop_name_add(ep->names, "confidence", 'c'); apop_name_stack(ep->names, est->parameters->names, 'r', 'r'); Get_vmsizes(est->data); //msize1, vsize int df = msize1 ? msize1 : vsize; df -= est->parameters->vector->size; df = df < 1 ? 1 : df; //some models aren't data-oriented. 
apop_data_add_named_elmt(est->info, "df", df); apop_data *one_elmt = apop_data_calloc(1, 1); gsl_vector *param_v = apop_data_pack(est->parameters); for (size_t i=0; i< est->parameters->vector->size; i++){ Apop_settings_add_group(est, apop_pm, .index=i); apop_model *m = apop_parameter_model(est->data, est); double zero = apop_cdf(one_elmt, m); apop_model_free(m); double conf = 2*fabs(0.5-zero); //parameter is always at 0.5 along a symmetric CDF apop_data_set(ep, i, .colname="confidence", .val=conf); apop_data_set(ep, i, .colname="p value", .val=1-conf); } gsl_vector_free(param_v); apop_data_free(one_elmt); } //Cut and pasted from the GNU std library documentation, modified to consider NaNs: static int compare_doubles (const void *a, const void *b) { const double *da = (const double *) a; const double *db = (const double *) b; if (gsl_isnan(*da)) return gsl_isnan(*db) ? 0 : 1; if (gsl_isnan(*db)) return -1; return (*da > *db) - (*da < *db); } typedef const char * ccp; static int strcmpwrap(const void *a, const void *b){ const ccp *aa = a; const ccp *bb = b; return strcmp(*aa, *bb); } /** Give me a vector of numbers, and I'll give you a sorted list of the unique elements. This is basically running select distinct datacol from data order by datacol, but without the aid of the database. \param v a vector of items \return a sorted vector of the distinct elements that appear in the input. \li NaNs (if any) appear at the end of the sort order. 
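A minimal sketch of the same contract (sorted, distinct, NaNs last) on a plain C array, for readers without GSL at hand. The names \c cmp_nan_last and \c sorted_unique are hypothetical, and the approach is a swapped-in one: it sorts once and then sweeps out duplicates, rather than inserting each value via \c lsearch and re-sorting.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical comparator, for illustration only.
// Order doubles ascending, with NaNs after every ordinary number.
// (x != x) is true only for NaN, so no math library is needed.
static int cmp_nan_last(const void *a, const void *b){
    double x = *(const double *)a, y = *(const double *)b;
    int x_nan = (x != x), y_nan = (y != y);
    if (x_nan || y_nan) return x_nan - y_nan;
    return (x > y) - (x < y);
}

// Sort the array in place, compact the distinct values to the front,
// and return how many distinct values there are.
size_t sorted_unique(double *v, size_t n){
    assert(v || n == 0);
    if (n == 0) return 0;
    qsort(v, n, sizeof(double), cmp_nan_last);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (cmp_nan_last(&v[i], &v[out-1]) != 0) v[out++] = v[i];
    return out;
}
```

Because the comparator treats any two NaNs as equal, repeated NaNs in the input collapse to a single trailing NaN in the output.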
\see apop_text_unique_elements */ gsl_vector * apop_vector_unique_elements(const gsl_vector *v){ size_t prior_elmt_ctr = 107; size_t elmt_ctr = 0; double *elmts = NULL; for (size_t i=0; i< v->size; i++){ if (prior_elmt_ctr != elmt_ctr) elmts = realloc(elmts, sizeof(double)*(elmt_ctr+1)); prior_elmt_ctr = elmt_ctr; double val = gsl_vector_get(v, i); lsearch(&val, elmts, &elmt_ctr, sizeof(double), compare_doubles); if (prior_elmt_ctr < elmt_ctr) qsort(elmts, elmt_ctr, sizeof(double), compare_doubles); } gsl_vector *out = apop_array_to_vector(elmts, elmt_ctr); free(elmts); return out; } /** Give me a column of text, and I'll give you a sorted list of the unique elements. This is basically running select distinct * from datacolumn, but without the aid of the database. \param d An \ref apop_data set with a text component \param col The text column you want me to use. \return An \ref apop_data set with a single sorted column of text, where each unique text input appears once. \see apop_vector_unique_elements */ apop_data * apop_text_unique_elements(const apop_data *d, size_t col){ char **tval; //first element for free size_t prior_elmt_ctr, elmt_ctr = 1; char **telmts = malloc(sizeof(char**)*2); telmts[0] = d->text[0][col]; for (int i=1; i< d->textsize[0]; i++){ prior_elmt_ctr = elmt_ctr; tval = &(d->text[i][col]); lsearch (tval, telmts, &elmt_ctr, sizeof(char*), strcmpwrap); if (prior_elmt_ctr < elmt_ctr){ qsort(telmts, elmt_ctr, sizeof(char*), strcmpwrap); telmts = realloc(telmts, sizeof(char**)*(elmt_ctr+1)); } } //pack and ship apop_data *out = apop_text_alloc(NULL, elmt_ctr, 1); for (int j=0; j< elmt_ctr; j++) apop_text_set(out, j, 0, telmts[j]); free(telmts); return out; } static char *apop_get_factor_basename(apop_data *d, int col, char type){ char *name; char *catname = d->names == NULL ? NULL : type == 't' && d->names && d->names->textct > col ? d->names->text[col] : col == -1 && d->names && d->names->vector ? 
          d->names->vector
        : col >=0 && d->names && d->names->colct > col ? d->names->col[col]
        : NULL;
    if (catname){
        Asprintf(&name, "%s", catname);
        return name;
    }
    if (type == 't'){
        Asprintf(&name, "text column %i", col);
        return name;
    }
    if (col == -1) Asprintf(&name, "vector");
    else           Asprintf(&name, "column %i", col);
    return name;
}

static char *make_catname (apop_data *d, int col, char type){
    char *name, *subname = apop_get_factor_basename(d, col, type);
    Asprintf(&name, "<categories for %s>", subname);
    free(subname);
    return name;
}

/** Factor names are stored in an auxiliary table with a name like "\<categories for your_var\>". Producing this name is annoying (and prevents us from eventually making it human-language independent), so use this function to get the list of factor names.

\param data The data set. (No default, must not be \c NULL)
\param col The column in the main data set whose name I'll use to check for the factor name list. Vector==-1. (default=0)
\param type If you are referring to a text column, use 't'. (default='d')

\return A pointer to the page in the data set with the given factor names.

\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_data_get_factor_names(apop_data *data, int col, char type){
#else
apop_varad_head(apop_data *, apop_data_get_factor_names){
    apop_data *apop_varad_var(data, NULL)
    Apop_stopif(!data, return NULL, 1, "You sent me a NULL data set. Returning NULL.");
    int apop_varad_var(col, 0)
    char apop_varad_var(type, 'd')
    return apop_data_get_factor_names_base(data, col, type);
}

 apop_data * apop_data_get_factor_names_base(apop_data *data, int col, char type){
#endif
    char *name = make_catname (data, col, type);
    apop_data *out = apop_data_get_page(data, name, .match='e');
    free(name);
    return out;
}

apop_data * create_factor_list(apop_data *d, int col, char type){
    //first, create an ordered list of unique elements.
    //Record that list for use in this function, and in a ->more page of the data set.
char *catname = make_catname(d, col, type); apop_data *factor_list; if (type == 't'){ factor_list = apop_data_add_page(d, apop_text_unique_elements(d, col), catname); size_t elmt_ctr = factor_list->textsize[0]; //awkward format conversion: factor_list->vector = gsl_vector_alloc(elmt_ctr); for (size_t i=0; i< factor_list->vector->size; i++) apop_data_set(factor_list, i, -1, i); } else { gsl_vector *delmts = apop_vector_unique_elements(Apop_cv(d, col)); factor_list = apop_data_add_page(d, apop_data_alloc(), catname); factor_list->vector = delmts; apop_text_alloc(factor_list, delmts->size, 1); for (size_t i=0; i< factor_list->vector->size; i++){ //shift to the text, for conformity with the more common text version. apop_text_set(factor_list, i, 0, "%g", gsl_vector_get(delmts, i)); } } free(catname); return factor_list; } /* Producing dummies consists of finding the index of element i, for all i, then setting (i, index) to one. Producing factors consists of finding the index and then setting (i, datacol) to index. Otherwise the work is basically identical. Also, add a ->more page to the input data giving the translation. */ static apop_data * dummies_and_factors_core(apop_data *d, int col, char type, int keep_first, int datacol, char dummyfactor, apop_data **factor_list){ if (!(*factor_list=apop_data_get_factor_names(d, col, type))) *factor_list = create_factor_list(d, col, type); Get_vmsizes((*factor_list)); //maxsize size_t elmt_ctr = maxsize; //copy the strings to a single list-of-strings apop_data *telmts = *(*factor_list)->textsize ? apop_data_transpose(*factor_list, .inplace='n'):NULL; gsl_vector *delmts = (*factor_list)->vector; //Now go through the input vector, and for row i find the posn of the vector's //name in the element list created above (j), then change (i,j) in //the dummy matrix to one. int s = type == 't' ? d->textsize[0] : (col >=0 ? d->matrix->size1 : d->vector->size); apop_data *out = (dummyfactor == 'd') ? 
apop_data_calloc(0, s, (keep_first!='n' ? elmt_ctr : elmt_ctr-1)) : d; size_t index; for (size_t i=0; i< s; i++){ if (type == 'd'){ double val = apop_data_get(d, i, col); size_t posn = (size_t)bsearch(&val, delmts->data, elmt_ctr, sizeof(double), compare_doubles); if (posn ) index = (posn - (size_t)delmts->data)/sizeof(double); else { index = elmt_ctr++; (*factor_list)->vector = apop_vector_realloc((*factor_list)->vector, elmt_ctr); gsl_vector_set((*factor_list)->vector, index, val); out->matrix = apop_matrix_realloc(out->matrix, out->matrix->size1, elmt_ctr); gsl_vector_set_zero(Apop_cv(out, index)); } } else { size_t posn = (size_t)bsearch(&(d->text[i][col]), *telmts->text, elmt_ctr, sizeof(char**), strcmpwrap); if (posn) index = (posn - (size_t)*telmts->text)/sizeof(char**); else { index = elmt_ctr++; *factor_list = apop_text_alloc(*factor_list, elmt_ctr, 1); apop_text_set(*factor_list, index, 0, d->text[i][col]); (*factor_list)->vector = apop_vector_realloc((*factor_list)->vector, elmt_ctr); apop_data_set(*factor_list, index, -1, index); telmts = apop_text_alloc(telmts, 1, elmt_ctr); apop_text_set(telmts, 0, index, d->text[i][col]); if (dummyfactor == 'd'){ out->matrix = apop_matrix_realloc(out->matrix, out->matrix->size1, out->matrix->size2+1); gsl_vector_set_zero(Apop_cv(out, out->matrix->size2-1)); } } } if (dummyfactor == 'd'){ if (keep_first!='n') gsl_matrix_set(out->matrix, i, index,1); else if (index > 0) //else don't keep first and index==0; throw it out. gsl_matrix_set(out->matrix, i, index-1, 1); } else apop_data_set(out, i, datacol, index); } //Add names: if (dummyfactor == 'd'){ char *basename = apop_get_factor_basename(d, col, type); for (size_t i = (keep_first!='n') ? 
0 : 1; i< elmt_ctr; i++){ char n[1000]; if (type =='d'){ sprintf(n, "%s dummy %g", basename, gsl_vector_get(delmts,i)); } else sprintf(n, "%s", telmts->text[0][i]); apop_name_add(out->names, n, 'c'); } } apop_data_free(telmts); return out; } /** A utility to make a matrix of dummy variables. You give me a single vector that lists the category number for each item, and I'll produce a matrix with a single one in each row in the column specified. After that, you have to decide what to do with the new matrix and the original data column. \li You can manually join the dummy data set with your main data, e.g.: \code apop_data *dummies = apop_data_to_dummies(main_regression_vars, .col=8, .type='t'); apop_data_stack(main_regression_vars, dummies, 'c', .inplace='y'); \endcode \li The .remove='y' option specifies that I should use \ref apop_data_rm_columns to remove the column used to generate the dummies. Implemented only for type=='d'. \li By specifying .append='y' or .append='e' I will run the above two lines for you. Your \ref apop_data pointer will not change, but its \c matrix element will be reallocated (via \ref apop_data_stack). \li By specifying .append='i', I will place the matrix of dummies in place, immediately after the data column you had specified. You will probably use this with .remove='y' to replace the single column with the new set of dummy columns. Bear in mind that if there are two or more dummy columns, adding columns will change subsequent column numbers; use \ref apop_name_find to find columns instead of giving an explicit column number. \li If .append='i' and you asked for a text column, I will append to the end of the table, which is equivalent to append='e'. \param d The data set with the column to be dummified (No default.) \param col The column number to be transformed; -1==vector (default = 0) \param type 'd'==data column, 't'==text column. 
(default = 't')
\param keep_first If \c 'n', return a matrix where each row has a one in the (column specified minus one). That is, the zeroth category is dropped, the first category has an entry in column zero, et cetera. If you don't know why this is useful, then this is what you need. If you know what you're doing and need something special, set this to \c 'y' and the first category won't be dropped. (default = \c 'n')
\param append If \c 'e' or \c 'y', append the dummy grid to the end of the original data matrix. If \c 'i', insert in place, immediately after the original data column. (default = \c 'n')
\param remove If \c 'y', remove the original data column. Not yet implemented for text columns. (default = \c 'n')

\return An \ref apop_data set whose \c matrix element is the one-zero matrix of dummies. If you used .append, then this is the main matrix. Also, I add a page named "\<categories for your_var\>" giving a reference table of names and column numbers (where your_var is the appropriate column heading).

\exception out->error=='a' allocation error
\exception out->error=='d' dimension error

\li Use \ref apop_data_get_factor_names to get the list of category names.
\li NaNs (if any) appear at the end of the sort order.
\li See \ref fact for further discussion.
\li See the documentation for \ref apop_logit for a sample linear model using this function.
\li This function uses the \ref designated syntax for inputs.

\see \ref apop_data_to_factors
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_data_to_dummies(apop_data *d, int col, char type, int keep_first, char append, char remove){
#else
apop_varad_head(apop_data *, apop_data_to_dummies){
    apop_data *apop_varad_var(d, NULL)
    Apop_stopif(!d, return NULL, 1, "You sent me a NULL data set for apop_data_to_dummies.
Returning NULL.");
    int apop_varad_var(col, 0)
    char apop_varad_var(type, 't')
    int apop_varad_var(keep_first, 'n')
    char apop_varad_var(append, 'n')
    char apop_varad_var(remove, 'n')
    if (remove =='y' && type == 't') Apop_notify(1, "Remove isn't implemented for text source columns yet.");
    return apop_data_to_dummies_base(d, col, type, keep_first, append, remove);
}

 apop_data * apop_data_to_dummies_base(apop_data *d, int col, char type, int keep_first, char append, char remove){
#endif
    if (type == 'd'){
        //The vector was requested, so complain only if it is missing.
        Apop_stopif((col == -1) && !d->vector, apop_return_data_error(d), 0, "You asked for the vector element "
                            "(col==-1) but the data's vector element is NULL.");
        Apop_stopif((col != -1) && (col >= d->matrix->size2), apop_return_data_error(d), 0, "You asked for the matrix element %i "
                            "but the data's matrix element has only %zu columns.", col, d->matrix->size2);
    } else Apop_stopif(col >= d->textsize[1], apop_return_data_error(d), 0, "You asked for the text element %i but "
                            "the data's text element has only %zu elements.", col, d->textsize[1]);
    apop_data *fdummy;
    apop_data *dummies= dummies_and_factors_core(d, col, type, keep_first, 0, 'd', &fdummy);
    //Now process the append and remove options.
    //rm_list is indexed by column, so size it by the column count.
    size_t orig_size = d->matrix ? d->matrix->size2 : 0;
    int rm_list[orig_size+1];
    memset (rm_list, 0, (orig_size+1)*sizeof(int));
    if (append =='i'){
        apop_data **split = apop_data_split(d, col+1, 'c');
        //stack names, then matrices
        for (int i=0; i < d->names->colct; i++)
            free(d->names->col[i]);
        apop_name_stack(d->names, split[0]->names, 'c');
        for (int k = d->names->colct; k < (split[0]->matrix ? 
split[0]->matrix->size2 : 0); k++) apop_name_add(d->names, "", 'c'); //pad so the name stacking is aligned (if needed) apop_name_stack(d->names, dummies->names, 'c'); apop_name_stack(d->names, split[1]->names, 'c'); gsl_matrix_free(d->matrix); d->matrix = apop_matrix_stack(split[0]->matrix, dummies->matrix, 'c'); apop_data_free(dummies); apop_data_free(split[0]); apop_matrix_stack(d->matrix, split[1]->matrix, 'c', .inplace='y'); apop_data_free(split[1]); return d; } if (remove!='n' && type!='t'){ rm_list[col]=1; apop_data_rm_columns(d, rm_list); } if (append =='y' || append == 'e' || append ==1 || (append=='i' && type=='t')){ d = apop_data_stack(d, dummies, 'c', .inplace='y'); apop_data_free(dummies); return d; } return dummies; } /** Convert a column of text or numbers into a column of numeric factors, which you can use for a multinomial probit/logit, for example. If you don't run this on your data first, \ref apop_probit and \ref apop_logit default to running it on the vector or (if no vector) zeroth column of the matrix of the input \ref apop_data set, because those models need a list of the unique values of the dependent variable. \param data The data set to be modified in place. (No default. If \c NULL, returns \c NULL and a warning) \param intype If \c 't', then \c incol refers to text, if \c 'd', refers to the vector or matrix. (default = \c 't') \param incol The column in the text that will be converted. -1 is the vector. (default = 0) \param outcol The column in the data set where the numeric factors will be written (-1 means the vector). (default = 0) For example: \code apop_data *d = apop_query_to_mixed_data("mmt", "select 0, year, color from data"); apop_data_to_factors(d); \endcode Notice that the query pulled a column of zeros for the sake of saving room for the factors. It reads column zero of the text, and writes it to column zero of the matrix. 
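
The numbering that results can be sketched without the library. This is a minimal illustration, assuming that factors are handed out in sorted order of the distinct values (as the unique-element functions above produce them); \c cmp_str and \c strings_to_factors are hypothetical names, not part of Apophenia.

```c
#include <stdlib.h>
#include <string.h>

// Compare two array slots that each hold a string pointer.
static int cmp_str(const void *a, const void *b){
    return strcmp(*(char *const *)a, *(char *const *)b);
}

// Write the factor index of each of the n strings in `in` to `factors`.
// Returns the number of distinct categories, or 0 for empty input or
// a failed allocation.
static size_t strings_to_factors(char *const *in, size_t n, double *factors){
    if (!n) return 0;
    char **sorted = malloc(n * sizeof *sorted);
    if (!sorted) return 0;
    memcpy(sorted, in, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_str);

    size_t ct = 1;  // squeeze repeats out of the sorted list
    for (size_t i = 1; i < n; i++)
        if (strcmp(sorted[i], sorted[ct-1])) sorted[ct++] = sorted[i];

    // A category's factor is its position in the sorted, distinct list.
    for (size_t i = 0; i < n; i++){
        char **hit = bsearch(&in[i], sorted, ct, sizeof *sorted, cmp_str);
        factors[i] = (double)(hit - sorted);
    }
    free(sorted);
    return ct;
}
```

Given {"red", "blue", "red", "green"}, the factors are 2, 0, 2, 1: blue is category 0, green is 1, and red is 2.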
Another example:
\code
apop_data *d  = apop_query_to_data("select type, year from data");
apop_data_to_factors(d, .intype='d', .incol=0, .outcol=0);
\endcode

Here, the \c type column is converted to sequential integer factors and those factors overwrite the original data. Since a reference table is added as a second page of the \ref apop_data set, you can recover the original values as needed.

\return A table of the factors used in the code. This is an \c apop_data set with only one column of text. Also, I add a page named "\<categories for your_var\>" giving a reference table of names and column numbers (where your_var is the appropriate column heading); use \ref apop_data_get_factor_names to retrieve that table.

\exception out->error=='a' allocation error.
\exception out->error=='d' dimension error.

\li If the vector or matrix you wanted to write to is \c NULL, I will allocate it for you.
\li See \ref fact for further discussion.
\li See the documentation for \ref apop_logit for a sample linear model using this function.
\li This function uses the \ref designated syntax for inputs.

\see \ref apop_data_to_dummies
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_data_to_factors(apop_data *data, char intype, int incol, int outcol){
#else
apop_varad_head(apop_data *, apop_data_to_factors){
    apop_data *apop_varad_var(data, NULL)
    Apop_stopif(!data, return NULL, 1, "You sent me a NULL data set. 
Returning NULL.");
    int apop_varad_var(incol, 0)
    int apop_varad_var(outcol, 0)
    char apop_varad_var(intype, 't')
    return apop_data_to_factors_base(data, intype, incol, outcol);
}

 apop_data * apop_data_to_factors_base(apop_data *data, char intype, int incol, int outcol){
#endif
    if (intype=='t'){
        Apop_stopif(incol >= data->textsize[1], apop_return_data_error(d), 0, "You asked for the text column %i but the "
                            "data's text has only %zu elements.", incol, data->textsize[1]);
    } else {
        Apop_stopif((incol == -1) && !data->vector, apop_return_data_error(d), 0, "You asked for the vector of the data set but there is none.");
        Apop_stopif((incol != -1) && !data->matrix, apop_return_data_error(d), 0, "You asked for the matrix column %i but the matrix is NULL.", incol);
        Apop_stopif((incol != -1) && (incol >= data->matrix->size2), apop_return_data_error(d), 0, "You asked for the matrix column %i but "
                            "the matrix has only %zu elements.", incol, data->matrix->size2);
    }
    //The output needs one entry per row of the input.
    if (!data->vector && outcol == -1) //allocate a vector for the user.
        data->vector = gsl_vector_alloc(intype=='t' ? data->textsize[0] : data->matrix->size1);
    if (!data->matrix && outcol >= 0) //allocate a matrix for the user.
        //With no matrix, a numeric input column can only be the vector.
        data->matrix = gsl_matrix_calloc(intype=='t' ? data->textsize[0] : data->vector->size, outcol+1);
    apop_data *out;
    dummies_and_factors_core(data, incol, intype, 1, outcol, 'f', &out);
    return out;
}

/** Deprecated. Use \ref apop_data_to_factors.

Convert a column of text in the text portion of an \c apop_data set into a column of numeric elements, which you can use for a multinomial probit, for example.

\param d The data set to be modified in place.
\param datacol The column in the data set where the numeric factors will be written (-1 means the vector, which I will allocate for you if it is \c NULL)
\param textcol The column in the text that will be converted.
For example:
\code
apop_data *d  = apop_query_to_mixed_data("mmt", "select 1, year, color from data");
apop_text_to_factors(d, 0, 0);
\endcode

Notice that the query pulled a column of ones for the sake of saving room for the factors.

\return A table of the factors used in the code. This is an \c apop_data set with only one column of text. Also, the \c more element is a reference table of names and column numbers.

\exception out->error=='d' dimension error.
*/
apop_data *apop_text_to_factors(apop_data *d, size_t textcol, int datacol){
    //Report the text column that was out of range, not the output column.
    Apop_stopif(textcol >= d->textsize[1], apop_return_data_error(d), 0, "You asked for the text element %zu but the data's "
                        "text has only %zu elements.", textcol, d->textsize[1]);
    if (!d->vector && datacol == -1) //allocate a vector for the user.
        d->vector = gsl_vector_alloc(d->textsize[0]);
    apop_data *out;
    dummies_and_factors_core(d, textcol, 't', 1, datacol, 'f', &out);
    return out;
}

/** Also known as \f$R^2\f$. Let \f$Y\f$ be the dependent variable, \f$\epsilon\f$ the residual, \f$n\f$ the number of data points, and \f$k\f$ the number of independent vars (including the constant). Returns an \ref apop_data set with the following entries (in the vector element):

\li \f$ SST \equiv \sum (Y_i - \bar Y) ^2 \f$
\li \f$ SSE \equiv \sum \epsilon ^2 \f$
\li \f$ R^2 \equiv 1 - {SSE\over SST} \f$
\li \f$ R^2_{adj} \equiv R^2 - {(k-1)\over (n-k-1)}(1-R^2) \f$

Internally allocates (and frees) a vector the size of your data set.

\return A \f$5 \times 1\f$ apop_data table with the following fields:
\li "R squared"
\li "R squared adj"
\li "SSE"
\li "SST"
\li "SSR"

If the output is in \c sss, use apop_data_get(sss, .rowname="SSE") to get the SSE, and so on for the other items.

\param m A model. I use the pointer to the data set used for estimation and the info page named \c "\<Predicted\>". The Predicted page should include observed, expected, and residual columns, which I use to generate the sums of squared errors and residuals, et cetera.
All generalized linear models produce a page with this name and of this form, as do a host of other models. Nothing keeps you from finding the \f$R^2\f$ of, say, a kernel smooth; it is up to you to determine whether such a thing is appropriate to your given models and situation.

\li apop_estimate(yourdata, apop_ols) does this automatically.
\li If I don't find a "\<Predicted\>" page, print an error (iff apop_opts.verbose >=0) and return \c NULL.
\li The number of observations equals the number of rows in the Predicted page.
\li The number of independent variables, needed only for the adjusted \f$R^2\f$, is from the number of columns in the main data set's matrix (i.e. the first page; i.e. the set of parameters if this is the \c parameters output from a model estimation).
\li If your data (first page again) has a \c weights vector, I will find weighted SSE, SST, and SSR (and calculate the \f$R^2\f$s using those values).
*/
apop_data *apop_estimate_coefficient_of_determination (apop_model *m){
    double sse, sst, rsq, adjustment;
    size_t indep_ct= m->data->matrix->size2 - 1;
    gsl_vector *weights = m->data->weights; //typically NULL.
    apop_data *expected = apop_data_get_page(m->info, "<Predicted>");
    Apop_stopif(!expected, return NULL, 0, "I couldn't find a \"<Predicted>\" page in your data set. Returning NULL.\n");
    apop_data *out = apop_data_alloc(); //allocated after the check, so an early return leaks nothing.
    size_t obs = expected->matrix->size1;
    Apop_col_tv(expected, "residual", v)
    if (!weights)
        gsl_blas_ddot(v, v, &sse);
    else {
        gsl_vector *v_times_w = apop_vector_copy(weights);
        gsl_vector_mul(v_times_w, v);
        gsl_blas_ddot(v_times_w, v, &sse);
        gsl_vector_free(v_times_w);
    }
    gsl_vector *vv = Apop_cv(expected, 0);
    sst = apop_vector_var(vv, m->data->weights) * (vv->size-1);
    rsq = 1. - (sse/sst);
    adjustment = ((obs -1.) 
/(obs - indep_ct)) * (1.-rsq) ; apop_data_add_named_elmt(out, "R squared", rsq); apop_data_add_named_elmt(out, "R squared adj", 1 - adjustment); apop_data_add_named_elmt(out, "SSE", sse); apop_data_add_named_elmt(out, "SST", sst); apop_data_add_named_elmt(out, "SSR", sst - sse); return out; } /** \def apop_estimate_r_squared(in) A synonym for \ref apop_estimate_coefficient_of_determination, q.v. \hideinitializer */ apophenia-1.0+ds/apop_settings.c000066400000000000000000000113731262736346100167620ustar00rootroot00000000000000/** \file Specifying model characteristics and details of estimation methods. */ /* Copyright (c) 2008--2009, 2011, 2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" static size_t get_settings_ct(apop_model *model){ int ct =0; if (!model->settings) return 0; while (model->settings[ct].name[0] !='\0') ct++; return ct; } //The Dan J Bernstein string hashing algorithm. //Could conceivably save a lot of time under certain settings-heavy circumstances. static unsigned long apop_settings_hash(char *str){ unsigned long int hash = 5381; char c; while ((c = *str++)) hash = hash*33 + c; return hash; } /* Remove a settings group from a model. Use \ref Apop_settings_rm_group. That macro uses this function internally. */ void apop_settings_remove_group(apop_model *m, char *delme){ if (!m->settings) return; int i = 0; int ct = get_settings_ct(m); unsigned long delme_hash = apop_settings_hash(delme); while (m->settings[i].name[0] !='\0'){ if (m->settings[i].name_hash == delme_hash){ ((void (*)(void*))m->settings[i].free)(m->settings[i].setting_group); for (int j=i+1; j< ct+1; j++) //don't forget the null sentinel. m->settings[j-1] = m->settings[j]; i--; } i++; } // apop_assert_void(0, 1, 'c', "I couldn't find %s in the input model, so nothing was removed.", delme); } /* Don't use this function. It's what the \c Apop_model_add_group macro uses internally. Use that. 
*/ void *apop_settings_group_alloc(apop_model *model, char *type, void *free_fn, void *copy_fn, void *the_group){ if(apop_settings_get_grp(model, type, 'c')) apop_settings_remove_group(model, type); int ct = get_settings_ct(model); model->settings = realloc(model->settings, sizeof(apop_settings_type)*(ct+2)); model->settings[ct] = (apop_settings_type) { .setting_group = the_group, .name_hash = apop_settings_hash(type), .free= free_fn, .copy = copy_fn }; strncpy(model->settings[ct].name, type, 100); model->settings[ct+1] = (apop_settings_type) { }; return model->settings[ct].setting_group; } //need this for the apop_settings_model_group_alloc macro. apop_model *apop_settings_group_alloc_wm(apop_model *model, char *type, void *free_fn, void *copy_fn, void *the_group){ apop_settings_group_alloc(model, type, free_fn, copy_fn, the_group); return model; } /* This function is used internally by the macro \ref Apop_settings_get_group. Use that. */ void * apop_settings_get_grp(apop_model *m, char *type, char fail){ //Used only for finding the non-blank groups. Apop_stopif(!m, return NULL, 0, "you gave me a NULL model as input."); if (!m->settings) return NULL; int i; unsigned long type_hash = apop_settings_hash(type); for (i=0; m->settings[i].name[0] !='\0'; i++) if (type_hash == m->settings[i].name_hash) return m->settings[i].setting_group; Apop_assert(fail != 'f', "I couldn't find the settings group %s in the given model.", type); return NULL; //else, just return NULL and let the caller sort it out. } /** Copy a settings group with the given name from the second model to the first (i.e., the arguments are in memcpy order). You probably won't need this often---just use \ref apop_model_copy. \param outm The model that will receive a copy of the settings group. \param inm The model that will provide the original. \param copyme The string naming the group. For example, for an \ref apop_mcmc_settings group, this would be \c "apop_mcmc". 
\exception outm->error=='s' Error copying settings group. */ void apop_settings_copy_group(apop_model *outm, apop_model *inm, char *copyme){ if (!copyme || !strlen(copyme)) return; //apop_settings_group_alloc takes care of the blank sentinel. Apop_stopif(!inm, if (outm) outm->error = 's'; return, 0, "you asked me to copy the settings of a NULL model."); Apop_stopif(!inm->settings, return, 0, "The input model (i.e., the second argument to this function) has no settings."); void *g = apop_settings_get_grp(inm, copyme, 'c'); Apop_stopif(!g, outm->error='s'; return, 0, "Couldn't find the group you wanted me to copy. Not copying anything; setting outmodel->error='s'."); int i; unsigned long type_hash = apop_settings_hash(copyme); for (i=0; inm->settings[i].name[0] !='\0'; i++)//retrieve the index. if (type_hash == inm->settings[i].name_hash) break; void *gnew = (inm->settings[i].copy) ? ((void *(*)(void*))inm->settings[i].copy)(g) : g; apop_settings_group_alloc(outm, copyme, inm->settings[i].free, inm->settings[i].copy, gnew); } apophenia-1.0+ds/apop_sort.c000066400000000000000000000243561262736346100161160ustar00rootroot00000000000000 /** \file apop_sort.c Copyright (c) 2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include static double find_smallest_larger_than(apop_data const *sort_order, double *x){ //the next column in the sort order is the one that is not NAN, greater than x, but smaller than all other candidate values. 
double candidate_col=-100, candidate_val = INFINITY, v; if (sort_order->vector && !isnan(v=gsl_vector_get(sort_order->vector, 0)) && v > *x && v < candidate_val){ candidate_val = v; candidate_col = -1; } if (sort_order->matrix) for (int i=0; i< sort_order->matrix->size2; i++) if (!isnan(v=gsl_matrix_get(sort_order->matrix, 0, i)) && v > *x && v < candidate_val){ candidate_val = v; candidate_col = i; } if (*sort_order->textsize) for (int i=0; i< sort_order->textsize[1]; i++) if (apop_opts.nan_string && strcmp(sort_order->text[0][i], apop_opts.nan_string) && (v=atof(sort_order->text[0][i])) > *x && v < candidate_val){ candidate_val = v; candidate_col = i+0.5; } if (sort_order->weights && !isnan(v=gsl_vector_get(sort_order->weights, 0)) && v > *x && v < candidate_val){ candidate_val = v; candidate_col = -2; } if (sort_order->names->rowct) if (apop_opts.nan_string && strcmp(*sort_order->names->row, apop_opts.nan_string) && (v=atof(*sort_order->names->row)) > *x && v < candidate_val){ candidate_val = v; candidate_col = 0.2; } *x = candidate_val; return candidate_col; } static void generate_sort_order(apop_data const *data, apop_data const *sort_order, int cols_to_sort_ct, double *so){ /* the internal rule is that the vector is -1, the weights vector is -2, the names are * 0.2, and the text cols are the column+0.5. How's that for arbitrary. 
 */
    if (sort_order) {
        double x = -INFINITY;
        for (int i=0; i< cols_to_sort_ct-1; i++)
            so[i] = find_smallest_larger_than(sort_order, &x);
    } else {
        int ctr=0;
        if (data->vector) so[ctr++] = -1;
        if (data->matrix) for(int i=0; i< data->matrix->size2; i++) so[ctr++] = i;
        if (*data->textsize) for(int i=0; i< data->textsize[1]; i++) so[ctr++] = i+0.5;
        if (data->weights) so[ctr++] = -2;
        if (data->names->rowct) so[ctr++] = 0.2;
    }
    so[cols_to_sort_ct-1] = -100;
}

#include
static int find_min_unsorted(size_t *sorted, size_t height, size_t min){
    while (minnames->row[*da], d->names->row[*db]) : strcasecmp(d->text[*da][offset], d->text[*db][offset]);
}

static void rearrange(apop_data *data, size_t height, size_t *perm){
    size_t i, start=0;
    size_t *sorted = calloc(height, sizeof(size_t));
    while (1){
        i = start = find_min_unsorted(sorted, height, start);
        if (i==-1) break;
        apop_data *first_row_storage = apop_data_copy(Apop_r(data, start));
        sorted[start]++;
        while (perm[i]!=start){
            //copy from perm[i] to i
            apop_data_memcpy(Apop_r(data,i), Apop_r(data, perm[i]));
            sorted[perm[i]]++;
            i = perm[i];
        }
        apop_data_memcpy(Apop_r(data, i), first_row_storage);
        apop_data_free(first_row_storage);
    }
    free(sorted);
}

/** Sort an \ref apop_data set on an arbitrary sequence of columns.

The \c sort_order set is a one-row data set that should look like the data set being sorted. The easiest way to generate it is to use \ref Apop_r to pull one row of the table, then copy and fill it. For each column you want used in the sort, assign a ranking giving whether the column should be sorted first, second, .... Columns you don't want used in the sorting should be set to \c NAN. Ties are broken by the earlier element in the default order (see below).
E.g., to sort by the last column of a five-column matrix first, then the next-to-last column, then the next-to-next-to-last, then by the first text column, then by the second text column:

\code
apop_data *sort_order = apop_data_copy(Apop_r(data, 0));
sort_order->vector = NULL; //so it will be skipped.
Apop_data_fill(sort_order, NAN, NAN, 3, 2, 1);
apop_text_set(sort_order, 0, 0, "4");
apop_text_set(sort_order, 0, 1, "5");
apop_data_sort(data, sort_order);
\endcode

To determine which columns are sorted at which step, I use only comparisons, not the actual numeric values. For example, (1, 2, 3) and (-1.32, 0, 27) work identically. For text, I use \c atof to convert your text to a number, as in the example above that set text values of \c "4" and \c "5". A blank string, NaN numeric value, or NULL element in the \ref apop_data set means that column will not be sorted.

\li Strings are sorted case-insensitively, using \c strcasecmp. [exercise for the reader: modify the source to use Glib's locale-correct string sorting.]
\li The setup generates a lexicographic sort using the columns you specify. If you would like a different sort order, such as Euclidean distance to the origin, you can generate a new column expressing your preferred metric, then sort on that. See the example below.

\param data The data set to be sorted. If \c NULL, this function is a no-op that returns \c NULL.
\param sort_order An \ref apop_data set describing the order in which columns are used for sorting, as above. If \c NULL, then sort by the vector, then each matrix column, then text, then weights, then row names.
\param inplace If 'n', make a copy, else sort in place. (default: 'y').
\param asc If 'a', ascending; if 'd', descending. This is applied to all columns; column-by-column application is still to be done. (default: 'a').
\param col_order For internal use only.
In your call, it should be \c NULL; you can leave this off your function call entirely and the \ref designated syntax will take care of it for you.

\return A pointer to the sorted data set. If inplace=='y' (the default), then this is the same as the input set.

A few examples:

\include "sort_example.c"

\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
apop_data * apop_data_sort(apop_data *data, apop_data *sort_order, char asc, char inplace, double *col_order){
#else
apop_varad_head(apop_data *, apop_data_sort){
    apop_data * apop_varad_var(data, NULL);
    Apop_stopif(!data, return NULL, 1, "You gave me NULL data to sort. Returning NULL");
    apop_data * apop_varad_var(sort_order, NULL);
    char apop_varad_var(inplace, 'y');
    char apop_varad_var(asc, 'a');
    double * apop_varad_var(col_order, NULL);
    return apop_data_sort_base(data, sort_order, asc, inplace, col_order);
}

 apop_data * apop_data_sort_base(apop_data *data, apop_data *sort_order, char asc, char inplace, double *col_order){
#endif
    if (!data) return NULL;
    apop_data *out = inplace=='n' ? apop_data_copy(data) : data;
    apop_data *xx = sort_order ? sort_order : out;
    Get_vmsizes(xx); //firstcol, msize2
    int cols_to_sort_ct = msize2 - firstcol +1 + !!(xx->weights) + xx->textsize[1] + !!xx->names->rowct;
    double so[cols_to_sort_ct];
    if (!col_order){
        generate_sort_order(out, sort_order, cols_to_sort_ct, so);
        col_order = so;
    }
    //Text columns are coded as column+0.5 and row names as 0.2, so any non-integer code means strings.
    bool is_text = ((int)*col_order != *col_order);
    bool is_name = (*col_order == 0.2);
    gsl_vector_view c;
    gsl_vector *cc = NULL;
    if (!is_text && *col_order>=0){
        c = gsl_matrix_column(out->matrix, *col_order);
        cc = &c.vector;
    }
    gsl_vector *thiscol = cc ? cc : (*col_order==-2) ? out->weights : (*col_order==-1) ? out->vector : NULL;
    size_t height = thiscol ? thiscol->size : is_name ? 
out->names->rowct : *out->textsize; gsl_permutation *p = gsl_permutation_alloc(height); if (!is_text) gsl_sort_vector_index (p, thiscol); else { gsl_permutation_init(p); d = out; offset = is_name ? -1 : *col_order-0.5; qsort(p->data, height, sizeof(size_t), compare_strings); } size_t *perm = p->data; if (asc=='d' || asc=='D') //reverse the perm matrix. for (size_t j=0; j< height/2; j++){ double t = perm[j]; perm[j] = perm[height-1-j]; perm[height-1-j] = t; } rearrange(out, height, perm); gsl_permutation_free(p); if (col_order[1] == -100) return out; /*Second pass: find blocks where all are of the same value. After you pass a block of size > 1 row where all vals in this col are identical, sort that block, using the rest of the sort order. */ int bottom=0; if (!is_text){ double last_val = gsl_vector_get(thiscol, 0); for (int i=1; i< height+1; i++){ double this_val=0; if ((i==height || (this_val=gsl_vector_get(thiscol, i)) != last_val) && bottom != i-1){ apop_data_sort_base(Apop_rs(out, bottom, i-bottom), sort_order, 'a', 'y', col_order+1); } if (last_val != this_val) bottom = i; last_val = this_val; } } else { char *last_val = is_name ? out->names->row[0] : out->text[0][(int)(*col_order-0.5)]; for (int i=1; i< height+1; i++){ char *this_val = i==height ? NULL : is_name ? out->names->row[i] : out->text[i][(int)(*col_order-0.5)]; if ((i==height || strcasecmp(this_val, last_val)) && bottom != i-1){ apop_data_sort_base(Apop_rs(out, bottom, i-bottom), sort_order, 'a', 'y', col_order+1); } if (this_val && strcmp(last_val, this_val)) bottom = i; last_val = this_val; } } return out; } apophenia-1.0+ds/apop_stats.c000066400000000000000000001267411262736346100162660ustar00rootroot00000000000000 /** \file apop_stats.c Basic moments and some distributions. */ /* Copyright (c) 2006--2007, 2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include #include #define Check_vw \ Apop_stopif(!v, return GSL_NAN, 0, "data vector is NULL. 
Returning NaN.\n"); \ Apop_stopif(!v->size, return GSL_NAN, 0, "data vector has size 0. Returning NaN.\n"); \ Apop_stopif(weights && weights->size != v->size, return GSL_NAN, 0, "data vector has size %zu; weighting vector has size %zu. Returning NaN.\n", v->size, weights->size); /** Returns the sum of the data in the given vector. */ long double apop_vector_sum(const gsl_vector *in){ Apop_stopif(!in, return 0, 1, "You just asked me to sum a NULL. Returning zero."); long double out = 0; for (size_t i=0; i< in->size; i++) out += gsl_vector_get(in, i); return out; } /** \def apop_sum(in) An alias for \ref apop_vector_sum. Returns the sum of the data in the given vector. */ /** \def apop_mean(v) Returns the mean of the elements of the vector \c v. \param v A \ref gsl_vector. */ /** \def apop_var(in) An alias for \ref apop_vector_var. Returns the variance of the data in the given vector. */ /** Returns an unbiased estimate of the sample skew of the data in the given vector. */ double apop_vector_skew(const gsl_vector *in){ return apop_vector_skew_pop(in) * gsl_pow_2(in->size)/((in->size -1.)*(in->size -2.)); } /** Returns the sample fourth central moment of the data in the given vector. Corrections are made to produce an unbiased result as per Appendix M (PDF) of Modeling with data. \li This is an estimate of the fourth central moment without normalization. The kurtosis of a \f${\cal N}(0,1)\f$ is \f$3 \sigma^4\f$, not three, one, or zero. 
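
As a rough sketch of the normalization that this function leaves to you, the block below divides a fourth central moment by the squared variance. It works on plain C arrays rather than a \c gsl_vector, and \c central_moment and \c normalized_kurtosis are hypothetical names used for illustration only; they are not part of Apophenia.

```c
#include <stddef.h>

// Population central moment of the given order: sum_i (x_i - mean)^order / n.
// Hypothetical helper for illustration; assumes n > 0.
double central_moment(const double *x, size_t n, int order){
    double mean = 0;
    for (size_t i = 0; i < n; i++) mean += x[i] / n;
    double out = 0;
    for (size_t i = 0; i < n; i++){
        double term = 1;
        for (int k = 0; k < order; k++) term *= x[i] - mean;
        out += term / n;
    }
    return out;
}

// Fourth central moment divided by the squared (population) variance.
// For Normal data this tends to 3; subtract 3 if you want excess kurtosis.
double normalized_kurtosis(const double *x, size_t n){
    double m2 = central_moment(x, n, 2);
    return central_moment(x, n, 4) / (m2 * m2);
}
```

For the five points 1, 2, 3, 4, 5 the fourth central moment is 6.8 and the population variance is 2, so the normalized value is 6.8/4 = 1.7.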
\see \ref apop_vector_kurtosis_pop
*/
double apop_vector_kurtosis(const gsl_vector *in){
    size_t n = in->size;
    long double coeff0= n*n/(gsl_pow_3(n-1)*(gsl_pow_2(n)-3*n+3));
    long double coeff1= n*gsl_pow_2(n-1)+ (6*n-9);
    long double coeff2= n*(6*n-9);
    return coeff0 *(coeff1 * apop_vector_kurtosis_pop(in) - coeff2 * gsl_pow_2(apop_vector_var(in)*(n-1.)/n));
}

static double wskewkurt(const gsl_vector *v, const gsl_vector *w, const int exponent, const char *fn_name){
    long double wsum = 0, sumcu = 0, vv, ww, mu;
    //Using the E(x - \bar x)^3 form, which is lazy.
    mu = apop_vector_mean(v, w);
    for (size_t i=0; i< w->size; i++){
        vv = gsl_vector_get(v,i);
        ww = gsl_vector_get(w,i);
        sumcu+= ww * gsl_pow_int(vv - mu, exponent);
        wsum += ww;
    }
    double len = wsum < 1.1 ? w->size : wsum;
    return sumcu/len;
}

/** Returns the population skew \f$\left(\sum_i (x_i - \mu)^3/n\right)\f$ of the data in the given vector. Observations may be weighted.

\param v The data vector
\param weights The weight vector. Default: equal weights for all observations.
\return The weighted skew.

\li Some people like to normalize the skew by dividing by (variance)\f$^{3/2}\f$; that's not done here, so you'll have to do so separately if need be.
\li Apophenia tries to be smart about reading the weights. If the weights sum to one, then the system uses \c w->size as \f$n\f$, the number of elements. If the weights sum to more than one, then the system uses the total weight as \f$n\f$. Thus, you can use the weights as standard weightings or to represent elements that appear repeatedly. 
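
To see what "elements that appear repeatedly" means in practice, here is a minimal sketch on plain C arrays. It assumes the divisor rule used by the \c wskewkurt helper in this file (total weight below 1.1 means use the element count, otherwise use the total weight); \c weighted_third_moment is a hypothetical name, not an Apophenia function.

```c
#include <stddef.h>

// Weighted third central moment of x under weights w.
// Divisor rule (an assumption mirroring wskewkurt): if the weights sum to
// about one, divide by the element count; otherwise divide by the total weight.
double weighted_third_moment(const double *x, const double *w, size_t n){
    double wsum = 0, mean = 0, sumcu = 0;
    for (size_t i = 0; i < n; i++){
        wsum += w[i];
        mean += w[i] * x[i];
    }
    mean /= wsum;
    for (size_t i = 0; i < n; i++){
        double d = x[i] - mean;
        sumcu += w[i] * d * d * d;
    }
    double len = wsum < 1.1 ? (double)n : wsum;
    return sumcu / len;
}
```

With this rule, the data 1, 2, 3 under weights 1, 1, 2 gives the same result as the four unweighted points 1, 2, 3, 3.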
*/
#ifdef APOP_NO_VARIADIC
double apop_vector_skew_pop(gsl_vector const *v, gsl_vector const *weights){
#else
apop_varad_head(double, apop_vector_skew_pop){
    gsl_vector const * apop_varad_var(v, NULL);
    gsl_vector const * apop_varad_var(weights, NULL);
    Check_vw
    return apop_vector_skew_pop_base(v, weights);
}

 double apop_vector_skew_pop_base(gsl_vector const *v, gsl_vector const *weights){
#endif
    if (weights) return wskewkurt(v, weights, 3, "apop_vector_weighted_skew");

    //This code is cut/pasted/modified from the GSL.
    //I reimplement the skew calculation here without the division by var^3/2 that the GSL does.
    size_t n = v->size;
    long double avg = 0;
    long double mean = apop_vector_mean(v);
    for (size_t i = 0; i < n; i++) {
        const long double x = gsl_vector_get(v, i) - mean;
        avg += (gsl_pow_3(x) - avg)/(i + 1);
    }
    return avg;
}

/** Returns the population fourth central moment [\f$\sum_i (x_i - \mu)^4/n\f$] of the data in the given vector, with an optional weighting.

\param v The data vector
\param weights The weight vector. If NULL, assume equal weights.
\return The weighted kurtosis.

\li Some people like to normalize the fourth central moment by dividing by variance squared, or by subtracting three; those things are not done here, so you'll have to do them separately if desired.
\li This function uses the \ref designated syntax for inputs.
\see \ref apop_vector_kurtosis for the unbiased sample version.
*/
#ifdef APOP_NO_VARIADIC
double apop_vector_kurtosis_pop(gsl_vector const *v, gsl_vector const *weights){
#else
apop_varad_head(double, apop_vector_kurtosis_pop){
    gsl_vector const * apop_varad_var(v, NULL);
    gsl_vector const * apop_varad_var(weights, NULL);
    Check_vw
    return apop_vector_kurtosis_pop_base(v, weights);
}

 double apop_vector_kurtosis_pop_base(gsl_vector const *v, gsl_vector const *weights){
#endif
    if (weights) return wskewkurt(v, weights, 4, "apop_vector_weighted_kurtosis");

    //This code is cut/pasted/modified from the GSL. 
//I reimplement the kurtosis calculation here without the division by var^2 that the GSL does. size_t n = v->size; long double avg = 0; long double mean = apop_vector_mean(v); for (size_t i = 0; i < n; i++) { const long double x = gsl_vector_get(v, i) - mean; avg += (gsl_pow_4(x) - avg)/(i + 1); } return avg; } /** Returns the variance of the data in the given vector, given that you've already calculated the mean. \param in the vector in question \param mean the mean, which you've already calculated using \ref apop_vector_mean. \see apop_vector_var */ double apop_vector_var_m(const gsl_vector *in, const double mean){ return gsl_stats_variance_m(in->data,in->stride, in->size, mean); } /** Returns the correlation coefficient of two vectors: \f$ {\hbox{cov}(a,b)\over \sqrt{\hbox{var}(a)} \sqrt{\hbox{var}(b)}}.\f$ An example \code gsl_matrix *m = apop_text_to_data("indata")->matrix; printf("The correlation coefficient between rows two " "and three is %g.\n", apop_vector_correlation(Apop_mrv(m, 2), Apop_mrv(m, 3))); \endcode \param ina, inb Two vectors of equal length (no default, must not be NULL) \param weights Replicate weights for the observations. (default: equal weights for all observations) \li This function uses the \ref designated syntax for inputs. 
*/
#ifdef APOP_NO_VARIADIC
double apop_vector_correlation(const gsl_vector *ina, const gsl_vector *inb, const gsl_vector *weights){
#else
apop_varad_head(double, apop_vector_correlation){
    gsl_vector const * apop_varad_var(ina, NULL);
    gsl_vector const * apop_varad_var(inb, NULL);
    gsl_vector const * apop_varad_var(weights, NULL);
    return apop_vector_correlation_base(ina, inb, weights);
}

 double apop_vector_correlation_base(const gsl_vector *ina, const gsl_vector *inb, const gsl_vector *weights){
#endif
    return apop_vector_cov(ina, inb, weights) / sqrt(apop_vector_var(ina, weights) * apop_vector_var(inb, weights));
}

/** Returns the distance between two vectors, where distance is defined based on the third (optional) parameter:

 - 'e' (the default): scalar distance (standard Euclidean metric) between two vectors. \f$\sqrt{\sum_i{(a_i - b_i)^2}},\f$ where \f$i\f$ iterates over dimensions.
 - 'm' Returns the Manhattan metric distance between two vectors: \f$\sum_i{|a_i - b_i|},\f$ where \f$i\f$ iterates over dimensions.
 - 'd' The discrete norm: if \f$a = b\f$, return zero, else return one.
 - 's' The sup norm: find the dimension where \f$|a_i - b_i|\f$ is largest, return the distance along that one dimension.
 - 'l' or 'L' The \f$L_p\f$ norm, \f$\left(\sum_i{|a_i - b_i|^p}\right)^{1/p}\f$. The value of \f$p\f$ is set by the fourth (optional) argument.

\param ina First vector (No default, must not be \c NULL)
\param inb Second vector (Default = zero)
\param metric The type of metric, as above.
\param norm If you are using an \f$L_p\f$ norm, this is \f$p\f$. Must be strictly greater than zero. (default = 2)

\li The defaults are such that
\code
apop_vector_distance(v);
apop_vector_distance(v, .metric = 's');
apop_vector_distance(v, .metric = 'm');
\endcode
gives you the standard Euclidean length of \c v, the absolute value of its largest-magnitude element, and the sum of the absolute values of its elements.
\li This function uses the \ref designated syntax for inputs. 
\include test_distances.c */ #ifdef APOP_NO_VARIADIC double apop_vector_distance(const gsl_vector *ina, const gsl_vector *inb, const char metric, const double norm){ #else apop_varad_head(double, apop_vector_distance){ static threadlocal gsl_vector *zero = NULL; const gsl_vector * apop_varad_var(ina, NULL); Apop_stopif(!ina, return NAN, 1, "The first vector is NULL. Returning NAN"); const gsl_vector * apop_varad_var(inb, NULL); if (!inb){ if (!zero || zero->size !=ina->size){ if (zero) gsl_vector_free(zero); zero = gsl_vector_calloc(ina->size); } inb = zero; } const char apop_varad_var(metric, 'e'); const double apop_varad_var(norm, 2); return apop_vector_distance_base(ina, inb, metric, norm); } double apop_vector_distance_base(const gsl_vector *ina, const gsl_vector *inb, const char metric, const double norm){ #endif Apop_stopif(ina->size != inb->size, return GSL_NAN, 0, "I need equal-sized vectors, but " "you sent a vector of size %zu and a vector of size %zu. Returning NaN.", ina->size, inb->size); double dist = 0; if (metric == 'e' || metric == 'E'){ for (size_t i=0; i< ina->size; i++) dist += gsl_pow_2(gsl_vector_get(ina, i) - gsl_vector_get(inb, i)); return sqrt(dist); } if (metric == 'm' || metric == 'M'){ //redundant with vector_grid_distance, below. for (size_t i=0; i< ina->size; i++) dist += fabs(gsl_vector_get(ina, i) - gsl_vector_get(inb, i)); return dist; } if (metric == 'd' || metric == 'D'){ for (size_t i=0; i< ina->size; i++) if (gsl_vector_get(ina, i) != gsl_vector_get(inb, i)) return 1; return 0; } if (metric == 's' || metric == 'S'){ for (size_t i=0; i< ina->size; i++) dist = GSL_MAX(dist, fabs(gsl_vector_get(ina, i) - gsl_vector_get(inb, i))); return dist; } if (metric == 'l' || metric == 'L'){ for (size_t i=0; i< ina->size; i++) dist += pow(fabs(gsl_vector_get(ina, i) - gsl_vector_get(inb, i)), norm); return pow(dist, 1./norm); } Apop_stopif(1, return NAN, 1, "I couldn't find the metric type you gave, %c, in my list of supported types. 
Returning NaN", metric); } /** This function will normalize a vector, either such that it has mean zero and variance one, or ranges between zero and one, or sums to one. \param in A \c gsl_vector with the un-normalized data. \c NULL input gives \c NULL output. (No default) \param out If normalizing in place, \c NULL. If not, the address of a gsl_vector*. Do not allocate. (default = \c NULL.) \param normalization_type \c 'p': normalized vector will sum to one. E.g., start with a set of observations in bins, end with the percentage of observations in each bin. (the default)
\c 'r': normalized vector will range between zero and one. Replace each X with (X-min) / (max - min).
\c 's': normalized vector will have mean zero and (sample) variance one. Replace each X with \f$(X-\mu) / \sigma\f$, where \f$\sigma\f$ is the sample standard deviation.
\c 'm': normalize to mean zero: Replace each X with \f$(X-\mu)\f$
\b Example
\code
\endcode

\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
void apop_vector_normalize(gsl_vector *in, gsl_vector **out, const char normalization_type){
#else
apop_varad_head(void, apop_vector_normalize){
    gsl_vector * apop_varad_var(in, NULL);
    Apop_stopif(!in, return, 1, "Input vector is NULL. Doing nothing.");
    gsl_vector ** apop_varad_var(out, NULL);
    const char apop_varad_var(normalization_type, 'p');
    apop_vector_normalize_base(in, out, normalization_type);
}

 void apop_vector_normalize_base(gsl_vector *in, gsl_vector **out, const char normalization_type){
#endif
    double mu, min, max;
    if (!out) out = &in;
    else {
        *out = gsl_vector_alloc (in->size);
        gsl_vector_memcpy(*out, in);
    }
    if (normalization_type == 's'){
        mu = apop_vector_mean(in);
        Apop_stopif(!isfinite(mu), return, 0, "normalization failed: the mean of the vector is not finite.");
        gsl_vector_add_constant(*out, -mu);
        double scaling = 1./(sqrt(apop_vector_var_m(*out, 0)));
        Apop_stopif(!isfinite(scaling), return, 0, "normalization failed: 1/(std error) of the vector is not finite.");
        gsl_vector_scale(*out, scaling);
    }
    else if (normalization_type == 'r'){
        gsl_vector_minmax(in, &min, &max);
        gsl_vector_add_constant(*out, -min);
        gsl_vector_scale(*out, 1/(max-min));
    }
    else if (normalization_type == 'p'){
        long double sum = apop_sum(in);
        Apop_stopif(!sum, return, 0, "the vector sums to zero, so I can't normalize it to sum to one.");
        gsl_vector_scale(*out, 1/sum);
    }
    else if (normalization_type == 'm'){
        mu = apop_vector_mean(in);
        Apop_stopif(!isfinite(mu), return, 0, "normalization failed: the mean of the vector is not finite.");
        gsl_vector_add_constant(*out, -mu);
    }
}

/** Returns the sum of the elements of a matrix. Occasionally convenient.

\param m the matrix to be summed.
*/
long double apop_matrix_sum(const gsl_matrix *m){
    Apop_stopif(!m, return 0, 1, "You just asked me to sum a NULL. 
Returning zero.");
    long double sum = 0;
    for (size_t j=0; j< m->size1; j++)
        for (size_t i=0; i< m->size2; i++)
            sum += gsl_matrix_get(m, j, i);
    return sum;
}

/** Returns the mean of all elements of a matrix.

\param data The matrix to be averaged. If \c NULL, return zero.
\return The mean of all cells of the matrix.
*/
double apop_matrix_mean(const gsl_matrix *data){
    if (!data) return 0;
    long double avg = 0;
    int cnt = 0;
    for(size_t i=0; i < data->size1; i++)
        for(size_t j=0; j < data->size2; j++){
            double x = gsl_matrix_get(data, i,j);
            long double ratio = cnt/(cnt+1.0);
            cnt++;
            avg*= ratio;
            avg+= x/cnt;
        }
    return avg;
}

/** Returns the mean and population variance of all elements of a matrix.

\li If the input matrix is \c NULL, return \f$\mu=0, \sigma^2=NaN\f$.
\li Gives the population variance (sum of squares divided by \f$N\f$). If you want sample variance, multiply the result by \f$N/(N-1)\f$:
\code
double mu, var;
apop_data *data= apop_query_to_data("select * from indata");
apop_matrix_mean_and_var(data->matrix, &mu, &var);
size_t N = data->matrix->size1 * data->matrix->size2;
var *= N/(N-1.0);
\endcode

\param data the matrix to be averaged.
\param mean where to put the mean to be calculated.
\param var where to put the variance to be calculated.
*/
void apop_matrix_mean_and_var(const gsl_matrix *data, double *mean, double *var){
    if (!data) {*mean=0; *var=GSL_NAN; return;}
    long double avg = 0, avg2 = 0;
    size_t cnt= 0;
    long double x, ratio;
    for(size_t i=0; i < data->size1; i++)
        for(size_t j=0; j < data->size2; j++){
            x = gsl_matrix_get(data, i,j);
            ratio = cnt/(cnt+1.0);
            cnt ++;
            avg *= ratio;
            avg2 *= ratio;
            avg += x/(cnt +0.0);
            avg2 += gsl_pow_2(x)/(cnt +0.0);
        }
    *mean = avg;
    *var = avg2 - gsl_pow_2(avg); //E[x^2] - E^2[x]
}

/** Put summary information about the columns of a table (mean, std dev, variance, min, median, max) in a table.

\param indata The table to be summarized. An \ref apop_data structure. May have a weights element. 
\return An \ref apop_data structure with one row for each column in the original table, and a column for each summary statistic. \exception out->error='a' Allocation error. \li This function gives more columns than you probably want; use \ref apop_data_prune_columns to pick the ones you want to see. \li See apop_data_prune_columns for an example. */ apop_data * apop_data_summarize(apop_data *indata){ Apop_stopif(!indata, return NULL, 0, "You sent me a NULL apop_data set. Returning NULL."); Apop_stopif(!indata->matrix, return NULL, 0, "You sent me an apop_data set with a NULL matrix. Returning NULL."); apop_data *out = apop_data_alloc(indata->matrix->size2, 6); double mean, var; char rowname[10000]; //crashes on more than 10^9995 columns. apop_name_add(out->names, "mean", 'c'); apop_name_add(out->names, "std dev", 'c'); apop_name_add(out->names, "variance", 'c'); apop_name_add(out->names, "min", 'c'); apop_name_add(out->names, "median", 'c'); apop_name_add(out->names, "max", 'c'); if (indata->names !=NULL){ apop_name_stack(out->names,indata->names, 'r', 'c'); if (indata->names->title && strlen(indata->names->title)){ char *title; Asprintf(&title, "summary for %s", indata->names->title); apop_name_add(out->names, title, 'h'); free(title); } } else for (size_t i=0; i< indata->matrix->size2; i++){ sprintf(rowname, "col %zu", i); apop_name_add(out->names, rowname, 'r'); } for (size_t i=0; i< indata->matrix->size2; i++){ gsl_vector *v = Apop_cv(indata, i); if (!indata->weights){ mean = apop_vector_mean(v); var = apop_vector_var_m(v, mean); } else { mean = apop_vector_mean(v, indata->weights); var = apop_vector_var(v, indata->weights); } double *pctiles = apop_vector_percentiles(v); gsl_matrix_set(out->matrix, i, 0, mean); gsl_matrix_set(out->matrix, i, 1, sqrt(var)); gsl_matrix_set(out->matrix, i, 2, var); gsl_matrix_set(out->matrix, i, 3, pctiles[0]); gsl_matrix_set(out->matrix, i, 4, pctiles[50]); gsl_matrix_set(out->matrix, i, 5, pctiles[100]); free(pctiles); } return 
out;
}

/** Returns an array of size 101, where \c returned_vector[95] gives the value of the 95th percentile, for example. \c returned_vector[100] is always the maximum value, and \c returned_vector[0] is always the min (regardless of rounding rule).

\param data A \c gsl_vector with the data. (No default, must not be \c NULL.)
\param rounding One of \c 'u', \c 'd', or \c 'a'. Unless the number of data points is exactly a multiple of 101, some percentiles will be ambiguous. If \c 'u', then round up (use the next highest value); if \c 'd', round down to the next lowest value; if \c 'a', take the mean of the two nearest points. (Default = \c 'd'.)

\li If the rounding method is \c 'u' or \c 'a', then you can say "5% or more of the sample is below returned_vector[5]"; if \c 'd' or \c 'a', then you can say "5% or more of the sample is above returned_vector[5]".
\li You may eventually want to \c free() the array returned by this function.
\li This function uses the \ref designated syntax for inputs.
*/
#ifdef APOP_NO_VARIADIC
double * apop_vector_percentiles(gsl_vector *data, char rounding){
#else
apop_varad_head(double *, apop_vector_percentiles){
    gsl_vector *apop_varad_var(data, NULL);
    Apop_stopif(!data, return NULL, 0, "You gave me NULL data.");
    char apop_varad_var(rounding, 'd');
    return apop_vector_percentiles_base(data, rounding);
}

 double * apop_vector_percentiles_base(gsl_vector *data, char rounding){
#endif
    gsl_vector *sorted = gsl_vector_alloc(data->size);
    double *pctiles = malloc(sizeof(double) * 101);
    gsl_vector_memcpy(sorted,data);
    gsl_sort_vector(sorted);
    for(int i=0; i<101; i++){
        int index = i*(data->size-1)/100.0;
        if (rounding == 'u' && index != i*(data->size-1)/100.0)
            index ++; //index was rounded down, but should be rounded up. 
if (rounding == 'a' && index != i*(data->size-1)/100.0) pctiles[i] = (gsl_vector_get(sorted, index)+gsl_vector_get(sorted, index+1))/2.; else pctiles[i] = gsl_vector_get(sorted, index); } gsl_vector_free(sorted); return pctiles; } /** Find the mean, weighted or unweighted. \param v The data vector \param weights The weight vector. Default: assume equal weights. \return The weighted mean \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC double apop_vector_mean(gsl_vector const *v, gsl_vector const *weights){ #else apop_varad_head(double, apop_vector_mean){ gsl_vector const * apop_varad_var(v, NULL); gsl_vector const * apop_varad_var(weights, NULL); Check_vw return apop_vector_mean_base(v, weights); } double apop_vector_mean_base(gsl_vector const *v, gsl_vector const *weights){ #endif if (!weights) return gsl_stats_mean(v->data, v->stride, v->size); long double sum = 0, wsum = 0; for (size_t i=0; i< weights->size; i++){ sum += gsl_vector_get(weights, i) * gsl_vector_get(v,i); wsum += gsl_vector_get(weights, i); } return sum/wsum; } /** Find the sample variance of a vector, weighted or unweighted. \param v The data vector \param weights The weight vector. If NULL (the default), assume equal weights. \return The weighted sample variance. \li This uses (n-1) in the denominator of the sum; i.e., it corrects for the bias introduced by using \f$\bar x\f$ instead of \f$\mu\f$. \li Multiply the output by (n-1)/n if you need population variance. \li Apophenia tries to be smart about reading the weights. If weights sum to one, then the system uses \c w->size as the number of elements, and returns the usual sum over \f$n-1\f$. If weights > 1, then the system uses the total weights as \f$n\f$. Thus, you can use the weights as standard weightings or to represent elements that appear repeatedly. \li This function uses the \ref designated syntax for inputs. \see apop_vector_var_m for the case where you already have the vector's mean. 
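
The relationship between the sample and population variance noted above amounts to one multiplication. The sketch below shows it for the unweighted case on a plain C array; \c sample_variance and \c population_variance are hypothetical names for illustration, not Apophenia functions, and both assume at least two data points.

```c
#include <stddef.h>

// Sample variance: squared deviations from the mean, divided by n-1.
double sample_variance(const double *x, size_t n){
    double mean = 0, ss = 0;
    for (size_t i = 0; i < n; i++) mean += x[i] / n;
    for (size_t i = 0; i < n; i++) ss += (x[i] - mean) * (x[i] - mean);
    return ss / (n - 1.);
}

// Population variance: rescale the sample variance by (n-1)/n.
double population_variance(const double *x, size_t n){
    return sample_variance(x, n) * (n - 1.) / n;
}
```

For the points 1, 2, 3, 4, 5 the squared deviations sum to 10, so the sample variance is 10/4 = 2.5 and the population variance is 10/5 = 2.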
*/ #ifdef APOP_NO_VARIADIC double apop_vector_var(gsl_vector const *v, gsl_vector const *weights){ #else apop_varad_head(double, apop_vector_var){ gsl_vector const * apop_varad_var(v, NULL); gsl_vector const * apop_varad_var(weights, NULL); Check_vw return apop_vector_var_base(v, weights); } double apop_vector_var_base(gsl_vector const *v, gsl_vector const *weights){ #endif if (!weights) return gsl_stats_variance(v->data, v->stride, v->size); //Using the E(x^2) - E^2(x) form. long double sum = 0, wsum = 0, sumsq = 0, vv, ww; for (size_t i=0; i< weights->size; i++){ vv = gsl_vector_get(v, i); ww = gsl_vector_get(weights, i); sum += ww * vv; sumsq += ww * gsl_pow_2(vv); wsum += ww; } double len = (wsum < 1.1 ? weights->size : wsum); return (sumsq/len - gsl_pow_2(sum/len)) * len/(len -1.); } /** Find the sample covariance of a pair of vectors, with an optional weighting. This only makes sense if the weightings are identical, so the function takes only one weighting vector for both. \param v1, v2 The data vectors (no default; must not be \c NULL) \param weights The weight vector. (default equal weights for all elements) \return The sample covariance \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC double apop_vector_cov(const gsl_vector *v1, const gsl_vector *v2, const gsl_vector *weights){ #else apop_varad_head(double, apop_vector_cov){ gsl_vector const * apop_varad_var(v1, NULL); gsl_vector const * apop_varad_var(v2, NULL); gsl_vector const * apop_varad_var(weights, NULL); Apop_stopif(!v1, return GSL_NAN, 0, "first data vector is NULL. Returning NaN."); Apop_stopif(!v2, return GSL_NAN, 0, "second data vector is NULL. Returning NaN."); Apop_stopif(!v1->size, return GSL_NAN, 0, "first data vector has size 0. Returning NaN."); Apop_stopif(!v2->size, return GSL_NAN, 0, "second data vector has size 0. Returning NaN."); Apop_stopif(v1->size!= v2->size, return GSL_NAN, 0, "data vectors have sizes %zu and %zu. 
Returning NaN.", v1->size, v2->size);
    Apop_stopif(weights && ((weights->size != v1->size) || (weights->size != v2->size)), return GSL_NAN, 0, "data vectors have sizes %zu and %zu; weighting vector has size %zu. Returning NaN.", v1->size, v2->size, weights->size);
    return apop_vector_cov_base(v1, v2, weights);
}

 double apop_vector_cov_base(const gsl_vector *v1, const gsl_vector *v2, const gsl_vector *weights){
#endif
    if (!weights) return gsl_stats_covariance(v1->data, v1->stride, v2->data, v2->stride, v2->size);

    long double sum1 = 0, sum2 = 0, wsum = 0, sumsq = 0, vv1, vv2, ww;
    //Using the E(x^2) - E^2(x) form.
    for (size_t i=0; i< weights->size; i++){
        vv1 = gsl_vector_get(v1,i);
        vv2 = gsl_vector_get(v2,i);
        ww = gsl_vector_get(weights,i);
        sum1 += ww * vv1;
        sum2 += ww * vv2;
        sumsq+= ww * vv1 * vv2;
        wsum += ww;
    }
    double len = (wsum < 1.1 ? weights->size : wsum);
    return (sumsq/len - sum1*sum2/gsl_pow_2(len)) *(len/(len-1));
}

/** Returns the sample variance/covariance matrix relating each column of the matrix to each other column.

\param in An \ref apop_data set. If the weights vector is set, I'll take it into account.

\li This is the sample covariance---dividing by \f$n-1\f$, not \f$n\f$. If you need the population variance, multiply by \f$(n-1)/n\f$:
\code
apop_data *popcov = apop_data_covariance(indata);
int size=indata->matrix->size1;
gsl_matrix_scale(popcov->matrix, (size-1.)/size);
\endcode

\return An \ref apop_data set holding the variance/covariance matrix.
\exception out->error='a' Allocation error.
*/
apop_data *apop_data_covariance(const apop_data *in){
    Apop_stopif(!in, return NULL, 1, "You sent me a NULL apop_data set. Returning NULL.");
    Apop_stopif(!in->matrix, return NULL, 1, "You sent me an apop_data set with a NULL matrix. 
Returning NULL.");
    apop_data *out = apop_data_alloc(in->matrix->size2, in->matrix->size2);
    Apop_stopif(out->error, return out, 0, "allocation error.");
    for (size_t i=0; i < in->matrix->size2; i++){
        for (size_t j=i; j < in->matrix->size2; j++){
            double var = apop_vector_cov(Apop_cv(in, i), Apop_cv(in, j), in->weights);
            gsl_matrix_set(out->matrix, i,j, var);
            if (i!=j) gsl_matrix_set(out->matrix, j,i, var);
        }
    }
    apop_name_stack(out->names, in->names, 'c');
    apop_name_stack(out->names, in->names, 'r', 'c');
    return out;
}

/** Returns the matrix of correlation coefficients \f$(\sigma^2_{xy}/(\sigma_x\sigma_y))\f$ relating each column with each other.

\param in A data matrix: rows are observations, columns are variables. If you give me a weights vector, I'll use it.

\return The square correlation matrix, with dimensions equal to the number of input columns.
\exception out->error='a' Allocation error.
*/
apop_data *apop_data_correlation(const apop_data *in){
    apop_data *out = apop_data_covariance(in);
    if (!out) return NULL;
    for(size_t i=0; i< in->matrix->size2; i++){
        double std_dev = sqrt(apop_vector_var(Apop_cv(in, i), in->weights));
        gsl_vector_scale(Apop_cv(out, i), 1.0/std_dev);
        gsl_vector_scale(Apop_rv(out, i), 1.0/std_dev);
    }
    return out;
}

/** Given a vector representing a probability distribution of observations, calculate the entropy, \f$\sum_i -\ln(v_i)v_i\f$.

\li You may input a vector giving frequencies (normalized to sum to one) or counts (arbitrary sum).
\li The entropy of a data set depends only on the frequency with which elements are observed, not the value of the elements themselves. The \ref apop_data_pmf_compress function will reduce an input \ref apop_data set to one weighted line per distinct observation, and the weights would determine the entropy:
\code
apop_data *data = apop_text_to_data("indata");
apop_data_pmf_compress(data);
long double data_entropy = apop_vector_entropy(data->weights);
\endcode
\li The entropy is calculated using natural logs. 
To convert to base 2, divide by \f$\ln(2)\f$; see the example.
\li The entropy of an empty data set (\c NULL or a total weight of zero) is zero. A warning is printed when the input is \c NULL and apop_opts.verbose >=1.
\li If the input vector has negative elements, return \c NaN; a warning is printed when apop_opts.verbose >= 0.

Sample code:
\include entropy_vector.c
*/
long double apop_vector_entropy(gsl_vector *in){
    Apop_stopif(!in, return 0, 1, "Entropy of a NULL vector ≡ 0");
    Apop_stopif(!in->size, return 0, 1, "Entropy of a zero-length vector ≡ 0");//can't happen.
    //User may or may not have normalized in, so scale everything by the sum.
    long double sum = apop_vector_sum(in);
    Apop_stopif(sum<0, return NAN, 0, "Vector sums to a negative value (%Lg). Returning NaN.\n", sum);
    if (!sum) return 0;
    long double out=0;
    for (int i=0; i< in->size; i++){
        double val = gsl_vector_get(in, i)/sum;
        Apop_stopif(val<0, return NAN, 0, "negative value (%g) in vector position %i. Returning NaN.\n", val, i);
        if (!val) continue;
        out -= logl(val)*val;
    }
    return out;
}

static long double norment(apop_model *m){
    double sigma_sq = gsl_pow_2(apop_data_get(m->parameters, 1));
    return (log(2*M_PI*sigma_sq) +1)/2.;
}

double get_ll(apop_data *d, void *m){ return apop_log_likelihood(d, m); }

/** Calculate the entropy of a model: \f$\int -\ln(p(x))p(x)dx\f$, which is the expected value of \f$-\ln(p(x))\f$.

The default method is to make draws using \ref apop_model_draws, then evaluate the log likelihood at those points using the model's \c log_likelihood method.

There are a number of routines for specific models, including the \ref apop_normal and \ref apop_pmf models.

\li If you want the entropy of a data set, see \ref apop_vector_entropy.
\li The entropy is calculated using natural logs. If you prefer base-2 logs, just divide by \f$\ln(2)\f$: \c apop_model_entropy(my_model)/log(2).

\param in A parameterized \ref apop_model. 
That is, you have already used \ref apop_estimate or \ref apop_model_set_parameters to estimate/set the model parameters.
\param draws If using the default method of making random draws, how many random draws to make (default=1,000)

Sample code:
\include entropy_model.c
*/
#ifdef APOP_NO_VARIADIC
long double apop_model_entropy(apop_model *in, int draws){
#else
apop_varad_head(long double, apop_model_entropy){
    apop_model * apop_varad_var(in, NULL);
    Apop_stopif(!in, return NAN, 0, "NULL input model. Returning NaN.");
    int apop_varad_var(draws, 1000);
    return apop_model_entropy_base(in, draws);
}

 long double apop_model_entropy_base(apop_model *in, int draws){
#endif
    static int setup=0;
    if (!(setup++)){
        apop_entropy_vtable_add(norment, apop_normal);
    }
    apop_entropy_type e_fn = apop_entropy_vtable_get(in);
    if (e_fn) return e_fn(in);

    apop_data *d = apop_model_draws(in, draws);
    apop_data *lls = apop_map(d, .fn_rp=get_ll, .param=in);
    long double out = -apop_vector_mean(lls->vector);
    apop_data_free(d);
    apop_data_free(lls);
    return out;
}

//One row of the draw list: column 0 holds p_i, column 1 holds q_i.
double a_div(gsl_vector *in){
    double pi = gsl_vector_get(in, 0);
    double qi = gsl_vector_get(in, 1);
    return pi ? pi * log(pi/qi):0;
}

/** Kullback-Leibler divergence.

This measure of the divergence of one distribution from another has the form \f$ D(p,q) = \sum_i \ln(p_i/q_i) p_i \f$. Notice that it is not a distance, because there is an asymmetry between \f$p\f$ and \f$q\f$, so one can expect that \f$D(p, q) \neq D(q, p)\f$.

\param from the \f$p\f$ in the above formula. (No default; must not be \c NULL)
\param to the \f$q\f$ in the above formula. (No default; must not be \c NULL)
\param draw_ct If I do the calculation via random draws, how many? (Default = 1e5)
\param rng A \c gsl_rng. If \c NULL or number of threads is greater than 1, I'll take care of the RNG; see \ref apop_rng_get_thread. (Default = \c NULL)

This function can take empirical histogram-type models (\ref apop_pmf) or continuous models like \ref apop_loess or \ref apop_normal. 
If there is a PMF (I'll try \c from first, under the presumption that you are measuring the divergence of a fitted model from an observed data distribution), then I'll step through it for the points in the summation. \li If you have two empirical distributions in the form of \ref apop_pmf, they must be synced: if \f$p_i>0\f$ but \f$q_i=0\f$, then the function returns \c GSL_NEGINF. If apop_opts.verbose >=1 I print a message as well. If neither distribution is a PMF, then I'll take \c draw_ct random draws from \c from and evaluate at those points. \li Set apop_opts.verbose = 3 for observation-by-observation info. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC long double apop_kl_divergence(apop_model *from, apop_model *to, int draw_ct, gsl_rng *rng){ #else apop_varad_head(long double, apop_kl_divergence){ apop_model * apop_varad_var(from, NULL); apop_model * apop_varad_var(to, NULL); Apop_stopif(!from, return NAN, 0, "The first model is NULL; returning NaN."); Apop_stopif(!to, return NAN, 0, "The second model is NULL."); double apop_varad_var(draw_ct, 1e5); gsl_rng * apop_varad_var(rng, apop_rng_get_thread(-1)); return apop_kl_divergence_base(from, to, draw_ct, rng); } long double apop_kl_divergence_base(apop_model *from, apop_model *to, int draw_ct, gsl_rng *rng){ #endif double div = 0; Apop_notify(3, "p(from)\tp(to)\tfrom*log(from/to)\n"); if (*from->name && !strcmp(from->name, "PDF or sparse matrix")){ apop_data *p = from->data; apop_pmf_settings *settings = Apop_settings_get_group(from, apop_pmf); Get_vmsizes(p); //maxsize OMP_for_reduce (+:div, int i=0; i < maxsize; i++){ double pi = p->weights ? gsl_vector_get(p->weights, i)/settings->total_weight : 1./maxsize; if (!pi){ Apop_notify(3, "0\t--\t0"); continue; } //else: double qi = apop_p(Apop_r(p, i), to); Apop_notify(3,"%g\t%g\t%g", pi, qi, pi ? 
pi * log(pi/qi):0); Apop_stopif(!qi, div+=GSL_NEGINF; break, 1, "The PMFs aren't synced: from-distribution has a value where " "to-distribution doesn't (which produces infinite divergence)."); div += pi * log(pi/qi); } } else { //the version with the RNG. Apop_stopif(!from->dsize, return GSL_NAN, 0, "I need to make random draws from the 'from' model, " "but its dsize (draw size)==0. Returning NaN."); apop_data *draw_list = apop_data_alloc(draw_ct, 2); OMP_for_reduce(+:div, int i=0; i < draw_ct; i++){ double draw[from->dsize]; apop_draw(draw, apop_rng_get_thread(-1), from); gsl_matrix_view dm = gsl_matrix_view_array(draw, 1, from->dsize); double pi = apop_p(&(apop_data){.matrix=&(dm.matrix)}, from); double qi = apop_p(&(apop_data){.matrix=&(dm.matrix)}, to); apop_data_set(draw_list, i, 0, pi); apop_data_set(draw_list, i, 1, qi); Apop_notify(3,"%g\t%g\t%g", pi, qi, pi ? pi * log(pi/qi):0); Apop_stopif(!qi, div+=GSL_NEGINF; break, 1, "From-distribution has a value where " "to-distribution doesn't (which produces infinite divergence)."); } apop_vector_normalize(Apop_cv(draw_list, 0), NULL, 'p'); apop_vector_normalize(Apop_cv(draw_list, 1), NULL, 'p'); div = apop_map_sum(draw_list, .fn_v=a_div); } return div; } /** The multivariate generalization of the Gamma distribution. \f[ \Gamma_p(a)= \pi^{p(p-1)/4}\prod_{j=1}^p \Gamma\left[ a+(1-j)/2\right]. \f] Because \f$\Gamma(x)\f$ is undefined for \f$x\in\{0, -1, -2, ...\}\f$, this function returns \c NAN when \f$a+(1-j)/2\f$ takes on one of those values. See also \ref apop_multivariate_lngamma, which is more numerically stable in most cases. */ long double apop_multivariate_gamma(double a, int p){ Apop_stopif(-(a+(1-p)/2) == (int)-(a+(1-p)/2) && a+(1-p)/2 <=0, return NAN, 1, "Undefined when a + (1-p)/2 = 0, -1, -2, ... 
[you sent a=%g, p=%i]", a, p); long double out = pow(M_PI, p*(p-1.)/4.); long double factor = 1; for (int i=1; i<=p; i++) factor *= gsl_sf_gamma(a+(1-i)/2.); return out * factor; } /** The log of the multivariate generalization of the Gamma; see also \ref apop_multivariate_gamma. */ long double apop_multivariate_lngamma(double a, int p){ Apop_stopif(-(a+(1-p)/2) == (int)-(a+(1-p)/2) && a+(1-p)/2 <=0, return NAN, 1, "Undefined when a + (1-p)/2 = 0, -1, -2, ... [you sent a=%g, p=%i]", a, p); long double out = M_LNPI * p*(p-1.)/4.; for (int i=1; i<=p; i++) out += gsl_sf_lngamma(a+(1-i)/2.); return out; } static void find_eigens(gsl_matrix **subject, gsl_vector *eigenvals, gsl_matrix *eigenvecs){ gsl_eigen_symmv_workspace * w = gsl_eigen_symmv_alloc((*subject)->size1); gsl_eigen_symmv(*subject, eigenvals, eigenvecs, w); gsl_eigen_symmv_free (w); gsl_matrix_free(*subject); *subject = NULL; } static void diagonal_copy(gsl_vector *v, gsl_matrix *m, char in_or_out){ gsl_vector_view dv = gsl_matrix_diagonal(m); if (in_or_out == 'i') gsl_vector_memcpy(&(dv.vector), v); else gsl_vector_memcpy(v, &(dv.vector)); } static double diagonal_size(gsl_matrix *m){ gsl_vector_view dv = gsl_matrix_diagonal(m); return apop_sum(&dv.vector); } static double biggest_elmt(gsl_matrix *d){ return GSL_MAX(fabs(gsl_matrix_max(d)), fabs(gsl_matrix_min(d))); } /** Test whether the input matrix is positive semidefinite (PSD). A covariance matrix will always be PSD, so this function can tell you whether your matrix is a valid covariance matrix. Consider the 1x1 matrix in the upper left of the input, then the 2x2 matrix in the upper left, on up to the full matrix. If the matrix is PSD, then each of these has a positive determinant. This function thus calculates \f$N\f$ determinants for an \f$N\f$x\f$N\f$ matrix. \param m The matrix to test. If \c NULL, I will return zero---not PSD. \param semi If anything but \c 's', check for positive definite, not semidefinite. 
(default 's') See also \ref apop_matrix_to_positive_semidefinite, which will change the input to something PSD. \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC int apop_matrix_is_positive_semidefinite(gsl_matrix *m, char semi){ #else apop_varad_head(int, apop_matrix_is_positive_semidefinite){ gsl_matrix * apop_varad_var(m, NULL); Apop_stopif(!m, return 0, 1, "You gave me a NULL matrix. I will take this as not positive semidefinite; returning zero."); char apop_varad_var(semi, 's'); return apop_matrix_is_positive_semidefinite_base(m, semi); } int apop_matrix_is_positive_semidefinite_base(gsl_matrix *m, char semi){ #endif for (int i=1; i<= m->size1; i++){ gsl_matrix mv =gsl_matrix_submatrix (m, 0, 0, i, i).matrix; double det = apop_matrix_determinant(&mv); if ((semi == 'd' && det <0) || det <=0) return 0; } return 1; } void vfabs(double *x){*x = fabs(*x);} /** This function takes in a matrix and converts it in place to the `closest' positive semidefinite matrix. \param m On input, any matrix; on output, a positive semidefinite matrix. If \c NULL, return \c NaN and print an error. \return the distance between the original and new matrices. \li See also the test function \ref apop_matrix_is_positive_semidefinite. \li This function can be used as the core of a model constraint. \li Adapted from the R Matrix package's nearPD, which is Copyright (2007) Jens Oehlschlägel [under the GPL]. */ double apop_matrix_to_positive_semidefinite(gsl_matrix *m){ Apop_stopif(!m, return NAN, 0, "Got a NULL matrix. 
Returning NaN."); if (apop_matrix_is_positive_semidefinite(m)) return 0; double diffsize=0, dsize; apop_data *qdq; gsl_matrix *d = apop_matrix_copy(m); gsl_matrix *original = apop_matrix_copy(m); double orig_diag_size = fabs(diagonal_size(d)); int size = d->size1; gsl_vector *diag = gsl_vector_alloc(size); diagonal_copy(diag, d, 'o'); apop_vector_apply(diag, vfabs); double origsize = biggest_elmt(d); do { //get eigenvals apop_data *eigenvecs = apop_data_alloc(size, size); gsl_vector *eigenvals = gsl_vector_calloc(size); gsl_matrix *junk_copy = apop_matrix_copy(d); find_eigens(&junk_copy, eigenvals, eigenvecs->matrix);//junk freed here. //prune positive only int j=0; int plussize = eigenvecs->matrix->size1; int *mask = calloc(eigenvals->size , sizeof(int)); for (int i=0; i< eigenvals->size; i++) plussize -= mask[i] = (gsl_vector_get(eigenvals, i) <= 0); //construct Q = pruned eigenvals apop_data_rm_columns(eigenvecs, mask); if (!eigenvecs->matrix) break; //construct D = positive eigen diagonal apop_data *eigendiag = apop_data_calloc(0, plussize, plussize); for (int i=0; i< eigenvals->size; i++) if (!mask[i]) { apop_data_set(eigendiag, j, j, eigenvals->data[i]); j++; } // Our candidate is QDQ', symmetrized, with the old diagonal subbed in. apop_data *qd = apop_dot(eigenvecs, eigendiag); qdq = apop_dot(qd, eigenvecs, .form2='t'); for (int i=0; i< qdq->matrix->size1; i++) for (int j=i+1; j< qdq->matrix->size1; j++){ double avg = (apop_data_get(qdq, i, j) +apop_data_get(qdq, j, i)) /2.; apop_data_set(qdq, i, j, avg); apop_data_set(qdq, j, i, avg); } diagonal_copy(diag, qdq->matrix, 'i'); // Evaluate progress, clean up. 
dsize = biggest_elmt(d); gsl_matrix_sub(d, qdq->matrix); diffsize = biggest_elmt(d); apop_data_free(qd); gsl_matrix_free(d); apop_data_free(eigendiag); free(mask); apop_data_free(eigenvecs); gsl_vector_free(eigenvals); d = qdq->matrix; qdq->matrix=NULL; apop_data_free(qdq); qdq = NULL; } while (diffsize/dsize > 1e-3); apop_data *eigenvecs = apop_data_alloc(size, size); gsl_vector *eigenvals = gsl_vector_calloc(size); find_eigens(&d, eigenvals, eigenvecs->matrix);//free d here. //make eigenvalues more positive double score =0; for (int i=0; i< eigenvals->size; i++){ double v = gsl_vector_get(eigenvals, i); if (v < 1e-1){ gsl_vector_set(eigenvals, i, 1e-1); score += 1e-1 - v; } } for (int i=0; i< size; i++) assert(eigenvals->data[i] >=0); //if (score){ apop_data *eigendiag = apop_data_calloc(0, size, size); diagonal_copy(eigenvals, eigendiag->matrix, 'i'); double new_diag_size = diagonal_size(eigendiag->matrix); gsl_matrix_scale(eigendiag->matrix, orig_diag_size/new_diag_size); apop_data *qd = apop_dot(eigenvecs, eigendiag); qdq = apop_dot(qd, eigenvecs, .form2='t'); gsl_matrix_memcpy(m, qdq->matrix); apop_data_free(qd); apop_data_free(eigendiag); //} assert(apop_matrix_is_positive_semidefinite(m)); apop_data_free(qdq); gsl_vector_free(diag); apop_data_free(eigenvecs); gsl_vector_free(eigenvals); gsl_matrix_sub(original, m); return biggest_elmt(original)/origsize; } apophenia-1.0+ds/apop_tests.c000066400000000000000000000626551262736346100162750ustar00rootroot00000000000000 /** \file apop_tests.c */ /* Copyright (c) 2007 by Ben Klemens. Licensed under the GPLv2; see COPYING. 
*/ #include "apop_internal.h" static apop_data * produce_t_test_output(int df, double stat, double diff){ double pval, qval, two_tail; if (!gsl_isnan(stat)){ pval = gsl_cdf_tdist_P(stat, df); qval = 1-pval; two_tail= 2*GSL_MIN(fabs(pval-.5),fabs(qval-0.5)); } else { pval = GSL_NAN; qval = GSL_NAN; two_tail= GSL_NAN; } apop_data *out = apop_data_alloc(); apop_data_add_named_elmt(out, "mean left - right", diff); apop_data_add_named_elmt(out, "t statistic", stat); apop_data_add_named_elmt(out, "df", df); apop_data_add_named_elmt(out, "p value, 1 tail", GSL_MIN(pval,qval)); apop_data_add_named_elmt(out, "confidence, 1 tail", 1 - GSL_MIN(pval,qval)); apop_data_add_named_elmt(out, "p value, 2 tail", 1- two_tail); apop_data_add_named_elmt(out, "confidence, 2 tail", two_tail); return out; } /** Answers the question: with what confidence can I say that the means of these two columns of data are different? If \c apop_opts.verbose is >=1, then display some information to stdout, like the mean/var/count for both vectors and the t statistic. \param a one column of data \param b another column of data \return an \ref apop_data set with the following elements: mean left - right: the difference in means; if positive, first vector has larger mean, and one-tailed test is testing \f$L > R\f$, else reverse if negative.
t statistic: used for the test
df: degrees of freedom
p value, 1 tail: the p-value for a one-tailed test that one vector mean is greater than the other.
confidence, 1 tail: 1- p value.
p value, 2 tail: the p-value for the two-tailed test that left mean = right mean.
confidence, 2 tail: 1-p value

Example usage:
\code
gsl_vector *L = apop_query_to_vector("select * from data where sex='M'");
gsl_vector *R = apop_query_to_vector("select * from data where sex='F'");
apop_data *test_out = apop_t_test(L, R);
printf("Reject the null hypothesis of no difference between M and F with %g%% confidence\n", 100*apop_data_get(test_out, .rowname="confidence, 2 tail"));
\endcode

\see \ref apop_paired_t_test, which answers the question: with what confidence can I say that the mean difference between the two columns is zero?
*/
apop_data * apop_t_test(gsl_vector *a, gsl_vector *b){
    int a_count = a->size,
        b_count = b->size;
    double a_avg = apop_vector_mean(a);
    double a_var = (a_count > 1) ? apop_vector_var(a) : 0,
        b_avg = apop_vector_mean(b),
        b_var = b_count > 1 ? apop_vector_var(b): 0,
        stat = (a_avg - b_avg)/ sqrt(
                    (b_count > 1 ? b_var/(b_count-1) : 0)
                    + (a_count > 1 ? a_var/(a_count-1) : 0)
                );
    if (apop_opts.verbose >=1){
        printf("1st avg: %g; 1st std dev: %g; 1st count: %i.\n", a_avg, sqrt(a_var), a_count);
        printf("2nd avg: %g; 2nd std dev: %g; 2nd count: %i.\n", b_avg, sqrt(b_var), b_count);
        printf("t-statistic: %g.\n", stat);
    }
    int df = a_count+b_count-2;
    return produce_t_test_output(df, stat, a_avg - b_avg);
}

/** Answers the question: with what confidence can I say that the mean difference between the two columns is zero?

If apop_opts.verbose >=2, then display some information, like the mean/var/count for both vectors and the t statistic, to stderr.

\param a A column of data
\param b A matched column of data
\return an \ref apop_data set with the following elements:
mean left - right: the difference in means; if positive, first vector has larger mean, and one-tailed test is testing \f$L > R\f$, else reverse if negative.
t statistic: used for the test
df: degrees of freedom
p value, 1 tail: the p-value for a one-tailed test that one vector mean is greater than the other.
confidence, 1 tail: 1- p value.
p value, 2 tail: the p-value for the two-tailed test that left mean = right mean.
confidence, 2 tail: 1-p value \see \ref apop_t_test for an example, and for when the element-by-element difference between the vectors has no sensible interpretation. */ apop_data * apop_paired_t_test(gsl_vector *a, gsl_vector *b){ gsl_vector *diff = gsl_vector_alloc(a->size); gsl_vector_memcpy(diff, a); gsl_vector_sub(diff, b); int count = a->size; double avg = apop_vector_mean(diff), var = apop_vector_var(diff), stat = avg/ sqrt(var/(count-1)); gsl_vector_free(diff); Apop_notify(2, "avg diff: %g; diff std dev: %g; count: %i; t-statistic: %g.\n", avg, sqrt(var), count, stat); return produce_t_test_output(count-1, stat, avg); } /** Runs an F-test specified by \c q and \c c. See the chapter on hypothesis testing in Modeling With Data, p 309, which will tell you that: \f[{N-K\over q} {({\bf Q}'\hat\beta - {\bf c})' [{\bf Q}' ({\bf X}'{\bf X})^{-1} {\bf Q}]^{-1} ({\bf Q}' \hat\beta - {\bf c}) \over {\bf u}' {\bf u} } \sim F_{q,N-K},\f] and that's what this function is based on. \param est An \ref apop_model that you have already calculated. (No default) \param contrast An \ref apop_data set whose matrix represents \f${\bf Q}\f$ and whose vector represents \f${\bf c}\f$. Each row represents a hypothesis. (Defaults: if matrix is \c NULL, it is set to the identity matrix with the top row missing. If the vector is \c NULL, it is set to a zero matrix of length equal to the height of the contrast matrix. Thus, if the entire \c apop_data set is NULL or omitted, we are testing the hypothesis that all but \f$\beta_1\f$ are zero.) \return An \c apop_data set with a few variants on the confidence with which we can reject the joint hypothesis. \todo There should be a way to get OLS and GLS to store \f$(X'X)^{-1}\f$. In fact, if you did GLS, this is invalid, because you need \f$(X'\Sigma X)^{-1}\f$, and I didn't ask for \f$\Sigma\f$. \exception out->error='a' Allocation error. \exception out->error='d' dimension-matching error. \exception out->error='i' matrix inversion error. 
\exception out->error='m' GSL math error. \li There are two approaches to an \f$F\f$-test: the ANOVA approach, which is typically built around the claim that all effects but the mean are zero; and the more general regression form, which allows for any set of linear claims about the data. If you send a \c NULL contrast set, I will generate the set of linear contrasts that are equivalent to the ANOVA-type approach. This is why the top row of the default \f${\bf Q}\f$ matrix is missing: there is no hypothesis test about the coefficient for the constant term. See the example below. \li This function uses the \ref designated syntax for inputs. \include f_test.c */ #ifdef APOP_NO_VARIADIC apop_data * apop_f_test(apop_model *est, apop_data *contrast){ #else apop_varad_head(apop_data *, apop_f_test){ apop_model *apop_varad_var(est, NULL) Nullcheck_m(est, NULL); Nullcheck_d(est->data, NULL); apop_data * apop_varad_var(contrast, NULL); int free_data=0,free_matrix=0,free_vector=0; if (!contrast) contrast = apop_data_alloc(),free_data=1; if (!contrast->matrix) { int size = est->parameters->vector->size; contrast->matrix= gsl_matrix_calloc(size - 1, size); for (int i=1; i< size; i++) apop_data_set(contrast, i-1, i, 1); } if (!contrast->vector) contrast->vector = gsl_vector_calloc(contrast->matrix->size1),free_vector=1; apop_data *out = apop_f_test_base(est, contrast); if (free_data) {apop_data_free(contrast); return out;} if (free_matrix) gsl_matrix_free(contrast->matrix); if (free_vector) gsl_vector_free(contrast->vector); return out; return apop_f_test_base(est, contrast); } apop_data * apop_f_test_base(apop_model *est, apop_data *contrast){ #endif apop_data *out = apop_data_alloc(); Asprintf(&out->names->title, "F test"); size_t contrast_ct = contrast->vector->size; Apop_stopif(contrast->matrix->size1 != contrast_ct, out->error='d'; return out, 0, "I counted %zu contrasts by the size of either contrast->vector or " "est->parameters->vector->size, but you gave me a matrix 
with %zu rows. Those should match." , contrast_ct, contrast->matrix->size1);
    double f_stat, pval;
    Get_vmsizes(est->data); //msize1, msize2
    int data_df = msize1 - contrast_ct;

    //Find (\underbar x)'(\underbar x), where (\underbar x) = the data with means removed
    long double means[msize2];
    for (int i=1; i< msize2; i++)
        means[i] = apop_vector_mean(Apop_cv(est->data, i));
    means[0]=0;// don't screw with the ones column.
    apop_data *xpx = apop_data_alloc(msize2, msize2);
    Apop_stopif(xpx->error, apop_data_free(xpx); out->error='a'; return out, 0, "allocation error");
    for (int i=0; i< msize2; i++)
        for (int j=0; j< msize2; j++){
            //at this loop, we calculate one cell in the dot product
            long double total = 0;
            for (int c=0; c< msize1; c++)
                total += (gsl_matrix_get(est->data->matrix, c, i) - means[i])
                        *(gsl_matrix_get(est->data->matrix, c, j) - means[j]);
            apop_data_set(xpx, i, j, total);
        }
    apop_data xpxinv = (apop_data){.matrix=apop_matrix_inverse(xpx->matrix)};
    Apop_stopif(!xpxinv.matrix, out->error='i'; return out, 0, "inversion of X'X error");
    apop_data *qprimexpxinv = apop_dot(contrast, &xpxinv, 'm', 'm');
    apop_data *qprimexpxinvq = apop_dot(qprimexpxinv, contrast, 'm', 't');
    Apop_stopif(qprimexpxinvq->error || qprimexpxinv->error, out->error='m'; return out, 0, "broken dot");
    apop_data qprimexpxinvqinv = (apop_data){.matrix=apop_matrix_inverse(qprimexpxinvq->matrix)};
    Apop_stopif(!qprimexpxinvqinv.matrix, out->error='i'; return out, 0, "inversion of Q'(X'X)^{-1}Q error");
    apop_data_free(qprimexpxinvq);
    apop_data_free(qprimexpxinv);
    apop_data *qprimebeta = apop_dot(contrast, est->parameters, 'm', 'v');
    Apop_stopif(qprimebeta->error, out->error='m'; return out, 0, "broken dot");
    gsl_vector_sub(qprimebeta->vector, contrast->vector);
    apop_data *qprimebetaminusc_qprimexpxinvqinv = apop_dot(&qprimexpxinvqinv, qprimebeta, .form2='v');
    Apop_stopif(qprimebetaminusc_qprimexpxinvqinv->error, out->error='m'; return out, 0, "broken dot");
    gsl_blas_ddot(qprimebeta->vector, qprimebetaminusc_qprimexpxinvqinv->vector, &f_stat);
apop_data_free(xpx); apop_data_free(qprimebeta); apop_data_free(qprimebetaminusc_qprimexpxinvqinv); apop_data *r_sq_list = apop_estimate_coefficient_of_determination (est); double variance = apop_data_get(r_sq_list, .rowname="sse"); f_stat *= data_df / (variance * contrast_ct); pval = (contrast_ct > 0 && data_df > 0) ? gsl_cdf_fdist_Q(f_stat, contrast_ct, data_df): GSL_NAN; apop_data_add_named_elmt(out, "F statistic", f_stat); apop_data_add_named_elmt(out, "p value", pval); apop_data_add_named_elmt(out, "confidence", 1- pval); apop_data_add_named_elmt(out, "df1", contrast_ct); apop_data_add_named_elmt(out, "df2", data_df); return out; } static double one_chi_sq(apop_data *d, int row, int col, int n){ double rowexp = apop_vector_sum(Apop_rv(d, row))/n; double colexp = apop_vector_sum(Apop_cv(d, col))/n; double observed = apop_data_get(d, row, col); double expected = n * rowexp * colexp; return gsl_pow_2(observed - expected)/expected; } /** Run a Chi-squared test on an ANOVA table, i.e., an NxN table with the null hypothesis that all cells are equally likely. \param d The input data, which is a crosstab of various elements. They don't have to sum to one. \return A \ref apop_data set including elements named "chi squared statistic", "df", and "p value". Retrieve via, e.g., apop_data_get(out, .rowname="p value"). \see apop_test_fisher_exact */ apop_data * apop_test_anova_independence(apop_data *d){ Apop_stopif(!d || !d->matrix, return NULL, 0, "You sent me data with no matrix element. Returning NULL."); double total = 0; //You can have a one-column or one-row matrix if you want; else df = (rows-1)*(cols-1) double df = d->matrix->size1==1 ? d->matrix->size2-1 : d->matrix->size2 == 1 ? d->matrix->size1 : (d->matrix->size1 - 1)* (d->matrix->size2 - 1); Apop_stopif(!df, return NULL, 0, "You sent a degenerate matrix. 
Returning NULL.");
    int n = apop_matrix_sum(d->matrix);
    for (size_t row=0; row < d->matrix->size1; row++)
        for (size_t col=0; col < d->matrix->size2; col++)
            total += one_chi_sq(d, row, col, n);
    apop_data *out = apop_data_alloc();
    double chisq = gsl_cdf_chisq_Q(total, df);
    apop_data_add_named_elmt(out, "chi squared statistic", total);
    apop_data_add_named_elmt(out, "df", df);
    apop_data_add_named_elmt(out, "p value", chisq);
    return out;
}

static apop_data* apop_anova_one_way(char *table, char *data, char *grouping){
    //ANOVA has always just been a process of filling in a form, and
    //that's what this function does.
    apop_data *out = apop_data_calloc(3, 6);
    apop_name_add(out->names, "sum of squares", 'c');
    apop_name_add(out->names, "df", 'c');
    apop_name_add(out->names, "mean squares", 'c');
    apop_name_add(out->names, "F ratio", 'c');
    apop_name_add(out->names, "p value", 'c');
    apop_name_add(out->names, "confidence", 'c');
    apop_name_add(out->names, grouping, 'r');
    apop_name_add(out->names, "residual", 'r');
    apop_name_add(out->names, "total", 'r');

    //total sum of squares:
    apop_data* tss = apop_query_to_data("select var_pop(%s), count(*) from %s", data, table);
    Apop_stopif(!tss, apop_return_data_error('q'), 0, "Query 'select var_pop(%s), count(*) from %s' returned NULL. Does that look right to you?", data, table);
    apop_data_set(out, 2, 0, apop_data_get(tss, 0, 0)*apop_data_get(tss, 0, 1)); //total sum of squares
    double total_df = apop_data_get(tss, 0, 1);
    apop_data_set(out, 2, 1, apop_data_get(tss, 0, 1)); //total df.
//within group sum of squares: apop_data* wss = apop_query_to_data("select var_pop(%s), count(*) from %s group by %s", data, table, grouping); double group_df = wss->matrix->size1-1; apop_data_set(out, 0, 0, apop_data_get(wss, 0, 0)*group_df); //within sum of squares apop_data_set(out, 0, 1, group_df); //residuals are just total-wss apop_data_set(out, 1, 0, apop_data_get(out, 2, 0) - apop_data_get(out, 0,0)); //residual sum of squares double residual_df = total_df - group_df; apop_data_set(out, 1, 1, residual_df); //residual df apop_data_set(out, 0, 2, apop_data_get(out, 0, 0)/apop_data_get(out, 0, 1));//mean SS within apop_data_set(out, 1, 2, apop_data_get(out, 1, 0)/apop_data_get(out, 1, 1));//mean SS residual apop_data_set(out, 0, 3, apop_data_get(out, 0, 2)/apop_data_get(out, 1, 2));//F ratio apop_data_set(out, 0, 4, gsl_cdf_fdist_P(apop_data_get(out, 0, 3), group_df, residual_df));//pval apop_data_set(out, 0, 5, 1- apop_data_get(out, 0, 4));//confidence apop_data_free(tss); apop_data_free(wss); return out; } /** This function produces a traditional one- or two-way ANOVA table. It works from data in an SQL table, using queries of a form like select data from table group by grouping1, grouping2. \param table The table to be queried. Anything that can go in an SQL from clause is OK, so this can be a plain table name or a temp table specification like (select ... ), with parens. \param data The name of the column holding the count or other such data \param grouping1 The name of the first column by which to group data \param grouping2 If this is \c NULL, then the function will return a one-way ANOVA. Otherwise, the name of the second column by which to group data in a two-way ANOVA. 
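
As a rough illustration of the form being filled in, here is a minimal sketch of the textbook one-way ANOVA decomposition in plain C. It is a stand-in under stated assumptions, not this function's implementation: the names \c anova_sketch and \c one_way_anova_sketch are hypothetical, the data arrive as arrays rather than from an SQL query, every group is assumed to be non-empty, and the p value is left out because it would need an F-distribution CDF.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical container for the pieces of a one-way ANOVA table.
typedef struct {
    double ss_between, ss_within, ss_total;
    double df_between, df_within;
    double f_ratio;
} anova_sketch;

// values[i] is an observation; group[i] in 0..group_ct-1 is its group label.
// Assumes at most 16 non-empty groups, to keep the sketch free of allocation.
anova_sketch one_way_anova_sketch(const double *values, const int *group,
                                  size_t n, int group_ct){
    assert(values && group && group_ct > 1 && group_ct <= 16);
    assert(n > (size_t)group_ct);
    double sum[16] = {0}, count[16] = {0}, grand = 0;
    for (size_t i=0; i< n; i++){
        assert(group[i] >= 0 && group[i] < group_ct);
        sum[group[i]] += values[i];
        count[group[i]] += 1;
        grand += values[i];
    }
    double grand_mean = grand/(double)n;
    anova_sketch out = {0};
    // Between-group: squared distance of each group mean from the grand mean.
    for (int g=0; g< group_ct; g++){
        if (!count[g]) continue;
        double diff = sum[g]/count[g] - grand_mean;
        out.ss_between += count[g] * diff * diff;
    }
    // Within-group: squared distance of each observation from its group mean.
    for (size_t i=0; i< n; i++){
        double diff = values[i] - sum[group[i]]/count[group[i]];
        out.ss_within += diff * diff;
    }
    out.ss_total = out.ss_between + out.ss_within;
    out.df_between = group_ct - 1;
    out.df_within = (double)n - group_ct;
    out.f_ratio = (out.ss_between/out.df_between)
                  / (out.ss_within/out.df_within);
    return out;
}
```

With the values 1, 2, 3 in one group and 4, 5, 6 in another, the sketch gives a between-group sum of squares of 13.5, a within-group sum of 4, and an F ratio of 13.5 on (1, 4) degrees of freedom.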
*/
#ifdef APOP_NO_VARIADIC
apop_data* apop_anova(char *table, char *data, char *grouping1, char *grouping2){
#else
apop_varad_head(apop_data*, apop_anova){
    char *apop_varad_var(table, NULL)
    Apop_stopif(!table, return NULL, 0, "I need the name of a table in the SQL database.");
    if (!strchr(table, ')')) //if you found ()s,then it is a temp table spec.
        Apop_stopif(!apop_table_exists(table), return NULL, 0, "I couldn't find the table %s in the database.", table);
    char *apop_varad_var(data, NULL)
    Apop_stopif(!data, return NULL, 0, "I need the name of the column in the %s table with the count or other data.", table);
    char *apop_varad_var(grouping1, NULL)
    Apop_stopif(!grouping1, return NULL, 0, "I need at least grouping1, a column in the %s table.", table);
    char *apop_varad_var(grouping2, NULL)
    return apop_anova_base(table, data, grouping1, grouping2);
}

apop_data* apop_anova_base(char *table, char *data, char *grouping1, char *grouping2){
#endif
    apop_data *first = apop_anova_one_way(table, data, grouping1);
    Apop_stopif(first->error, return first, 0, "Error (%c) running one-way ANOVA.", first->error);
    if (!grouping2) return first;
    apop_data *second = apop_anova_one_way(table, data, grouping2);
    char *joined = NULL;
    Asprintf(&joined, "%s, %s", grouping1, grouping2);
    apop_data *interaction = apop_anova_one_way(table, data, joined);
    apop_data *out = apop_data_calloc(5, 6);
    apop_name_stack(out->names, first->names, 'c');
    apop_data_add_names(out, 'r', first->names->row[0], second->names->row[0], "interaction", "residual", "total");
    gsl_vector *firstrow = Apop_rv(first, 0);
    gsl_vector *secondrow = Apop_rv(second, 0);
    gsl_vector *interrow = Apop_rv(interaction, 0);
    gsl_matrix_set_row(out->matrix, 0, firstrow);
    gsl_matrix_set_row(out->matrix, 1, secondrow);
    gsl_matrix_set_row(out->matrix, 2, interrow);
    gsl_matrix_set_row(out->matrix, 4, Apop_rv(first, 2));

    //residuals are just total-wss
    apop_data_set(out, 3, 0, apop_data_get(out, 4, 0) - gsl_vector_get(firstrow, 0) -
gsl_vector_get(secondrow, 0) - gsl_vector_get(interrow, 0)); //residual sum of squares double residual_df = apop_data_get(out, 4, 1) - gsl_vector_get(firstrow, 1) - gsl_vector_get(secondrow, 1) - gsl_vector_get(interrow, 1); //residual df apop_data_set(out, 3, 1, residual_df); apop_data_set(out, 3, 2, apop_data_get(out, 3, 0)/apop_data_get(out, 3, 1));//mean SS residual apop_data_set(out, 0, 3, apop_data_get(out, 0, 2)/apop_data_get(out, 3, 2));//F ratio apop_data_set(out, 0, 4, gsl_cdf_fdist_P(apop_data_get(out, 0, 3), gsl_vector_get(firstrow, 1), residual_df));//pval apop_data_set(out, 0, 5, 1- apop_data_get(out, 0, 4));//confidence apop_data_set(out, 1, 3, apop_data_get(out, 1, 2)/apop_data_get(out, 3, 2));//F ratio apop_data_set(out, 1, 4, gsl_cdf_fdist_P(apop_data_get(out, 1, 3), gsl_vector_get(secondrow, 1), residual_df));//pval apop_data_set(out, 1, 5, 1- apop_data_get(out, 1, 4));//confidence apop_data_set(out, 2, 3, apop_data_get(out, 2, 2)/apop_data_get(out, 3, 2));//F ratio apop_data_set(out, 2, 4, gsl_cdf_fdist_P(apop_data_get(out, 2, 3), gsl_vector_get(interrow, 1), residual_df));//pval apop_data_set(out, 2, 5, 1- apop_data_get(out, 2, 4));//confidence free(joined); apop_data_free(first); apop_data_free(second); apop_data_free(interaction); return out; } /** This is a convenience function to do the lookup of a given statistic along a given distribution. You give me a statistic, its (hypothesized) distribution, and whether to use the upper tail, lower tail, or both. I will return the odds of a Type I error given the model---in statistician jargon, the \f$p\f$-value. [Type I error: odds of rejecting the null hypothesis when it is true.] For example, \code apop_test(1.3); \endcode will return the density of the standard Normal distribution that is more than 1.3 from zero. If this function returns a small value, we can be confident that the statistic is significant. 
Or, \code apop_test(1.3, "t", 10, .tail='u'); \endcode will give the appropriate odds for an upper-tailed test using the \f$t\f$-distribution with 10 degrees of freedom (e.g., a \f$t\f$-test of the null hypothesis that the statistic is less than or equal to zero). Several more distributions are supported; see below. \li For a two-tailed test (the default), this returns the density outside the range. I'll only do this for symmetric distributions. \li For an upper-tail test ('u'), this returns the density above the cutoff \li For a lower-tail test ('l'), this returns the density below the cutoff \param statistic The scalar value to be tested. \param distribution The name of the distribution; see below. \param p1 The first parameter for the distribution; see below. \param p2 The second parameter for the distribution; see below. \param tail 'u' = upper tail; 'l' = lower tail; anything else = two-tailed. (default = two-tailed) \return The odds of a Type I error given the model (the \f$p\f$-value). Here are the distributions you can use and their parameters. \c "normal" or \c "gaussian" \li p1=\f$\mu\f$, p2=\f$\sigma\f$ \li default (0, 1) \c "lognormal" \li p1=\f$\mu\f$, p2=\f$\sigma\f$ \li default (0, 1) \li Remember, \f$\mu\f$ and \f$\sigma\f$ refer to the Normal one would get after exponentiation \li One-tailed tests only \c "uniform" \li p1=lower edge, p2=upper edge \li default (0, 1) \li two-tailed tests are run relative to the center, (p1+p2)/2. \c "t" \li p1=df \li no default \c "chi squared", \c "chi", \c "chisq": \li p1=df \li no default \li One-tailed tests only; default='u' (\f$p\f$-value for typical cases) \c "f" \li p1=df1, p2=df2 \li no default \li One-tailed tests only \li This function uses the \ref designated syntax for inputs. 
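
To make the tail conventions concrete, here is a minimal sketch for the \c "uniform" case only, written in plain C with no GSL calls. The function name \c uniform_tail_sketch is hypothetical, and the two-tailed branch follows one reading of the description above (the probability mass farther from the center than the statistic); it is an illustration, not a copy of this library's code.

```c
#include <assert.h>

// Clamp x to the interval [0, 1].
static double clamp01(double x){ return x < 0 ? 0 : x > 1 ? 1 : x; }

// p-value of `statistic` under a Uniform[lo, hi] null hypothesis.
// tail: 'u' = upper, 'l' = lower, anything else = two-tailed about (lo+hi)/2.
double uniform_tail_sketch(double statistic, double lo, double hi, char tail){
    assert(hi > lo);
    double width = hi - lo;
    if (tail == 'u') return clamp01((hi - statistic)/width);
    if (tail == 'l') return clamp01((statistic - lo)/width);
    double center = (lo + hi)/2.;
    double dist = statistic > center ? statistic - center : center - statistic;
    // Mass outside [center - dist, center + dist].
    return clamp01(1 - 2*dist/width);
}
```

For a statistic of 0.9 on Uniform[0, 1], this gives about 0.1 for the upper tail, 0.9 for the lower tail, and 0.2 for the two-tailed test.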
*/ #ifdef APOP_NO_VARIADIC double apop_test(double statistic, char *distribution, double p1, double p2, char tail){ #else apop_varad_head(double, apop_test){ double apop_varad_var(statistic, 0); char* apop_varad_var(distribution, NULL); double apop_varad_var(p1, 0); double apop_varad_var(p2, 0); int is_chi = !strcasecmp(distribution, "chi squared")|| !strcasecmp(distribution, "chi") || !strcasecmp(distribution, "chisq"); Apop_stopif(!strcasecmp(distribution, "f") && (!p1 || !p2), return NAN, 0, "I need both a p1 and p2 parameter specifying the degrees of freedom."); Apop_stopif((!strcasecmp(distribution, "t") || !strcasecmp(distribution, "f") || is_chi) && !p1, return NAN, 0, "I need a p1 parameter specifying the degrees of freedom."); if (!p2 && (!distribution || !strcasecmp(distribution, "normal") || !strcasecmp(distribution, "gaussian") )) p2 = 1; if (!p2 && p1 >= 0 && !strcasecmp(distribution, "uniform")) p2 = 1; char apop_varad_var(tail, 0); if (!tail) tail = is_chi ? 'u' : 'a'; return apop_test_base(statistic, distribution, p1, p2, tail); } double apop_test_base(double statistic, char *distribution, double p1, double p2, char tail){ #endif //This is a long and boring function. I am aware that there are //clever way to make it shorter. if (!distribution || !strcasecmp(distribution, "normal") || !strcasecmp(distribution, "gaussian") ){ if (tail == 'u') return gsl_cdf_gaussian_Q(p1-statistic, p2); else if (tail == 'l') return gsl_cdf_gaussian_P(p1-statistic, p2); else return 2 * gsl_cdf_gaussian_Q(fabs(p1-statistic), p2); } else if (!strcasecmp(distribution, "lognormal")){ if (tail == 'u') return gsl_cdf_lognormal_Q(statistic, p1, p2); else if (tail == 'l') return gsl_cdf_lognormal_P(statistic, p1, p2); else Apop_assert(0, "A two-tailed test doesn't really make sense for the lognormal. 
Please specify either tail= 'u' or tail= 'l'."); } else if (!strcasecmp(distribution, "t")){ if (tail == 'u') return gsl_cdf_tdist_Q(statistic, p1); else if (tail == 'l') return gsl_cdf_tdist_P(statistic, p1); else return 2 * gsl_cdf_tdist_Q(fabs(statistic), p1); } else if (!strcasecmp(distribution, "f")){ if (tail == 'u') return gsl_cdf_fdist_Q(statistic, p1, p2); else if (tail == 'l') return gsl_cdf_fdist_P(statistic, p1, p2); else Apop_assert(0, "A two-tailed test doesn't really make sense for the %s. Please specify either tail= 'u' or tail= 'l'.", distribution); } else if (!strcasecmp(distribution, "chi squared")|| !strcasecmp(distribution, "chi") || !strcasecmp(distribution, "chisq")){ if (tail == 'u') return gsl_cdf_chisq_Q(statistic, p1); else if (tail == 'l') return gsl_cdf_chisq_P(statistic, p1); else Apop_assert(0, "A two-tailed test doesn't really make sense for the %s. Please specify either tail= 'u' or tail= 'l'.", distribution); } else if (!strcasecmp(distribution, "uniform")){ if (tail == 'u') return gsl_cdf_flat_Q(statistic, p1, p2); else if (tail == 'l') return gsl_cdf_flat_P(statistic, p1, p2); else return 2 * gsl_cdf_flat_Q(fabs(statistic - (p1+p2)/2.), p1, p2); } Apop_assert(0, "Sorry, but I don't recognize %s as a distribution", distribution); } apophenia-1.0+ds/apop_update.c000066400000000000000000000273471262736346100164140ustar00rootroot00000000000000 /** \file The \ref apop_update function. */ /* Copyright (c) 2006--2009, 2014 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include /* This file in four parts: --an apop_model named product, purpose-built for apop_update to send to apop_model_metropolis --apop_mcmc settings and their defaults. --apop_update and its equipment, which has three cases: --conjugates, in which case see the functions --call Metropolis */ /* This will be used by apop_update to send to apop_mcmc below. 
To set it up, add a more pointer to an array of two models, the prior and likelihood. The total likelihood of a data point is (likelihood these parameters are drawn from prior)*(likelihood of these parameters and the data set using the likelihood fn) */ static long double product_ll(apop_data *d, apop_model *m){ apop_model **pl = m->more; gsl_vector *v = apop_data_pack(m->parameters); apop_data_unpack(v, pl[1]->parameters); gsl_vector_free(v); return apop_log_likelihood(m->parameters, pl[0]) + apop_log_likelihood(d, pl[1]); } static long double product_constraint(apop_data *data, apop_model *m){ apop_model **pl = m->more; gsl_vector *v = apop_data_pack(m->parameters); apop_data_unpack(v, pl[1]->parameters); gsl_vector_free(v); return pl[1]->constraint(data, pl[1]); } apop_model *product = &(apop_model){"product of two models", .log_likelihood=product_ll, .constraint=product_constraint}; ///////////the conjugate table static apop_model *betabinom(apop_data *data, apop_model *prior, apop_model *likelihood){ apop_model *outp = apop_model_copy(prior); if (!data && likelihood->parameters){ double n = likelihood->parameters->vector->data[0]; double p = likelihood->parameters->vector->data[1]; *gsl_vector_ptr(outp->parameters->vector, 0) += n*p; *gsl_vector_ptr(outp->parameters->vector, 1) += n*(1-p); } else { gsl_vector *hits = Apop_cv(data, 1); gsl_vector *misses = Apop_cv(data, 0); *gsl_vector_ptr(outp->parameters->vector, 0) += apop_sum(hits); *gsl_vector_ptr(outp->parameters->vector, 1) += apop_sum(misses); } return outp; } double countup(double in){return in!=0;} static apop_model *betabernie(apop_data *data, apop_model *prior, apop_model *likelihood){ apop_model *outp = apop_model_copy(prior); Get_vmsizes(data);//tsize double sum = apop_map_sum(data, .fn_d=countup, .part='a'); *gsl_vector_ptr(outp->parameters->vector, 0) += sum; *gsl_vector_ptr(outp->parameters->vector, 1) += tsize - sum; return outp; } static apop_model *gammaexpo(apop_data *data, apop_model 
*prior, apop_model *likelihood){ apop_model *outp = apop_model_copy(prior); Get_vmsizes(data); //maxsize *gsl_vector_ptr(outp->parameters->vector, 0) += maxsize; apop_data_set(outp->parameters, 1, .val=1./ (1./apop_data_get(outp->parameters, 1) + (data->matrix ? apop_matrix_sum(data->matrix) : 0) + (data->vector ? apop_sum(data->vector) : 0))); return outp; } static apop_model *gammapoisson(apop_data *data, apop_model *prior, apop_model *likelihood){ /* Posterior alpha = alpha_0 + sum x; posterior beta = beta_0/(beta_0*n + 1) */ apop_model *outp = apop_model_copy(prior); Get_vmsizes(data); //vsize, msize1,maxsize *gsl_vector_ptr(outp->parameters->vector, 0) += (vsize ? apop_sum(data->vector): 0) + (msize1 ? apop_matrix_sum(data->matrix): 0); double *beta = gsl_vector_ptr(outp->parameters->vector, 1); *beta = *beta/(*beta * maxsize + 1); return outp; } static apop_model *normnorm(apop_data *data, apop_model *prior, apop_model *likelihood){ /* output \f$(\mu, \sigma) = (\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^n x_i}{\sigma^2})/(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}), (\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2})^{-1}\f$ That is, the output is weighted by the number of data points for the likelihood. If you give me a parametrized normal, with no data, then I'll take the weight to be \f$n=1\f$. 
*/ double mu_like, var_like; long int n; apop_model *outp = apop_model_copy(prior); apop_prep(data, outp); long double mu_pri = prior->parameters->vector->data[0]; long double var_pri = gsl_pow_2(prior->parameters->vector->data[1]); if (!data && likelihood->parameters){ mu_like = likelihood->parameters->vector->data[0]; var_like = gsl_pow_2(likelihood->parameters->vector->data[1]); n = 1; } else { n = data->matrix->size1 * data->matrix->size2; apop_matrix_mean_and_var(data->matrix, &mu_like, &var_like); } gsl_vector_set(outp->parameters->vector, 0, (mu_pri/var_pri + n*mu_like/var_like)/(1/var_pri + n/var_like)); gsl_vector_set(outp->parameters->vector, 1, pow((1/var_pri + n/var_like), -.5)); return outp; } /** Take in a prior and likelihood distribution, and output a posterior distribution. \li This function first checks a table of conjugate distributions for the pair you sent in. If the models are listed on the table, then the function returns a corresponding closed-form model with updated parameters. \li If the parameters aren't in the table of conjugate, and the prior distribution has a \c p or \c log_likelihood element, then use \ref apop_model_metropolis to generate the posterior. If you expect MCMC to run, you may add an \ref apop_mcmc_settings group to your prior to control the details of the search. See also the \ref apop_model_metropolis documentation. \li If the prior does not have a \c p or \c log_likelihood but does have a \c draw element, then make draws from the prior and weight them by the \c p given by the likelihood distribution. This is not a rejection sampling method, so the burnin is ignored. \param data The input data, that will be used by the likelihood function (default = \c NULL.) \param prior The prior \ref apop_model. If the system needs to estimate the posterior via MCMC, this needs to have a \c log_likelihood or \c p method. (No default, must not be \c NULL.) \param likelihood The likelihood \ref apop_model. 
If the system needs to estimate the posterior via MCMC, this needs to have a \c log_likelihood or \c p method (ll preferred). (No default, must not be \c NULL.)
\param rng A \c gsl_rng, already initialized (e.g., via \ref apop_rng_alloc). (default: an RNG from \ref apop_rng_get_thread)
\return an \ref apop_model struct representing the posterior, with updated parameters.

\li In all cases, the output is a \ref apop_model that can be used as the input to this function, so you can chain Bayesian updating procedures.

\li Here are the conjugate distributions currently defined:

<table>
<tr><th> Prior </th><th> Likelihood </th><th> Notes </th></tr>
<tr><td> \ref apop_beta "Beta" </td><td> \ref apop_binomial "Binomial" </td><td> </td></tr>
<tr><td> \ref apop_beta "Beta" </td><td> \ref apop_bernoulli "Bernoulli" </td><td> </td></tr>
<tr><td> \ref apop_exponential "Exponential" </td><td> \ref apop_gamma "Gamma" </td><td> Gamma likelihood represents the distribution of \f$\lambda^{-1}\f$, not plain \f$\lambda\f$ </td></tr>
<tr><td> \ref apop_normal "Normal" </td><td> \ref apop_normal "Normal" </td><td> Assumes prior with fixed \f$\sigma\f$; updates distribution for \f$\mu\f$ </td></tr>
<tr><td> \ref apop_gamma "Gamma" </td><td> \ref apop_poisson "Poisson" </td><td> Uses sum and size of the data </td></tr>
</table>
Here is a test function that compares the output via conjugate table and via Metropolis-Hastings sampling: \include test_updating.c \li The conjugate table is stored using a vtable; see \ref vtables for details. If you are writing a new vtable entry, the typedef new functions must conform to and the hash used for lookups are: \code typedef apop_model *(*apop_update_type)(apop_data *, apop_model , apop_model); #define apop_update_hash(m1, m2) ((size_t)(m1).draw + (size_t)((m2).log_likelihood ? (m2).log_likelihood : (m2).p)*33) \endcode \li This function uses the \ref designated syntax for inputs. */ #ifdef APOP_NO_VARIADIC apop_model * apop_update(apop_data *data, apop_model *prior, apop_model *likelihood, gsl_rng *rng){ #else apop_varad_head(apop_model *, apop_update){ apop_data *apop_varad_var(data, NULL); apop_model *apop_varad_var(prior, NULL); apop_model *apop_varad_var(likelihood, NULL); gsl_rng *apop_varad_var(rng, apop_rng_get_thread(-1)); return apop_update_base(data, prior, likelihood, rng); } apop_model * apop_update_base(apop_data *data, apop_model *prior, apop_model *likelihood, gsl_rng *rng){ #endif static int setup=0; if (!(setup++)){ apop_update_vtable_add(betabinom, apop_beta, apop_binomial); apop_update_vtable_add(betabernie, apop_beta, apop_bernoulli); apop_update_vtable_add(gammaexpo, apop_gamma, apop_exponential); apop_update_vtable_add(gammapoisson, apop_gamma, apop_poisson); apop_update_vtable_add(normnorm, apop_normal, apop_normal); } apop_update_type conj = apop_update_vtable_get(prior, likelihood); if (conj) return conj(data, prior, likelihood); apop_mcmc_settings *s = apop_settings_get_group(prior, apop_mcmc); apop_prep(NULL, prior); //probably a no-op apop_prep(data, likelihood); //probably a no-op gsl_vector *pack = apop_data_pack(likelihood->parameters); int tsize = pack->size; gsl_vector_free(pack); Apop_stopif(prior->dsize != tsize, return apop_model_copy(&(apop_model){.error='d'}), 0, "Size of a draw from the prior does not match " 
"the size of the likelihood's parameters (%i != %i).%s", prior->dsize, tsize, (tsize > prior->dsize) ? " Perhaps use apop_model_fix_params to reduce the " "likelihood's parameter count?" : ""); if (prior->p || prior->log_likelihood){ apop_model *p = apop_model_copy(product); //pending revision, a memory leak: p->more = malloc(sizeof(apop_model*)*2); ((apop_model**)p->more)[0] = apop_model_copy(prior); ((apop_model**)p->more)[1] = apop_model_copy(likelihood); p->more_size = sizeof(apop_model*) * 2; p->parameters = apop_data_alloc(prior->dsize); p->data = data; if (s) apop_settings_copy_group(p, prior, "apop_mcmc"); apop_model *out = apop_model_metropolis(data, rng, p); return out; } Apop_stopif(!prior->draw, return NULL, 0, "prior does not have a .p, .log_likelihood, or .draw element. I am stumped. Returning NULL."); if (!s) s = Apop_model_add_group(prior, apop_mcmc); gsl_vector *draw = gsl_vector_alloc(tsize); apop_data *out = apop_data_alloc(s->periods, tsize); out->weights = gsl_vector_alloc(s->periods); apop_draw(draw->data, rng, prior); //set starting point. apop_data_unpack(draw, likelihood->parameters); for (int i=0; i< s->periods; i++){ newdraw: apop_draw(draw->data, rng, prior); apop_data_unpack(draw, likelihood->parameters); long double p = apop_p(data, likelihood); Apop_notify(3, "p=%Lg for parameters:\t", p); if (apop_opts.verbose >=3) apop_data_print(likelihood->parameters); Apop_stopif(gsl_isnan(p), goto newdraw, 1, "Trouble evaluating the " "likelihood function at vector beginning with %g. 
" "Throwing it out and trying again.\n" , likelihood->parameters->vector->data[0]); apop_data_pack(likelihood->parameters, Apop_rv(out, i)); gsl_vector_set(out->weights, i, p); } apop_model *outp = apop_estimate(out, apop_pmf); gsl_vector_free(draw); return outp; } apophenia-1.0+ds/apop_vtables.c000066400000000000000000000061251262736346100165610ustar00rootroot00000000000000#include #include #include "apop_internal.h" //just for OMP_critical #ifdef _OPENMP #include #define lock omp_set_lock(&v->mutex); #define unlock omp_unset_lock(&v->mutex); #else #define lock #define unlock #endif /** \cond doxy_ignore */ typedef struct { size_t hash; void *fn; } apop_vtable_elmt_s; typedef struct { char const *name; unsigned long int hashed_name; int elmt_ct; apop_vtable_elmt_s *elmts; #ifdef _OPENMP omp_lock_t mutex; #endif } apop_vtable_s; apop_vtable_s *vtable_list; int ignore_me; /** \endcond */ //End of Doxygen ignore. //The Dan J Bernstein string hashing algorithm. static unsigned long apop_settings_hash(char const *str){ unsigned long int hash = 5381; char c; while ((c = *str++)) hash = hash*33 + c; return hash; } static apop_vtable_s *find_tab(unsigned long h, int *ctr){ apop_vtable_s *v = vtable_list; *ctr = 0; for ( ; v->hashed_name; (*ctr)++, v++) if (v->hashed_name== h) break; return v; } //return 0 = found; removed //return 1 = not found; no-op int apop_vtable_drop(char const *tabname, unsigned long hash){ if (!vtable_list) return 1; unsigned long h = apop_settings_hash(tabname); apop_vtable_s *v = find_tab(h, &ignore_me); lock for (int i=0; i< v->elmt_ct; i++) if (hash == v->elmts[i].hash) { /* only the elmt_ct-i-1 elements after position i need to shift down */ memmove(v->elmts+i, v->elmts+i+1, sizeof(apop_vtable_elmt_s)*(v->elmt_ct-i-1)); v->elmt_ct--; unlock return 0; } unlock return 1; } int apop_vtable_add(char const *tabname, void *fn_in, unsigned long hash){ if (!vtable_list){vtable_list = calloc(1, sizeof(apop_vtable_s));} unsigned long h = apop_settings_hash(tabname); int ctr; apop_vtable_s *v; //add a table if need be.
OMP_critical (new_vtable) { v = find_tab(h, &ctr); if (!v->hashed_name){ vtable_list = realloc(vtable_list, (ctr+2)* sizeof(apop_vtable_s)); vtable_list[ctr] = (apop_vtable_s){.elmts=calloc(1, sizeof(apop_vtable_elmt_s))}; vtable_list[ctr+1] = (apop_vtable_s){ }; #ifdef _OPENMP omp_init_lock(&vtable_list[ctr].mutex); omp_set_lock(&vtable_list[ctr].mutex); #endif vtable_list[ctr].name = tabname; vtable_list[ctr].hashed_name = h; v = vtable_list+ctr; unlock } } lock //If this hash is already present, don't re-add. for (int i=0; i< v->elmt_ct; i++) if (hash == v->elmts[i].hash) {unlock; return 0;} //insert v->elmts = realloc(v->elmts, (++(v->elmt_ct))* sizeof(apop_vtable_elmt_s)); v->elmts[v->elmt_ct-1] = (apop_vtable_elmt_s){.hash=hash, .fn=fn_in}; unlock return 0; } void *apop_vtable_get(char const *tabname, unsigned long hash){ if (!vtable_list) return NULL; unsigned long thash = apop_settings_hash(tabname); apop_vtable_s *v = find_tab(thash, &ignore_me); if (!v->hashed_name) return NULL; lock for (int i=0; i< v->elmt_ct; i++) if (hash == v->elmts[i].hash) {unlock; return v->elmts[i].fn;} unlock return NULL; } apophenia-1.0+ds/apophenia.map000066400000000000000000000172021262736346100163770ustar00rootroot00000000000000LIBAPOPHENIA_2.0.0 { global: apop_opts; apop_name_alloc; apop_name_add; apop_name_free; apop_name_print; apop_name_stack_base; variadic_apop_name_stack; apop_name_copy; apop_name_find; apop_data_add_names_base; apop_data_free_base; apop_data_alloc_base; variadic_apop_data_alloc; apop_data_calloc_base; variadic_apop_data_calloc; apop_data_stack_base; variadic_apop_data_stack; apop_data_split; apop_data_copy; apop_data_rm_columns; apop_data_memcpy; apop_data_ptr_base; variadic_apop_data_ptr; apop_data_get_base; variadic_apop_data_get; apop_data_set_base; variadic_apop_data_set; apop_data_add_named_elmt; apop_text_set; apop_text_alloc; apop_text_free; apop_data_transpose_base; variadic_apop_data_transpose; apop_matrix_realloc; apop_vector_realloc; 
apop_data_prune_columns_base; apop_data_get_page_base; variadic_apop_data_get_page; apop_data_add_page; apop_data_rm_page_base; variadic_apop_data_rm_page; apop_data_rm_rows_base; variadic_apop_data_rm_rows; apop_model_draws_base; variadic_apop_model_draws; apop_vector_copy; apop_vector_to_matrix_base; variadic_apop_vector_to_matrix; apop_matrix_copy; apop_db_to_crosstab_base; variadic_apop_db_to_crosstab; apop_array_to_vector_base; variadic_apop_array_to_vector; apop_text_to_data_base; variadic_apop_text_to_data; apop_text_to_db_base; variadic_apop_text_to_db; apop_data_rank_expand; apop_data_rank_compress_base; variadic_apop_data_rank_compress; apop_crosstab_to_db; apop_data_pack_base; variadic_apop_data_pack; apop_data_unpack_base; variadic_apop_data_unpack; apop_data_fill_base; apop_vector_fill_base; apop_text_fill_base; apop_beta; apop_bernoulli; apop_binomial; apop_chi_squared; apop_dirichlet; apop_exponential; apop_f_distribution; apop_gamma; apop_improper_uniform; apop_iv; apop_kernel_density; apop_loess; apop_logit; apop_lognormal; apop_multinomial; apop_multivariate_normal; apop_normal; apop_ols; apop_pmf; apop_poisson; apop_probit; apop_t_distribution; apop_uniform; apop_wls; apop_yule; apop_zipf; apop_coordinate_transform; apop_composition; apop_dconstrain; apop_mixture; apop_cross; apop_model_free; variadic_apop_model_print; apop_model_print_base; apop_model_show; apop_model_copy; apop_model_clear; apop_estimate; apop_score; apop_log_likelihood; apop_p; apop_cdf; apop_draw; apop_prep; apop_parameter_model; apop_predict; apop_beta_from_mean_var; apop_model_set_parameters_base; apop_model_mixture_base; apop_model_cross_base; apop_map_base; variadic_apop_map; apop_map_sum_base; variadic_apop_map_sum; apop_matrix_map; apop_vector_map; apop_matrix_apply; apop_vector_apply; apop_matrix_map_all; apop_matrix_apply_all; apop_vector_map_sum; apop_matrix_map_sum; apop_matrix_map_all_sum; apop_matrix_print_base; variadic_apop_matrix_print; apop_vector_print_base; 
variadic_apop_vector_print; apop_data_print_base; variadic_apop_data_print; apop_matrix_show; apop_vector_show; apop_data_show; apop_vector_mean_base; variadic_apop_vector_mean; apop_vector_var_base; variadic_apop_vector_var; apop_vector_skew_pop_base; variadic_apop_vector_skew_pop; apop_vector_kurtosis_pop_base; variadic_apop_vector_kurtosis_pop; apop_vector_cov_base; variadic_apop_vector_cov; apop_vector_distance_base; variadic_apop_vector_distance; apop_vector_normalize_base; variadic_apop_vector_normalize; apop_data_covariance; apop_data_correlation; apop_vector_entropy; apop_matrix_sum; apop_matrix_mean; apop_matrix_mean_and_var; apop_data_summarize; apop_vector_percentiles_base; variadic_apop_vector_percentiles; apop_test_fisher_exact; apop_matrix_is_positive_semidefinite_base; variadic_apop_matrix_is_positive_semidefinite; apop_matrix_to_positive_semidefinite; apop_multivariate_gamma; apop_multivariate_lngamma; apop_t_test; apop_paired_t_test; apop_anova_base; variadic_apop_anova; apop_f_test_base; variadic_apop_f_test; apop_text_unique_elements; apop_vector_unique_elements; apop_data_to_factors_base; variadic_apop_data_to_factors; apop_data_get_factor_names_base; variadic_apop_data_get_factor_names; apop_data_to_dummies_base; variadic_apop_data_to_dummies; apop_model_entropy_base; variadic_apop_model_entropy; apop_kl_divergence_base; variadic_apop_kl_divergence; apop_estimate_coefficient_of_determination; apop_estimate_parameter_tests; apop_jackknife_cov; apop_bootstrap_cov_base; variadic_apop_bootstrap_cov; apop_rng_alloc; apop_rng_GHgB3; apop_rng_get_thread_base; apop_arms_draw; apop_numerical_gradient_base; variadic_apop_numerical_gradient; apop_model_hessian_base; variadic_apop_model_hessian; apop_model_numerical_covariance_base; variadic_apop_model_numerical_covariance; apop_maximum_likelihood; apop_estimate_restart_base; variadic_apop_estimate_restart; apop_linear_constraint_base; variadic_apop_linear_constraint; apop_model_fix_params; 
apop_model_fix_params_get_base; apop_vtable_add; apop_vtable_get; apop_vtable_drop; apop_update_type_check; apop_entropy_type_check; apop_score_type_check; apop_parameter_model_type_check; apop_predict_type_check; apop_model_print_type_check; apop_generalized_harmonic; apop_test_anova_independence; apop_regex_base; variadic_apop_regex; apop_system; apop_vector_moving_average; apop_histograms_test_goodness_of_fit; apop_test_kolmogorov; apop_data_pmf_compress; apop_data_to_bins_base; variadic_apop_data_to_bins; apop_model_to_pmf_base; variadic_apop_model_to_pmf; apop_text_paste_base; variadic_apop_text_paste; apop_data_listwise_delete_base; variadic_apop_data_listwise_delete; apop_ml_impute; apop_model_metropolis_base; variadic_apop_model_metropolis; apop_update_base; variadic_apop_update; apop_test_base; variadic_apop_test; apop_data_sort_base; variadic_apop_data_sort; apop_rake_base; variadic_apop_rake; apop_det_and_inv; apop_dot_base; variadic_apop_dot; apop_vector_bounded_base; variadic_apop_vector_bounded; apop_matrix_inverse; apop_matrix_determinant; apop_matrix_pca_base; variadic_apop_matrix_pca; apop_vector_stack_base; variadic_apop_vector_stack; apop_matrix_stack_base; variadic_apop_matrix_stack; apop_vector_log; apop_vector_log10; apop_vector_exp; apop_vector_sum; apop_vector_var_m; apop_vector_correlation_base; variadic_apop_vector_correlation; apop_vector_kurtosis; apop_vector_skew; apop_table_exists_base; variadic_apop_table_exists; apop_db_open; apop_db_close_base; variadic_apop_db_close; apop_query; apop_query_to_text; apop_query_to_data; apop_query_to_mixed_data; apop_query_to_vector; apop_query_to_float; apop_data_to_db; apop_settings_get_grp; apop_settings_remove_group; apop_settings_copy_group; apop_settings_group_alloc; apop_settings_group_alloc_wm; apop_lm_settings_init; apop_lm_settings_copy; apop_lm_settings_free; apop_pm_settings_init; apop_pm_settings_copy; apop_pm_settings_free; apop_pmf_settings_init; apop_pmf_settings_copy; 
apop_pmf_settings_free; apop_mle_settings_init; apop_mle_settings_copy; apop_mle_settings_free; apop_cdf_settings_init; apop_cdf_settings_copy; apop_cdf_settings_free; apop_arms_settings_init; apop_arms_settings_copy; apop_arms_settings_free; apop_mcmc_settings_init; apop_mcmc_settings_copy; apop_mcmc_settings_free; apop_loess_settings_init; apop_loess_settings_copy; apop_loess_settings_free; apop_cross_settings_init; apop_cross_settings_copy; apop_cross_settings_free; apop_mixture_settings_init; apop_mixture_settings_copy; apop_mixture_settings_free; apop_dconstrain_settings_init; apop_dconstrain_settings_copy; apop_dconstrain_settings_free; apop_composition_settings_init; apop_composition_settings_copy; apop_composition_settings_free; apop_parts_wanted_settings_init; apop_parts_wanted_settings_copy; apop_parts_wanted_settings_free; apop_kernel_density_settings_init; apop_kernel_density_settings_copy; apop_kernel_density_settings_free; apop_coordinate_transform_settings_init; apop_coordinate_transform_settings_copy; apop_coordinate_transform_settings_free; local: *; }; apophenia-1.0+ds/apophenia.pc.in000066400000000000000000000005001262736346100166220ustar00rootroot00000000000000prefix=@prefix@ exec_prefix=@exec_prefix@ bindir=@bindir@ libdir=@libdir@ includedir=@includedir@ Name: Apophenia Description: The Apophenia library URL: http://apophenia.info/ Requires: gsl Requires.private: sqlite3 Version: @VERSION@ Cflags: @MYSQL_CFLAGS@ Libs: -L${libdir} -lapophenia Libs.private: @MYSQL_LDFLAGS@ apophenia-1.0+ds/asprintf.c000066400000000000000000001265141262736346100157350ustar00rootroot00000000000000 /** \cond doxy_ignore */ /* Formatted output to strings. Copyright (C) 1999, 2002 Free Software Foundation, Inc. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ /* Mashed into one file by BK */ /* Tell glibc's to provide a prototype for snprintf(). This must come before because may include , and once has been included, it's too late. */ #ifndef _GNU_SOURCE # define _GNU_SOURCE 1 #endif /* vsprintf with automatic memory allocation. */ #include #include #include #ifdef HAVE_CONFIG_H # include #endif #if HAVE_ASPRINTF #else #ifndef __attribute__ /* This feature is available in gcc versions 2.5 and later. */ # if __GNUC__ < 2 || (__GNUC__ == 2 && __GNUC_MINOR__ < 5) || __STRICT_ANSI__ # define __attribute__(Spec) /* empty */ # endif /* The __-protected variants of `format' and `printf' attributes are accepted by gcc versions 2.6.4 (effectively 2.7) and later. */ # if __GNUC__ < 2 || (__GNUC__ == 2 && __GNUC_MINOR__ < 7) # define __format__ format # define __printf__ printf # endif #endif #ifdef __cplusplus extern "C" { #endif /* Write formatted output to a string dynamically allocated with malloc(). If the memory allocation succeeds, store the address of the string in *RESULT and return the number of resulting bytes, excluding the trailing NUL. Upon memory allocation error, or some other error, return -1. */ extern int asprintf (char **result, const char *format, ...) __attribute__ ((__format__ (__printf__, 2, 3))); extern int vasprintf (char **result, const char *format, va_list args) __attribute__ ((__format__ (__printf__, 2, 0))); //from vasnprintf extern char * asnprintf (char *resultbuf, size_t *lengthp, const char *format, ...) 
__attribute__ ((__format__ (__printf__, 3, 4))); extern char * vasnprintf (char *resultbuf, size_t *lengthp, const char *format, va_list args) __attribute__ ((__format__ (__printf__, 3, 0))); #ifdef __cplusplus } #endif /* Decomposed printf argument list. */ /* Get wint_t. */ #ifdef HAVE_WINT_T # include #endif /* Argument types */ typedef enum { TYPE_NONE, TYPE_SCHAR, TYPE_UCHAR, TYPE_SHORT, TYPE_USHORT, TYPE_INT, TYPE_UINT, TYPE_LONGINT, TYPE_ULONGINT, #ifdef HAVE_LONG_LONG TYPE_LONGLONGINT, TYPE_ULONGLONGINT, #endif TYPE_DOUBLE, #ifdef HAVE_LONG_DOUBLE TYPE_LONGDOUBLE, #endif TYPE_CHAR, #ifdef HAVE_WINT_T TYPE_WIDE_CHAR, #endif TYPE_STRING, #ifdef HAVE_WCHAR_T TYPE_WIDE_STRING, #endif TYPE_POINTER, TYPE_COUNT_SCHAR_POINTER, TYPE_COUNT_SHORT_POINTER, TYPE_COUNT_INT_POINTER, TYPE_COUNT_LONGINT_POINTER #ifdef HAVE_LONG_LONG , TYPE_COUNT_LONGLONGINT_POINTER #endif } arg_type; /* Polymorphic argument */ typedef struct { arg_type type; union { signed char a_schar; unsigned char a_uchar; short a_short; unsigned short a_ushort; int a_int; unsigned int a_uint; long int a_longint; unsigned long int a_ulongint; #ifdef HAVE_LONG_LONG long long int a_longlongint; unsigned long long int a_ulonglongint; #endif float a_float; double a_double; #ifdef HAVE_LONG_DOUBLE long double a_longdouble; #endif int a_char; #ifdef HAVE_WINT_T wint_t a_wide_char; #endif const char* a_string; #ifdef HAVE_WCHAR_T const wchar_t* a_wide_string; #endif void* a_pointer; signed char * a_count_schar_pointer; short * a_count_short_pointer; int * a_count_int_pointer; long int * a_count_longint_pointer; #ifdef HAVE_LONG_LONG long long int * a_count_longlongint_pointer; #endif } a; } argument; typedef struct { size_t count; argument *arg; } arguments; /* Fetch the arguments, putting them into a. */ #ifdef STATIC STATIC #else extern #endif int printf_fetchargs (va_list args, arguments *a); /* Parse printf format string. 
*/ /* Flags */ #define FLAG_GROUP 1 /* ' flag */ #define FLAG_LEFT 2 /* - flag */ #define FLAG_SHOWSIGN 4 /* + flag */ #define FLAG_SPACE 8 /* space flag */ #define FLAG_ALT 16 /* # flag */ #define FLAG_ZERO 32 /* arg_index value indicating that no argument is consumed. */ #define ARG_NONE (~(size_t)0) /* A parsed directive. */ typedef struct { const char* dir_start; const char* dir_end; int flags; const char* width_start; const char* width_end; size_t width_arg_index; const char* precision_start; const char* precision_end; size_t precision_arg_index; char conversion; /* d i o u x X f e E g G c s p n U % but not C S */ size_t arg_index; } char_directive; /* A parsed format string. */ typedef struct { size_t count; char_directive *dir; size_t max_width_length; size_t max_precision_length; } char_directives; /* Parses the format string. Fills in the number N of directives, and fills in directives[0], ..., directives[N-1], and sets directives[N].dir_start to the end of the format string. Also fills in the arg_type fields of the arguments and the needed count of arguments. */ #ifdef STATIC STATIC #else extern #endif int printf_parse (const char *format, char_directives *d, arguments *a); /*end headers */ char * asnprintf (char *resultbuf, size_t *lengthp, const char *format, ...) { va_list args; char *result; va_start (args, format); result = vasnprintf (resultbuf, lengthp, format, args); va_end (args); return result; } int asprintf (char **resultp, const char *format, ...) 
{ va_list args; int result; va_start (args, format); result = vasprintf (resultp, format, args); va_end (args); return result; } #ifdef STATIC STATIC #endif int printf_fetchargs (va_list args, arguments *a) { size_t i; argument *ap; for (i = 0, ap = &a->arg[0]; i < a->count; i++, ap++) switch (ap->type) { case TYPE_SCHAR: ap->a.a_schar = va_arg (args, /*signed char*/ int); break; case TYPE_UCHAR: ap->a.a_uchar = va_arg (args, /*unsigned char*/ int); break; case TYPE_SHORT: ap->a.a_short = va_arg (args, /*short*/ int); break; case TYPE_USHORT: ap->a.a_ushort = va_arg (args, /*unsigned short*/ int); break; case TYPE_INT: ap->a.a_int = va_arg (args, int); break; case TYPE_UINT: ap->a.a_uint = va_arg (args, unsigned int); break; case TYPE_LONGINT: ap->a.a_longint = va_arg (args, long int); break; case TYPE_ULONGINT: ap->a.a_ulongint = va_arg (args, unsigned long int); break; #ifdef HAVE_LONG_LONG case TYPE_LONGLONGINT: ap->a.a_longlongint = va_arg (args, long long int); break; case TYPE_ULONGLONGINT: ap->a.a_ulonglongint = va_arg (args, unsigned long long int); break; #endif case TYPE_DOUBLE: ap->a.a_double = va_arg (args, double); break; #ifdef HAVE_LONG_DOUBLE case TYPE_LONGDOUBLE: ap->a.a_longdouble = va_arg (args, long double); break; #endif case TYPE_CHAR: ap->a.a_char = va_arg (args, int); break; #ifdef HAVE_WINT_T case TYPE_WIDE_CHAR: ap->a.a_wide_char = va_arg (args, wint_t); break; #endif case TYPE_STRING: ap->a.a_string = va_arg (args, const char *); break; #ifdef HAVE_WCHAR_T case TYPE_WIDE_STRING: ap->a.a_wide_string = va_arg (args, const wchar_t *); break; #endif case TYPE_POINTER: ap->a.a_pointer = va_arg (args, void *); break; case TYPE_COUNT_SCHAR_POINTER: ap->a.a_count_schar_pointer = va_arg (args, signed char *); break; case TYPE_COUNT_SHORT_POINTER: ap->a.a_count_short_pointer = va_arg (args, short *); break; case TYPE_COUNT_INT_POINTER: ap->a.a_count_int_pointer = va_arg (args, int *); break; case TYPE_COUNT_LONGINT_POINTER: 
ap->a.a_count_longint_pointer = va_arg (args, long int *); break; #ifdef HAVE_LONG_LONG case TYPE_COUNT_LONGLONGINT_POINTER: ap->a.a_count_longlongint_pointer = va_arg (args, long long int *); break; #endif default: /* Unknown type. */ return -1; } return 0; } /* Get intmax_t. */ #if HAVE_STDINT_H_WITH_UINTMAX # include #endif #if HAVE_INTTYPES_H_WITH_UINTMAX # include #endif /* malloc(), realloc(), free(). */ #include /* xsize.h -- Checked size_t computations. */ #ifndef _XSIZE_H #define _XSIZE_H /* Get SIZE_MAX. */ #include #if HAVE_STDINT_H # include #endif /* The size of memory objects is often computed through expressions of type size_t. Example: void* p = malloc (header_size + n * element_size). These computations can lead to overflow. When this happens, malloc() returns a piece of memory that is way too small, and the program then crashes while attempting to fill the memory. To avoid this, the functions and macros in this file check for overflow. The convention is that SIZE_MAX represents overflow. malloc (SIZE_MAX) is not guaranteed to fail -- think of a malloc implementation that uses mmap --, it's recommended to use size_overflow_p() or size_in_bounds_p() before invoking malloc(). The example thus becomes: size_t size = xsum (header_size, xtimes (n, element_size)); void *p = (size_in_bounds_p (size) ? malloc (size) : NULL); */ /* Convert an arbitrary value >= 0 to type size_t. */ #define xcast_size_t(N) \ ((N) <= SIZE_MAX ? (size_t) (N) : SIZE_MAX) /* Sum of two sizes, with overflow check. */ static inline size_t #if __GNUC__ >= 3 __attribute__ ((__pure__)) #endif xsum (size_t size1, size_t size2) { size_t sum = size1 + size2; return (sum >= size1 ? sum : SIZE_MAX); } /* Sum of three sizes, with overflow check. */ static inline size_t #if __GNUC__ >= 3 __attribute__ ((__pure__)) #endif xsum3 (size_t size1, size_t size2, size_t size3) { return xsum (xsum (size1, size2), size3); } /* Sum of four sizes, with overflow check. 
*/ static inline size_t #if __GNUC__ >= 3 __attribute__ ((__pure__)) #endif xsum4 (size_t size1, size_t size2, size_t size3, size_t size4) { return xsum (xsum (xsum (size1, size2), size3), size4); } /* Maximum of two sizes, with overflow check. */ static inline size_t #if __GNUC__ >= 3 __attribute__ ((__pure__)) #endif xmax (size_t size1, size_t size2) { /* No explicit check is needed here, because for any n: max (SIZE_MAX, n) == SIZE_MAX and max (n, SIZE_MAX) == SIZE_MAX. */ return (size1 >= size2 ? size1 : size2); } /* Multiplication of a count with an element size, with overflow check. The count must be >= 0 and the element size must be > 0. This is a macro, not an inline function, so that it works correctly even when N is of a wider tupe and N > SIZE_MAX. */ #define xtimes(N, ELSIZE) \ ((N) <= SIZE_MAX / (ELSIZE) ? (size_t) (N) * (ELSIZE) : SIZE_MAX) /* Check for overflow. */ #define size_overflow_p(SIZE) \ ((SIZE) == SIZE_MAX) /* Check against overflow. */ #define size_in_bounds_p(SIZE) \ ((SIZE) != SIZE_MAX) #endif /* _XSIZE_H */ #if WIDE_CHAR_VERSION # define PRINTF_PARSE wprintf_parse # define CHAR_T wchar_t # define DIRECTIVE wchar_t_directive # define DIRECTIVES wchar_t_directives #else # define PRINTF_PARSE printf_parse # define CHAR_T char # define DIRECTIVE char_directive # define DIRECTIVES char_directives #endif #ifdef STATIC STATIC #endif int PRINTF_PARSE (const CHAR_T *format, DIRECTIVES *d, arguments *a) { const CHAR_T *cp = format; /* pointer into format */ size_t arg_posn = 0; /* number of regular arguments consumed */ size_t d_allocated; /* allocated elements of d->dir */ size_t a_allocated; /* allocated elements of a->arg */ size_t max_width_length = 0; size_t max_precision_length = 0; d->count = 0; d_allocated = 1; d->dir = malloc (d_allocated * sizeof (DIRECTIVE)); if (d->dir == NULL) /* Out of memory. 
*/ return -1; a->count = 0; a_allocated = 0; a->arg = NULL; #define REGISTER_ARG(_index_,_type_) \ { \ size_t n = (_index_); \ if (n >= a_allocated) \ { \ size_t memory_size; \ argument *memory; \ \ a_allocated = xtimes (a_allocated, 2); \ if (a_allocated <= n) \ a_allocated = xsum (n, 1); \ memory_size = xtimes (a_allocated, sizeof (argument)); \ if (size_overflow_p (memory_size)) \ /* Overflow, would lead to out of memory. */ \ goto error; \ memory = (a->arg \ ? realloc (a->arg, memory_size) \ : malloc (memory_size)); \ if (memory == NULL) \ /* Out of memory. */ \ goto error; \ a->arg = memory; \ } \ while (a->count <= n) \ a->arg[a->count++].type = TYPE_NONE; \ if (a->arg[n].type == TYPE_NONE) \ a->arg[n].type = (_type_); \ else if (a->arg[n].type != (_type_)) \ /* Ambiguous type for positional argument. */ \ goto error; \ } while (*cp != '\0') { CHAR_T c = *cp++; if (c == '%') { size_t arg_index = ARG_NONE; DIRECTIVE *dp = &d->dir[d->count];/* pointer to next directive */ /* Initialize the next directive. */ dp->dir_start = cp - 1; dp->flags = 0; dp->width_start = NULL; dp->width_end = NULL; dp->width_arg_index = ARG_NONE; dp->precision_start = NULL; dp->precision_end = NULL; dp->precision_arg_index = ARG_NONE; dp->arg_index = ARG_NONE; /* Test for positional argument. */ if (*cp >= '0' && *cp <= '9') { const CHAR_T *np; for (np = cp; *np >= '0' && *np <= '9'; np++) ; if (*np == '$') { size_t n = 0; for (np = cp; *np >= '0' && *np <= '9'; np++) n = xsum (xtimes (n, 10), *np - '0'); if (n == 0) /* Positional argument 0. */ goto error; if (size_overflow_p (n)) /* n too large, would lead to out of memory later. */ goto error; arg_index = n - 1; cp = np + 1; } } /* Read the flags. 
*/ for (;;) { if (*cp == '\'') { dp->flags |= FLAG_GROUP; cp++; } else if (*cp == '-') { dp->flags |= FLAG_LEFT; cp++; } else if (*cp == '+') { dp->flags |= FLAG_SHOWSIGN; cp++; } else if (*cp == ' ') { dp->flags |= FLAG_SPACE; cp++; } else if (*cp == '#') { dp->flags |= FLAG_ALT; cp++; } else if (*cp == '0') { dp->flags |= FLAG_ZERO; cp++; } else break; } /* Parse the field width. */ if (*cp == '*') { dp->width_start = cp; cp++; dp->width_end = cp; if (max_width_length < 1) max_width_length = 1; /* Test for positional argument. */ if (*cp >= '0' && *cp <= '9') { const CHAR_T *np; for (np = cp; *np >= '0' && *np <= '9'; np++) ; if (*np == '$') { size_t n = 0; for (np = cp; *np >= '0' && *np <= '9'; np++) n = xsum (xtimes (n, 10), *np - '0'); if (n == 0) /* Positional argument 0. */ goto error; if (size_overflow_p (n)) /* n too large, would lead to out of memory later. */ goto error; dp->width_arg_index = n - 1; cp = np + 1; } } if (dp->width_arg_index == ARG_NONE) { dp->width_arg_index = arg_posn++; if (dp->width_arg_index == ARG_NONE) /* arg_posn wrapped around. */ goto error; } REGISTER_ARG (dp->width_arg_index, TYPE_INT); } else if (*cp >= '0' && *cp <= '9') { size_t width_length; dp->width_start = cp; for (; *cp >= '0' && *cp <= '9'; cp++) ; dp->width_end = cp; width_length = dp->width_end - dp->width_start; if (max_width_length < width_length) max_width_length = width_length; } /* Parse the precision. */ if (*cp == '.') { cp++; if (*cp == '*') { dp->precision_start = cp - 1; cp++; dp->precision_end = cp; if (max_precision_length < 2) max_precision_length = 2; /* Test for positional argument. */ if (*cp >= '0' && *cp <= '9') { const CHAR_T *np; for (np = cp; *np >= '0' && *np <= '9'; np++) ; if (*np == '$') { size_t n = 0; for (np = cp; *np >= '0' && *np <= '9'; np++) n = xsum (xtimes (n, 10), *np - '0'); if (n == 0) /* Positional argument 0. */ goto error; if (size_overflow_p (n)) /* n too large, would lead to out of memory later. 
*/ goto error; dp->precision_arg_index = n - 1; cp = np + 1; } } if (dp->precision_arg_index == ARG_NONE) { dp->precision_arg_index = arg_posn++; if (dp->precision_arg_index == ARG_NONE) /* arg_posn wrapped around. */ goto error; } REGISTER_ARG (dp->precision_arg_index, TYPE_INT); } else { size_t precision_length; dp->precision_start = cp - 1; for (; *cp >= '0' && *cp <= '9'; cp++) ; dp->precision_end = cp; precision_length = dp->precision_end - dp->precision_start; if (max_precision_length < precision_length) max_precision_length = precision_length; } } { arg_type type; /* Parse argument type/size specifiers. */ { int flags = 0; for (;;) { if (*cp == 'h') { flags |= (1 << (flags & 1)); cp++; } else if (*cp == 'L') { flags |= 4; cp++; } else if (*cp == 'l') { flags += 8; cp++; } #ifdef HAVE_INTMAX_T else if (*cp == 'j') { if (sizeof (intmax_t) > sizeof (long)) { /* intmax_t = long long */ flags += 16; } else if (sizeof (intmax_t) > sizeof (int)) { /* intmax_t = long */ flags += 8; } cp++; } #endif else if (*cp == 'z' || *cp == 'Z') { /* 'z' is standardized in ISO C 99, but glibc uses 'Z' because the warning facility in gcc-2.95.2 understands only 'Z' (see gcc-2.95.2/gcc/c-common.c:1784). */ if (sizeof (size_t) > sizeof (long)) { /* size_t = long long */ flags += 16; } else if (sizeof (size_t) > sizeof (int)) { /* size_t = long */ flags += 8; } cp++; } else if (*cp == 't') { if (sizeof (ptrdiff_t) > sizeof (long)) { /* ptrdiff_t = long long */ flags += 16; } else if (sizeof (ptrdiff_t) > sizeof (int)) { /* ptrdiff_t = long */ flags += 8; } cp++; } else break; } /* Read the conversion character. 
*/ c = *cp++; switch (c) { case 'd': case 'i': #ifdef HAVE_LONG_LONG if (flags >= 16 || (flags & 4)) type = TYPE_LONGLONGINT; else #endif if (flags >= 8) type = TYPE_LONGINT; else if (flags & 2) type = TYPE_SCHAR; else if (flags & 1) type = TYPE_SHORT; else type = TYPE_INT; break; case 'o': case 'u': case 'x': case 'X': #ifdef HAVE_LONG_LONG if (flags >= 16 || (flags & 4)) type = TYPE_ULONGLONGINT; else #endif if (flags >= 8) type = TYPE_ULONGINT; else if (flags & 2) type = TYPE_UCHAR; else if (flags & 1) type = TYPE_USHORT; else type = TYPE_UINT; break; case 'f': case 'F': case 'e': case 'E': case 'g': case 'G': case 'a': case 'A': #ifdef HAVE_LONG_DOUBLE if (flags >= 16 || (flags & 4)) type = TYPE_LONGDOUBLE; else #endif type = TYPE_DOUBLE; break; case 'c': if (flags >= 8) #ifdef HAVE_WINT_T type = TYPE_WIDE_CHAR; #else goto error; #endif else type = TYPE_CHAR; break; #ifdef HAVE_WINT_T case 'C': type = TYPE_WIDE_CHAR; c = 'c'; break; #endif case 's': if (flags >= 8) #ifdef HAVE_WCHAR_T type = TYPE_WIDE_STRING; #else goto error; #endif else type = TYPE_STRING; break; #ifdef HAVE_WCHAR_T case 'S': type = TYPE_WIDE_STRING; c = 's'; break; #endif case 'p': type = TYPE_POINTER; break; case 'n': #ifdef HAVE_LONG_LONG if (flags >= 16 || (flags & 4)) type = TYPE_COUNT_LONGLONGINT_POINTER; else #endif if (flags >= 8) type = TYPE_COUNT_LONGINT_POINTER; else if (flags & 2) type = TYPE_COUNT_SCHAR_POINTER; else if (flags & 1) type = TYPE_COUNT_SHORT_POINTER; else type = TYPE_COUNT_INT_POINTER; break; case '%': type = TYPE_NONE; break; default: /* Unknown conversion character. */ goto error; } } if (type != TYPE_NONE) { dp->arg_index = arg_index; if (dp->arg_index == ARG_NONE) { dp->arg_index = arg_posn++; if (dp->arg_index == ARG_NONE) /* arg_posn wrapped around. 
*/ goto error; } REGISTER_ARG (dp->arg_index, type); } dp->conversion = c; dp->dir_end = cp; } d->count++; if (d->count >= d_allocated) { size_t memory_size; DIRECTIVE *memory; d_allocated = xtimes (d_allocated, 2); memory_size = xtimes (d_allocated, sizeof (DIRECTIVE)); if (size_overflow_p (memory_size)) /* Overflow, would lead to out of memory. */ goto error; memory = realloc (d->dir, memory_size); if (memory == NULL) /* Out of memory. */ goto error; d->dir = memory; } } } d->dir[d->count].dir_start = cp; d->max_width_length = max_width_length; d->max_precision_length = max_precision_length; return 0; error: if (a->arg) free (a->arg); if (d->dir) free (d->dir); return -1; } #undef DIRECTIVES #undef DIRECTIVE #undef CHAR_T #undef PRINTF_PARSE #ifndef IN_LIBINTL # include <alloca.h> #endif #include <string.h> /* memcpy(), strlen() */ #include <errno.h> /* errno */ #include <float.h> /* DBL_MAX_EXP, LDBL_MAX_EXP */ /* Some systems, like OSF/1 4.0 and Woe32, don't have EOVERFLOW. */ #ifndef EOVERFLOW # define EOVERFLOW E2BIG #endif #ifdef HAVE_WCHAR_T # ifdef HAVE_WCSLEN # define local_wcslen wcslen # else /* Solaris 2.5.1 has wcslen() in a separate library libw.so. To avoid a dependency towards this library, here is a local substitute. Define this substitute only once, even if this file is included twice in the same compilation unit. */ # ifndef local_wcslen_defined # define local_wcslen_defined 1 static size_t local_wcslen (const wchar_t *s) { const wchar_t *ptr; for (ptr = s; *ptr != (wchar_t) 0; ptr++) ; return ptr - s; } # endif # endif #endif #if WIDE_CHAR_VERSION # define VASNPRINTF vasnwprintf # define CHAR_T wchar_t # define DIRECTIVE wchar_t_directive # define DIRECTIVES wchar_t_directives # define PRINTF_PARSE wprintf_parse # define USE_SNPRINTF 1 # if HAVE_DECL__SNWPRINTF /* On Windows, the function swprintf() has a different signature than on Unix; we use the _snwprintf() function instead. */ # define SNPRINTF _snwprintf # else /* Unix.
*/ # define SNPRINTF swprintf # endif #else # define VASNPRINTF vasnprintf # define CHAR_T char # define DIRECTIVE char_directive # define DIRECTIVES char_directives # define PRINTF_PARSE printf_parse # define USE_SNPRINTF (HAVE_DECL__SNPRINTF || HAVE_SNPRINTF) # if HAVE_DECL__SNPRINTF /* Windows. */ # define SNPRINTF _snprintf # else /* Unix. */ # define SNPRINTF snprintf # endif #endif CHAR_T * VASNPRINTF (CHAR_T *resultbuf, size_t *lengthp, const CHAR_T *format, va_list args) { DIRECTIVES d; arguments a; if (PRINTF_PARSE (format, &d, &a) < 0) { errno = EINVAL; return NULL; } #define CLEANUP() \ free (d.dir); \ if (a.arg) \ free (a.arg); if (printf_fetchargs (args, &a) < 0) { CLEANUP (); errno = EINVAL; return NULL; } { size_t buf_neededlength; CHAR_T *buf; CHAR_T *buf_malloced; const CHAR_T *cp; size_t i; DIRECTIVE *dp; /* Output string accumulator. */ CHAR_T *result; size_t allocated; size_t length; /* Allocate a small buffer that will hold a directive passed to sprintf or snprintf. */ buf_neededlength = xsum4 (7, d.max_width_length, d.max_precision_length, 6); #if HAVE_ALLOCA if (buf_neededlength < 4000 / sizeof (CHAR_T)) { buf = (CHAR_T *) alloca (buf_neededlength * sizeof (CHAR_T)); buf_malloced = NULL; } else #endif { size_t buf_memsize = xtimes (buf_neededlength, sizeof (CHAR_T)); if (size_overflow_p (buf_memsize)) goto out_of_memory_1; buf = (CHAR_T *) malloc (buf_memsize); if (buf == NULL) goto out_of_memory_1; buf_malloced = buf; } if (resultbuf != NULL) { result = resultbuf; allocated = *lengthp; } else { result = NULL; allocated = 0; } length = 0; /* Invariants: result is either == resultbuf or == NULL or malloc-allocated. If length > 0, then result != NULL. */ /* Ensures that allocated >= needed. Aborts through a jump to out_of_memory if needed is SIZE_MAX or otherwise too big. */ #define ENSURE_ALLOCATION(needed) \ if ((needed) > allocated) \ { \ size_t memory_size; \ CHAR_T *memory; \ \ allocated = (allocated > 0 ? 
xtimes (allocated, 2) : 12); \ if ((needed) > allocated) \ allocated = (needed); \ memory_size = xtimes (allocated, sizeof (CHAR_T)); \ if (size_overflow_p (memory_size)) \ goto out_of_memory; \ if (result == resultbuf || result == NULL) \ memory = (CHAR_T *) malloc (memory_size); \ else \ memory = (CHAR_T *) realloc (result, memory_size); \ if (memory == NULL) \ goto out_of_memory; \ if (result == resultbuf && length > 0) \ memcpy (memory, result, length * sizeof (CHAR_T)); \ result = memory; \ } for (cp = format, i = 0, dp = &d.dir[0]; ; cp = dp->dir_end, i++, dp++) { if (cp != dp->dir_start) { size_t n = dp->dir_start - cp; size_t augmented_length = xsum (length, n); ENSURE_ALLOCATION (augmented_length); memcpy (result + length, cp, n * sizeof (CHAR_T)); length = augmented_length; } if (i == d.count) break; /* Execute a single directive. */ if (dp->conversion == '%') { size_t augmented_length; if (!(dp->arg_index == ARG_NONE)) abort (); augmented_length = xsum (length, 1); ENSURE_ALLOCATION (augmented_length); result[length] = '%'; length = augmented_length; } else { if (!(dp->arg_index != ARG_NONE)) abort (); if (dp->conversion == 'n') { switch (a.arg[dp->arg_index].type) { case TYPE_COUNT_SCHAR_POINTER: *a.arg[dp->arg_index].a.a_count_schar_pointer = length; break; case TYPE_COUNT_SHORT_POINTER: *a.arg[dp->arg_index].a.a_count_short_pointer = length; break; case TYPE_COUNT_INT_POINTER: *a.arg[dp->arg_index].a.a_count_int_pointer = length; break; case TYPE_COUNT_LONGINT_POINTER: *a.arg[dp->arg_index].a.a_count_longint_pointer = length; break; #ifdef HAVE_LONG_LONG case TYPE_COUNT_LONGLONGINT_POINTER: *a.arg[dp->arg_index].a.a_count_longlongint_pointer = length; break; #endif default: abort (); } } else { arg_type type = a.arg[dp->arg_index].type; CHAR_T *p; unsigned int prefix_count; int prefixes[2]; #if !USE_SNPRINTF size_t tmp_length; CHAR_T tmpbuf[700]; CHAR_T *tmp; /* Allocate a temporary buffer of sufficient size for calling sprintf. 
*/ { size_t width; size_t precision; width = 0; if (dp->width_start != dp->width_end) { if (dp->width_arg_index != ARG_NONE) { int arg; if (!(a.arg[dp->width_arg_index].type == TYPE_INT)) abort (); arg = a.arg[dp->width_arg_index].a.a_int; width = (arg < 0 ? (unsigned int) (-arg) : arg); } else { const CHAR_T *digitp = dp->width_start; do width = xsum (xtimes (width, 10), *digitp++ - '0'); while (digitp != dp->width_end); } } precision = 6; if (dp->precision_start != dp->precision_end) { if (dp->precision_arg_index != ARG_NONE) { int arg; if (!(a.arg[dp->precision_arg_index].type == TYPE_INT)) abort (); arg = a.arg[dp->precision_arg_index].a.a_int; precision = (arg < 0 ? 0 : arg); } else { const CHAR_T *digitp = dp->precision_start + 1; precision = 0; while (digitp != dp->precision_end) precision = xsum (xtimes (precision, 10), *digitp++ - '0'); } } switch (dp->conversion) { case 'd': case 'i': case 'u': # ifdef HAVE_LONG_LONG if (type == TYPE_LONGLONGINT || type == TYPE_ULONGLONGINT) tmp_length = (unsigned int) (sizeof (unsigned long long) * CHAR_BIT * 0.30103 /* binary -> decimal */ * 2 /* estimate for FLAG_GROUP */ ) + 1 /* turn floor into ceil */ + 1; /* account for leading sign */ else # endif if (type == TYPE_LONGINT || type == TYPE_ULONGINT) tmp_length = (unsigned int) (sizeof (unsigned long) * CHAR_BIT * 0.30103 /* binary -> decimal */ * 2 /* estimate for FLAG_GROUP */ ) + 1 /* turn floor into ceil */ + 1; /* account for leading sign */ else tmp_length = (unsigned int) (sizeof (unsigned int) * CHAR_BIT * 0.30103 /* binary -> decimal */ * 2 /* estimate for FLAG_GROUP */ ) + 1 /* turn floor into ceil */ + 1; /* account for leading sign */ break; case 'o': # ifdef HAVE_LONG_LONG if (type == TYPE_LONGLONGINT || type == TYPE_ULONGLONGINT) tmp_length = (unsigned int) (sizeof (unsigned long long) * CHAR_BIT * 0.333334 /* binary -> octal */ ) + 1 /* turn floor into ceil */ + 1; /* account for leading sign */ else # endif if (type == TYPE_LONGINT || type == 
TYPE_ULONGINT) tmp_length = (unsigned int) (sizeof (unsigned long) * CHAR_BIT * 0.333334 /* binary -> octal */ ) + 1 /* turn floor into ceil */ + 1; /* account for leading sign */ else tmp_length = (unsigned int) (sizeof (unsigned int) * CHAR_BIT * 0.333334 /* binary -> octal */ ) + 1 /* turn floor into ceil */ + 1; /* account for leading sign */ break; case 'x': case 'X': # ifdef HAVE_LONG_LONG if (type == TYPE_LONGLONGINT || type == TYPE_ULONGLONGINT) tmp_length = (unsigned int) (sizeof (unsigned long long) * CHAR_BIT * 0.25 /* binary -> hexadecimal */ ) + 1 /* turn floor into ceil */ + 2; /* account for leading sign or alternate form */ else # endif if (type == TYPE_LONGINT || type == TYPE_ULONGINT) tmp_length = (unsigned int) (sizeof (unsigned long) * CHAR_BIT * 0.25 /* binary -> hexadecimal */ ) + 1 /* turn floor into ceil */ + 2; /* account for leading sign or alternate form */ else tmp_length = (unsigned int) (sizeof (unsigned int) * CHAR_BIT * 0.25 /* binary -> hexadecimal */ ) + 1 /* turn floor into ceil */ + 2; /* account for leading sign or alternate form */ break; case 'f': case 'F': # ifdef HAVE_LONG_DOUBLE if (type == TYPE_LONGDOUBLE) tmp_length = (unsigned int) (LDBL_MAX_EXP * 0.30103 /* binary -> decimal */ * 2 /* estimate for FLAG_GROUP */ ) + 1 /* turn floor into ceil */ + 10; /* sign, decimal point etc. */ else # endif tmp_length = (unsigned int) (DBL_MAX_EXP * 0.30103 /* binary -> decimal */ * 2 /* estimate for FLAG_GROUP */ ) + 1 /* turn floor into ceil */ + 10; /* sign, decimal point etc. */ tmp_length = xsum (tmp_length, precision); break; case 'e': case 'E': case 'g': case 'G': case 'a': case 'A': tmp_length = 12; /* sign, decimal point, exponent etc. 
*/ tmp_length = xsum (tmp_length, precision); break; case 'c': # if defined HAVE_WINT_T && !WIDE_CHAR_VERSION if (type == TYPE_WIDE_CHAR) tmp_length = MB_CUR_MAX; else # endif tmp_length = 1; break; case 's': # ifdef HAVE_WCHAR_T if (type == TYPE_WIDE_STRING) { tmp_length = local_wcslen (a.arg[dp->arg_index].a.a_wide_string); # if !WIDE_CHAR_VERSION tmp_length = xtimes (tmp_length, MB_CUR_MAX); # endif } else # endif tmp_length = strlen (a.arg[dp->arg_index].a.a_string); break; case 'p': tmp_length = (unsigned int) (sizeof (void *) * CHAR_BIT * 0.25 /* binary -> hexadecimal */ ) + 1 /* turn floor into ceil */ + 2; /* account for leading 0x */ break; default: abort (); } if (tmp_length < width) tmp_length = width; tmp_length = xsum (tmp_length, 1); /* account for trailing NUL */ } if (tmp_length <= sizeof (tmpbuf) / sizeof (CHAR_T)) tmp = tmpbuf; else { size_t tmp_memsize = xtimes (tmp_length, sizeof (CHAR_T)); if (size_overflow_p (tmp_memsize)) /* Overflow, would lead to out of memory. */ goto out_of_memory; tmp = (CHAR_T *) malloc (tmp_memsize); if (tmp == NULL) /* Out of memory. */ goto out_of_memory; } #endif /* Construct the format string for calling snprintf or sprintf. 
*/ p = buf; *p++ = '%'; if (dp->flags & FLAG_GROUP) *p++ = '\''; if (dp->flags & FLAG_LEFT) *p++ = '-'; if (dp->flags & FLAG_SHOWSIGN) *p++ = '+'; if (dp->flags & FLAG_SPACE) *p++ = ' '; if (dp->flags & FLAG_ALT) *p++ = '#'; if (dp->flags & FLAG_ZERO) *p++ = '0'; if (dp->width_start != dp->width_end) { size_t n = dp->width_end - dp->width_start; memcpy (p, dp->width_start, n * sizeof (CHAR_T)); p += n; } if (dp->precision_start != dp->precision_end) { size_t n = dp->precision_end - dp->precision_start; memcpy (p, dp->precision_start, n * sizeof (CHAR_T)); p += n; } switch (type) { #ifdef HAVE_LONG_LONG case TYPE_LONGLONGINT: case TYPE_ULONGLONGINT: *p++ = 'l'; /*FALLTHROUGH*/ #endif case TYPE_LONGINT: case TYPE_ULONGINT: #ifdef HAVE_WINT_T case TYPE_WIDE_CHAR: #endif #ifdef HAVE_WCHAR_T case TYPE_WIDE_STRING: #endif *p++ = 'l'; break; #ifdef HAVE_LONG_DOUBLE case TYPE_LONGDOUBLE: *p++ = 'L'; break; #endif default: break; } *p = dp->conversion; #if USE_SNPRINTF p[1] = '%'; p[2] = 'n'; p[3] = '\0'; #else p[1] = '\0'; #endif /* Construct the arguments for calling snprintf or sprintf. */ prefix_count = 0; if (dp->width_arg_index != ARG_NONE) { if (!(a.arg[dp->width_arg_index].type == TYPE_INT)) abort (); prefixes[prefix_count++] = a.arg[dp->width_arg_index].a.a_int; } if (dp->precision_arg_index != ARG_NONE) { if (!(a.arg[dp->precision_arg_index].type == TYPE_INT)) abort (); prefixes[prefix_count++] = a.arg[dp->precision_arg_index].a.a_int; } #if USE_SNPRINTF /* Prepare checking whether snprintf returns the count via %n. 
*/ ENSURE_ALLOCATION (xsum (length, 1)); result[length] = '\0'; #endif for (;;) { size_t maxlen = allocated - length; int count = -1; #if USE_SNPRINTF int retcount = 0; # define SNPRINTF_BUF(arg) \ switch (prefix_count) \ { \ case 0: \ retcount = SNPRINTF (result + length, maxlen, buf, \ arg, &count); \ break; \ case 1: \ retcount = SNPRINTF (result + length, maxlen, buf, \ prefixes[0], arg, &count); \ break; \ case 2: \ retcount = SNPRINTF (result + length, maxlen, buf, \ prefixes[0], prefixes[1], arg, \ &count); \ break; \ default: \ abort (); \ } #else # define SNPRINTF_BUF(arg) \ switch (prefix_count) \ { \ case 0: \ count = sprintf (tmp, buf, arg); \ break; \ case 1: \ count = sprintf (tmp, buf, prefixes[0], arg); \ break; \ case 2: \ count = sprintf (tmp, buf, prefixes[0], prefixes[1],\ arg); \ break; \ default: \ abort (); \ } #endif switch (type) { case TYPE_SCHAR: { int arg = a.arg[dp->arg_index].a.a_schar; SNPRINTF_BUF (arg); } break; case TYPE_UCHAR: { unsigned int arg = a.arg[dp->arg_index].a.a_uchar; SNPRINTF_BUF (arg); } break; case TYPE_SHORT: { int arg = a.arg[dp->arg_index].a.a_short; SNPRINTF_BUF (arg); } break; case TYPE_USHORT: { unsigned int arg = a.arg[dp->arg_index].a.a_ushort; SNPRINTF_BUF (arg); } break; case TYPE_INT: { int arg = a.arg[dp->arg_index].a.a_int; SNPRINTF_BUF (arg); } break; case TYPE_UINT: { unsigned int arg = a.arg[dp->arg_index].a.a_uint; SNPRINTF_BUF (arg); } break; case TYPE_LONGINT: { long int arg = a.arg[dp->arg_index].a.a_longint; SNPRINTF_BUF (arg); } break; case TYPE_ULONGINT: { unsigned long int arg = a.arg[dp->arg_index].a.a_ulongint; SNPRINTF_BUF (arg); } break; #ifdef HAVE_LONG_LONG case TYPE_LONGLONGINT: { long long int arg = a.arg[dp->arg_index].a.a_longlongint; SNPRINTF_BUF (arg); } break; case TYPE_ULONGLONGINT: { unsigned long long int arg = a.arg[dp->arg_index].a.a_ulonglongint; SNPRINTF_BUF (arg); } break; #endif case TYPE_DOUBLE: { double arg = a.arg[dp->arg_index].a.a_double; SNPRINTF_BUF (arg); } break; 
#ifdef HAVE_LONG_DOUBLE case TYPE_LONGDOUBLE: { long double arg = a.arg[dp->arg_index].a.a_longdouble; SNPRINTF_BUF (arg); } break; #endif case TYPE_CHAR: { int arg = a.arg[dp->arg_index].a.a_char; SNPRINTF_BUF (arg); } break; #ifdef HAVE_WINT_T case TYPE_WIDE_CHAR: { wint_t arg = a.arg[dp->arg_index].a.a_wide_char; SNPRINTF_BUF (arg); } break; #endif case TYPE_STRING: { const char *arg = a.arg[dp->arg_index].a.a_string; SNPRINTF_BUF (arg); } break; #ifdef HAVE_WCHAR_T case TYPE_WIDE_STRING: { const wchar_t *arg = a.arg[dp->arg_index].a.a_wide_string; SNPRINTF_BUF (arg); } break; #endif case TYPE_POINTER: { void *arg = a.arg[dp->arg_index].a.a_pointer; SNPRINTF_BUF (arg); } break; default: abort (); } #if USE_SNPRINTF /* Portability: Not all implementations of snprintf() are ISO C 99 compliant. Determine the number of bytes that snprintf() has produced or would have produced. */ if (count >= 0) { /* Verify that snprintf() has NUL-terminated its result. */ if (count < maxlen && result[length + count] != '\0') abort (); /* Portability hack. */ if (retcount > count) count = retcount; } else { /* snprintf() doesn't understand the '%n' directive. */ if (p[1] != '\0') { /* Don't use the '%n' directive; instead, look at the snprintf() return value. */ p[1] = '\0'; continue; } else { /* Look at the snprintf() return value. */ if (retcount < 0) { /* HP-UX 10.20 snprintf() is doubly deficient: It doesn't understand the '%n' directive, *and* it returns -1 (rather than the length that would have been required) when the buffer is too small. */ size_t bigger_need = xsum (xtimes (allocated, 2), 12); ENSURE_ALLOCATION (bigger_need); continue; } else count = retcount; } } #endif /* Attempt to handle failure. */ if (count < 0) { if (!(result == resultbuf || result == NULL)) free (result); if (buf_malloced != NULL) free (buf_malloced); CLEANUP (); errno = EINVAL; return NULL; } #if !USE_SNPRINTF if (count >= tmp_length) /* tmp_length was incorrectly calculated - fix the code above! 
*/ abort (); #endif /* Make room for the result. */ if (count >= maxlen) { /* Need at least count bytes. But allocate proportionally, to avoid looping eternally if snprintf() reports a too small count. */ size_t n = xmax (xsum (length, count), xtimes (allocated, 2)); ENSURE_ALLOCATION (n); #if USE_SNPRINTF continue; #endif } #if USE_SNPRINTF /* The snprintf() result did fit. */ #else /* Append the sprintf() result. */ memcpy (result + length, tmp, count * sizeof (CHAR_T)); if (tmp != tmpbuf) free (tmp); #endif length += count; break; } } } } /* Add the final NUL. */ ENSURE_ALLOCATION (xsum (length, 1)); result[length] = '\0'; if (result != resultbuf && length + 1 < allocated) { /* Shrink the allocated memory if possible. */ CHAR_T *memory; memory = (CHAR_T *) realloc (result, (length + 1) * sizeof (CHAR_T)); if (memory != NULL) result = memory; } if (buf_malloced != NULL) free (buf_malloced); CLEANUP (); *lengthp = length; if (length > INT_MAX) goto length_overflow; return result; length_overflow: /* We could produce such a big string, but its length doesn't fit into an 'int'. POSIX says that snprintf() fails with errno = EOVERFLOW in this case. */ if (result != resultbuf) free (result); errno = EOVERFLOW; return NULL; out_of_memory: if (!(result == resultbuf || result == NULL)) free (result); if (buf_malloced != NULL) free (buf_malloced); out_of_memory_1: CLEANUP (); errno = ENOMEM; return NULL; } } #undef SNPRINTF #undef USE_SNPRINTF #undef PRINTF_PARSE #undef DIRECTIVES #undef DIRECTIVE #undef CHAR_T #undef VASNPRINTF int vasprintf (char **resultp, const char *format, va_list args) { size_t length; char *result = vasnprintf (NULL, &length, format, args); if (result == NULL) return -1; *resultp = result; /* Return the number of resulting bytes, excluding the trailing NUL. If it wouldn't fit in an 'int', vasnprintf() would have returned NULL and set errno to EOVERFLOW. */ return length; } #endif /** \endcond */ //Doxygen ignore. 
apophenia-1.0+ds/cmd/000077500000000000000000000000001262736346100144755ustar00rootroot00000000000000apophenia-1.0+ds/cmd/Makefile.am000066400000000000000000000003101262736346100165230ustar00rootroot00000000000000#The programs bin_PROGRAMS = \ apop_text_to_db \ apop_db_to_crosstab \ apop_plot_query AM_CFLAGS = \ -I $(top_srcdir) \ $(GSL_CFLAGS) LDADD = \ $(top_builddir)/libapophenia.la \ $(GSL_LIBS) apophenia-1.0+ds/cmd/apop_db_to_crosstab.c000066400000000000000000000040621262736346100206510ustar00rootroot00000000000000/** \file Command line utility to convert a three-column table to a crosstab.*/ /*Copyright (c) 2005--2007, 2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include <unistd.h> int main(int argc, char **argv){ int c; char verbose=0; char const *msg="Usage: %s [opts] dbname table_name rows columns data\n" "\n" "A command-line wrapper for the apop_db_to_crosstab function.\n" "See Apophenia's online documentation for that function for details and tricks.\n" "The default for the data column is a count [count(*)]\n" "The column is optional; leave it out if you want a single-dimensional crosstab.\n" "If you need a non-default data column but want a 1-D crosstab, use 1 as your column.\n" "\n" " -d\tdelimiter (default: )\n" " -v\tverbose: prints status info on stderr\n" " -v -v\tvery verbose: also print queries executed on stderr\n" " -h\tdisplay this help and exit\n" "\n"; apop_opts.verbose=0; //so don't print queries until -v -v.
while ((c = getopt (argc, argv, "d:f:hv-")) != -1) if (c=='d') strcpy(apop_opts.output_delimiter,optarg); else if (c=='h'||c=='-') {printf(msg, argv[0]); exit(0);} else if (c=='v') { verbose++; apop_opts.verbose++; } Apop_stopif(optind+2 > argc, return 1, 0, "I need at least two arguments past the options: database table [optional rowcol] [optional columncol] [optional datacol]"); /* argv[optind+k] exists only when optind+k < argc. */ _Bool no_rowcol = optind+2 >= argc; _Bool no_columncol = optind+3 >= argc; _Bool no_datacol = optind+4 >= argc; char *rowcol = no_rowcol ? "1" : argv[optind+2]; char *colcol = no_columncol ? "1" : argv[optind+3]; char *datacol = no_datacol ? NULL: argv[optind+4]; if (verbose){ fprintf(stderr, "database:%s\ntable: %s\nrow col: %s\ncol col:%s%s%s\n---------\n", argv[optind], argv[optind +1], rowcol, colcol, no_datacol ?"":"\ndata col:", no_datacol ? "" : datacol); } apop_db_open(argv[optind]); apop_data *m = apop_db_to_crosstab(argv[optind +1], rowcol, colcol, datacol); apop_data_print(m); } apophenia-1.0+ds/cmd/apop_plot_query.c000066400000000000000000000115021262736346100200620ustar00rootroot00000000000000/** \file Command line utility to take in a query and produce a plot of its output via Gnuplot. Copyright (c) 2006--2007 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ #include "apop_internal.h" #include <unistd.h> /** This convenience function will take in a \c gsl_vector of data and put out a histogram, ready to pipe to Gnuplot. \param data A \c gsl_vector holding the data. Do not pre-sort or bin; this function does that for you via apop_data_to_bins. \param f The stream to which the Gnuplot commands and the binned data are written. \param bin_count The number of bins in the output histogram (if you send zero, I set this to \f$\sqrt{N}\f$, where \f$N\f$ is the length of the vector.) \param with The method for Gnuplot's plotting routine. Default is \c "impulses", so the gnuplot call will read plot '-' with impulses. The \c "lines" option is also popular, and you can add extra terms if desired, like "boxes linetype 3".
*/ void plot_histogram(gsl_vector *data, FILE *f, size_t bin_count, char *with){ Apop_stopif(!data, return, 0, "Input vector is NULL."); if (!with) with="impulses"; apop_data vector_as_data = (apop_data){.vector=data}; apop_data *histodata = apop_data_to_bins(&vector_as_data, .bin_count=bin_count, .close_top_bin='y'); apop_data_sort(histodata); apop_data_free(histodata->more); /* the binspec. */ fprintf(f, "set key off ;\n" "plot '-' with %s\n", with); apop_data_print(histodata, .output_pipe=f); fprintf(f, "e\n"); fflush(f); apop_data_free(histodata); } char *plot_type = NULL; int histobins = 0; int histoplotting = 0; FILE *open_output(char *outfile, int sf){ FILE *f; if (sf && !strcmp (outfile, "-")) return stdout; if (sf && outfile){ f = fopen(outfile, "w"); Apop_stopif(!f, exit(0), 0, "Trouble opening %s.", outfile); return f; } f = popen("`which gnuplot` -persist", "w"); Apop_stopif(!f, exit(0), 0, "Trouble opening %s.", "gnuplot"); return f; } char *read_query(char *infile){ char in[1000]; char *q = malloc(10); q[0] = '\0'; FILE *inf = fopen(infile, "r"); Apop_stopif(!inf, exit(0), 0, "Trouble opening %s. Look into that.\n", infile); while(fgets(in, 1000, inf)){ q = realloc(q, strlen(q) + strlen(in) + 4); /* Append with strcat: sprintf(q, "%s%s", q, in) has overlapping source and destination, which is undefined behavior. */ strcat(q, in); } strcat(q, ";\n"); fclose(inf); return q; } gsl_matrix *query(char *d, char *q, int no_plot){ apop_db_open(d); apop_data *result = apop_query_to_data("%s", q); apop_db_close(0); Apop_stopif(!result && !no_plot, exit(2), 0, "Your query returned a blank table. Quitting."); /* result may be NULL here when no_plot is set, so check it before dereferencing. */ Apop_stopif(result && result->error, exit(2), 0, "Error running your query.
Quitting."); if (no_plot){ apop_data_show(result); exit(0); } return result->matrix; } void print_out(FILE *f, char *outfile, gsl_matrix *m){ if (!histoplotting){ fprintf(f,"plot '-' with %s\n", plot_type); apop_matrix_print(m, NULL, .output_type='p', .output_pipe=f); } else { Apop_col_v(&(apop_data){.matrix=m}, 0, v); plot_histogram(v, f, histobins, NULL); } if (outfile) fclose(f); } int main(int argc, char **argv){ int c; char *q = NULL, *d = NULL, *outfile = NULL; int sf = 0, no_plot = 0; const char* msg= "Usage: %s [opts] dbname query\n" "\n" "Runs a query, and pipes the output directly to gnuplot. Use -f to dump to stdout or a file.\n" " -d\tdatabase to use (mandatory)\n" " -q\tquery to run (mandatory or use -Q)\n" " -Q\tfile from which to read the query\t\t\n" " -n\tno plot: just run the query and display results to stdout\t\t\n" " -t\tplot type (points, bars, ...) (default: \"lines\")\n" " -H\tplot histogram with this many bins (e.g., -H100) (to let the system auto-select bin sizes, use -H0)\n" " -f\tfile to dump to. 
If -f- then use stdout (default: pipe to Gnuplot)\n" " -h\tdisplay this help and exit\n" "\n"; Apop_stopif(argc<2, return 1, 0, msg, argv[0]); while ((c = getopt (argc, argv, "ad:f:hH:nQ:q:st:-")) != -1) if (c=='f'){ outfile = strdup(optarg); sf++; } else if (c=='H'){ histoplotting = 1; histobins = atoi(optarg); } else if (c=='h'||c=='-') { printf(msg, argv[0]); return 0; } else if (c=='d') d = strdup(optarg); else if (c=='n') no_plot ++; else if (c=='Q') q = read_query(optarg); else if (c=='q') q = strdup(optarg); else if (c=='t') plot_type = strdup(optarg); if (optind == argc -2){ d = argv[optind]; q = argv[optind+1]; } else if (optind == argc-1) q = argv[optind]; Apop_stopif(!q, return 1, 0, "I need a query specified with -q.\n"); if (!plot_type) plot_type = strdup("lines"); FILE *f = open_output(outfile, sf); gsl_matrix *m = query(d, q, no_plot); print_out(f, outfile, m); } apophenia-1.0+ds/cmd/apop_text_to_db.c000066400000000000000000000073771262736346100200310ustar00rootroot00000000000000/** \file A command line script to read a text file into a database. Copyright (c) 2006--2007, 2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. 
*/

#include "apop_internal.h"
#include <unistd.h> //getopt, optarg, optind

int *break_down(char *in){
    int *out = NULL;
    int ctr = 0;
    char *cp = strtok (in, ",");
    while (cp != NULL) {
        out = realloc(out, sizeof(int)*(ctr+1));
        out[ctr++] = atoi(cp);
        cp = strtok (NULL, ",");
    }
    return out;
}

int main(int argc, char **argv){
    int c;
    char *msg;
    int colnames = 'y',
        rownames = 0,
        tab_exists_check = 0;
    char **field_names = NULL;
    Asprintf(&msg, "Usage: %s [-d delimiters] text_file table_name dbname\n"
"\n"
"If the input text file name is a single dash, -, then read from STDIN.\n"
"Input must be plain ASCII or UTF-8.\n"
" -d\t\tthe single-character delimiters to use, e.g., -d \" ,\" or -d \"\\t\" (which you \n"
" \t\t\twill almost certainly have to write as -d \"\\\\t\") (default: \"|,\\t\", meaning \n"
" \t\t\tthat any of a pipe, comma, or tab will delimit separate entries)\n"
" -nc\t\tdata does not include column names\n"
" -n regex\t\tcase-insensitive regular expression indicating Null values (default: NaN)\n"
" -m\t\tuse a MySQL database (default: SQLite)\n"
" -f\t\tfixed width field ends: -f\"3,8,12,17\" (first char is one, not zero)\n"
" -u\t\tmysql username\n"
" -p\t\tmysql password\n"
" -r\t\tdata includes row names\n"
" -v\t\tverbosity\n"
" -N\t\ta comma-separated list of column names: -N\"apple,banana,carrot,durian\"\n"
" -en\t\tif table exists, do nothing and exit\n"
" -ed\t\tif table exists, retain the table, delete all data, refill with the new data (i.e., call 'delete * from your_table')\n"
" -eo\t\tif table exists, overwrite the table from scratch (deleting the previous table entirely)\n"
" -ea\t\tif table exists, append new data to the existing table\n"
" -h\t\tdisplay this help and exit\n"
"\n"
, argv[0]);
    int * field_list = NULL;
    char if_exists = 'n';

    if(argc<3){
        printf("%s", msg);
        return 0;
    }
    while ((c = getopt (argc, argv, "n:d:e:f:hmp:ru:vN:O")) != -1)
        if (c=='n') {
            if (optarg[0]=='c') colnames='n';
            else apop_opts.nan_string = optarg;
        } else if (c=='N') {
            apop_data *field_name_data;
            apop_regex(optarg, " *([^,]*[^ ]) *(,|$) *", &field_name_data);
            Apop_stopif(!field_name_data, return 1, 0, "'%s' should be a "
                    "comma-delimited list of field names, but I had trouble "
                    "parsing it as such.", optarg);
            apop_data_transpose(field_name_data);
            field_names = field_name_data->text[0];
        } else if (c=='d') strcpy(apop_opts.input_delimiters, optarg);
        else if (c=='f') field_list = break_down(optarg);
        else if (c=='h') {printf("%s", msg); return 0;}
        else if (c=='m') apop_opts.db_engine = 'm';
        else if (c=='u') strcpy(apop_opts.db_user, optarg);
        else if (c=='p') strcpy(apop_opts.db_pass, optarg);
        else if (c=='r') rownames++;
        else if (c=='v') apop_opts.verbose=2;
        else if (c=='O') tab_exists_check++; //deprecated as of December 2013.
        else if (c=='e') {
            if (optarg[0]=='n') if_exists='n'; //the default anyway.
            else if (optarg[0]=='d') if_exists='d';
            else if (optarg[0]=='a') if_exists='a';
            else if (optarg[0]=='o') {if_exists='o'; tab_exists_check++; }
        }
    apop_db_open(argv[optind + 2]);
    if (tab_exists_check) apop_table_exists(argv[optind+1],1);
    apop_query("begin");
    apop_text_to_db(argv[optind], argv[optind+1], rownames, colnames, field_names,
                    .field_ends=field_list, .if_table_exists=if_exists);
    apop_query("commit");
}
apophenia-1.0+ds/configure.ac000066400000000000000000000046461262736346100162320ustar00rootroot00000000000000
# Process this file with autoconf to produce a configure script.

m4_define([m4_apop_version], [m4_esyscmd_s(date +%Y%m%d)]) #will switch to this soon.

AC_PREREQ(2.60)
AC_INIT([apophenia], [1.0], [fluffmail@f-m.fm])
AM_SILENT_RULES([yes])
AC_CONFIG_SRCDIR([apop_arms.c])
AC_CONFIG_AUX_DIR([build-aux])
AC_CONFIG_HEADER([config.h])
AC_CONFIG_MACRO_DIR([m4])
AM_INIT_AUTOMAKE
AM_MAINTAINER_MODE

# The normal /usr/local default confused too many people
##AC_PREFIX_DEFAULT([/usr])

# libtool:
LT_INIT

# check linker script support
gl_LD_VERSION_SCRIPT

# Checks for programs.
AC_PROG_CC
AC_PROG_CC_C99
ACX_PTHREAD
AC_OPENMP

# Checks for libraries.
## math library LT_LIB_M ## GNU Scientific Library (GSL) AX_PATH_GSL([1.12.0],[],[AC_MSG_ERROR(could not find required version of GSL)]) ## DataBase system libraries #### MySQL library AX_LIB_MYSQL #### SQLite3 library AX_LIB_SQLITE3 # Checks for header files. AC_FUNC_ALLOCA AC_HEADER_STDC AC_CHECK_HEADERS([float.h inttypes.h limits.h stddef.h stdint.h stdlib.h string.h unistd.h wchar.h]) # Checks for typedefs, structures, and compiler characteristics. AC_C_CONST AC_C_INLINE AC_TYPE_SIZE_T AC_STRUCT_TM AC_CHECK_TYPES([ptrdiff_t]) AX_C___ATTRIBUTE__ #Some versions of GCC support atomics iff OpenMP is off. export CFLAGS="$CFLAGS $OPENMP_CFLAGS" AC_RUN_IFELSE( [AC_LANG_SOURCE([int main(){ _Atomic(int) i; } ])] , [AC_SUBST([Autoconf_no_atomics], 0)] , [AC_SUBST([Autoconf_no_atomics], 1)] , [AC_SUBST([Autoconf_no_atomics], 1)] ) # Checks for library functions. AC_FUNC_MALLOC AC_FUNC_REALLOC AC_FUNC_STRTOD AC_CHECK_FUNCS([floor memset pow regcomp sqrt strcasecmp asprintf]) # Checks for tests tools AC_PATH_PROGS([BC],[bc],[/usr/bin/bc]) AC_PATH_PROGS([SQLITE3],[sqlite3],[/usr/bin/sqlite3]) # Run only the basic tests unless asked to run the full suite. 
AC_MSG_CHECKING([whether to run extended tests]) AC_ARG_ENABLE([extended-tests], [AS_HELP_STRING([--enable-extended-tests], [run numeric torture tests (may be slow)])], [enable_extended_tests="yes"], [enable_extended_tests="no"]) AC_MSG_RESULT([$enable_extended_tests]) AM_CONDITIONAL([EXTENDED_TESTS], [test "X$enable_extended_tests" != "Xno"]) AC_CONFIG_FILES([ apophenia.pc apop.h Makefile transform/Makefile model/Makefile cmd/Makefile tests/Makefile docs/doxygen.conf docs/Makefile eg/Makefile ]) AC_CONFIG_FILES([ tests/utilities_test ], [ chmod a+x tests/utilities_test ]) AC_OUTPUT ## ## eof apophenia-1.0+ds/docs/000077500000000000000000000000001262736346100146625ustar00rootroot00000000000000apophenia-1.0+ds/docs/Makefile.am000066400000000000000000000034351262736346100167230ustar00rootroot00000000000000 #The Doxygen documentation, which you'll have to call yourself (via make doc). GVZDOT = /usr/bin/dot default: ## adhoc html man: doc apophenia_CSOURCES = \ $(top_srcdir)/model/*.c \ $(top_srcdir)/transform/*.c echo #$(wildcard $(top_srcdir)/model/*.c) \ echo #$(wildcard $(top_srcdir)/transform/*.c) apophenia_DOTS = \ structs.dot apophenia_IMAGES = \ flake.gif \ search.gif \ right.png \ down.png apophenia_IMAGES_EXTRA = \ triangle.png \ model.png apophenia_JS = \ tree.js apophenia_IMAGES_GENERATED = \ structs.png model_doc.h: $(apophenia_CSOURCES) cat $^ | awk -f $(top_srcdir)/docs/make_model_doc.awk > $@ doc: documentation.h model_doc.h $(apophenia_IMAGES) $(apophenia_JS) $(apophenia_IMAGES_GENERATED) $(MKDIR_P) include sed -e 's/__attribute.*;/;/;s/extern //' $(top_builddir)/apop.h > include/apop.h doxygen doxygen.conf for f in $(apophenia_IMAGES) $(apophenia_JS) ; do \ test -f $(top_srcdir)/docs/$$f && cp $(top_srcdir)/docs/$$f html/ ;\ done cp $(apophenia_IMAGES_GENERATED) html/ sed -i -f $(top_srcdir)/docs/edit_outline.sed html/index.html html/outline.html sed -i -f $(top_srcdir)/docs/edit_group.sed html/group__models.html sed -i -f 
$(top_srcdir)/docs/edit_width.sed html/*.html $(abs_top_srcdir)/docs/adjust doc-clean: -rm -rf include html latex man install-html-local: doc cp -prd html $(DESTDIR)$(docdir) maintainer-clean-local: doc-clean CLEANFILES = \ missing_model_parts MAINTAINERCLEANFILES = \ model_doc.h \ doxygen.log EXTRA_DIST = \ adjust \ make_model_doc.awk \ $(apophenia_DOTS) \ $(apophenia_IMAGES) \ $(apophenia_IMAGES_EXTRA) \ $(apophenia_JS) \ edit_outline.sed edit_globals.sed edit_group.sed edit_width.sed \ apop_data_fig.html head.html foot.html \ typical.css \ documentation.h %.png : %.dot $(GVZDOT) -Tpng < $< > $@ apophenia-1.0+ds/docs/adjust000077500000000000000000000061241262736346100161050ustar00rootroot00000000000000#Use GNU sed to make modifications to the LaTeX version before producing a PDF #But first, some HTML tweaks: # Background color of side pane should match rest of document sed -i "/background-color/ s|:.*|: #FFFFDE;|" html/navtree.css sed -i "/nav_h.png/ d" html/navtree.css # We no longer live in an ASCII-only world. sed -i 's/pdflatex/xelatex/' latex/Makefile #PDF: a 170pp document shouldn't be typeset in sans serif sed -i "/familydefault.*sfdefault/ d" latex/refman.tex #No fancyheaders sed -i -e "/fancy/d" -e '/footrulewidth/d' -e "/Generated by Doxygen/ d" latex/refman.tex ##Set title, for headers and such. #sed '/begin.document/i #\\title[Apophenia]{Apophenia}' latex/refman.tex # Nicer lists. 
sed -i '/begin.document/a\ \\setlength\\itemsep{0pt} \\renewcommand\\labelitemi{$\\circ$} \\renewcommand\\labelitemii{$\\circ$} \\renewcommand\\labelitemiii{$\\cdot$}' latex/refman.tex #I can't make the index look not-stupid sed -i '/makeindex/d' latex/refman.tex sed -i '/makeindex/d' latex/Makefile #Fix enumerations -> model documentation sed -i -f edit_group.sed latex/group__models.tex #Rm header list sed -i -e '/subsection.*Enumer/d' -e'/begin{DoxyCompactItemize}/,/end{DoxyCompactItemize}/d' latex/group__models.tex ##sed -i -e 's/\\cline{.*}$//' -e's/begin{TabularN*C}.*/begin{tabular}{ll}/' -e's/end{TabularN*C}.*/end{tabular}/' latex/group__models.tex #sed -i -e's/\(TabularC}\){2}/\1{3}/' latex/group__models.tex #redundant with the struct defns. fdline=$((`grep -n 'subsection.Function Documentation.' latex/group__all__public.tex|sed 's/:.*//'`-1)) sed -i "/subsection.Typedef Documentation./,$fdline d" latex/group__all__public.tex #Unclutter the ToC by using section*=omit from ToC sed -i -e 's/subsection{Detailed Description}/subsection*{Detailed Description}/' -e 's/subsection{Field Documentation}/subsection*{Field Documentation}/' latex/*tex #Bare references to subpages #sed -i -e's|^\\hyperlink.*hypertarget.*subsection.*label{[^}]*}$||' latex/*tex sed -i -e's|^\\hyperlink{[^}]*}{[^}]*}$||' latex/*tex #the hyperlink insertion into sample code sometimes creates needless newlines with a #six-space indent. sed -i '$!N;s/\n \(\\hyperlink\)/\1/;P;D' latex/*tex #code looks too small to me. Go one step bigger. 
sed -i -e 's/scriptsize/footnotesize/' latex/doxygen.sty #paragraphs have a lot of space after the header; tighten #sed -i -e'/DoxyParagraph/,+15 s|\(end{list}\)|\1\\vspace{-10pt}|' latex/doxygen.sty #offset description labels by a few cm sed -i '$a\ \\renewcommand{\\descriptionlabel}[1]{\\hspace{-1.5cm}\\emph{#1}}' latex/doxygen.sty sed -i -e 's/xtab\*{\\linewidth}/xtab/g' -e 's/xtab\(ular\|\)/supertabular/g' -e 's/\\tablefirsthead.*}//' -e '/\\par%/d' latex/doxygen.sty #sed latex/doxygen.sty #Want search box in header sed -i -e '/position: *absolute/d' html/search/search.css sed -i -e 's/display: *block/display: inline/' html/search/search.css #magnifying glass is hard-coded, and span="left" covers left and middle of search box sed -i -e '/mag_sel.png/,+3d' html/*html #100% of the search results are doubled. Can't work out why. sed -i -e 's/\],\[[^]]\+\]/]/' html/search/*js apophenia-1.0+ds/docs/apop_data_fig.html000066400000000000000000000021201262736346100203200ustar00rootroot00000000000000
Rowname    Vector     Matrix                               Text               Weights
           Outcome    Age   Weight (kg)   Height (cm)     Sex      State
"Steven"   1          32    65            175             Male     Alaska    1
"Sandra"   0          41    61            165             Female   Alabama   3.2
"Joe"      1          40    73            181             Male     Alabama   2.4
apophenia-1.0+ds/docs/documentation.h000066400000000000000000004762101262736346100177160ustar00rootroot00000000000000/* Apophenia's narrative documentation Copyright (c) 2005--2013 by Ben Klemens. Licensed under the GPLv2; see COPYING. */ /** \mainpage Welcome Apophenia is an open statistical library for working with data sets and statistical models. It provides functions on the same level as those of the typical stats package (such as OLS, Probit, or singular value decomposition) but gives the user more flexibility to be creative in model-building. The core functions are written in C, but experience has shown them to be easy to bind to in Python/Julia/Perl/Ruby/&c. It is written to scale well, to comfortably work with gigabyte data sets, million-step simulations, or computationally-intensive agent-based models. The goods The library has been growing and improving since 2005, and has been downloaded well over 1e4 times. To date, it has over two hundred functions and macros to facilitate statistical computing, such as: \li OLS and family, discrete choice models like Probit and Logit, kernel density estimators, and other common models. \li Functions for transforming models (like Normal \f$\rightarrow\f$ truncated Normal) and combining models (produce the cross-product of that truncated Normal with three others, or use Bayesian updating to combine that cross-product prior with an OLS likelihood to produce a posterior distribution over the OLS parameters). \li Database querying and maintenance utilities. \li Data manipulation tools for splitting, stacking, sorting, and otherwise shunting data sets. \li Moments, percentiles, and other basic stats utilities. \li \f$t\f$-tests, \f$F\f$-tests, et cetera. \li Several optimization methods available for your own new models. \li It does not re-implement basic matrix operations or build yet another database engine. Instead, it builds upon the excellent GNU Scientific and SQLite libraries. MySQL/mariaDB is also supported. 
For the full list of macros, functions, and prebuilt models, check the index. Download Apophenia here. Most users will want to download the latest packaged version linked from the Download Apophenia here header. See the \ref setup page for detailed setup instructions, including how to use your package manager to install the Debian package. The documentation To start off, have a look at this \ref gentle "Gentle Introduction" to the library. The outline gives a more detailed narrative. The index lists every function in the library, with detailed reference information. Notice that the header to every page has a link to the outline and the index. To really go in depth, download or pick up a copy of Modeling with Data, which discusses general methods for doing statistics in C with the GSL and SQLite, as well as Apophenia itself. A Useful Algebraic System of Statistical Models (PDF) discusses some of the theoretical structures underlying the library. There is a wiki with some convenience functions, tips, and so on. Notable features Much of what Apophenia does can be done in any typical statistics package. The \ref apop_data element is much like an R data frame, for example, and there is nothing special about being able to invert a matrix or take the product of two matrices with a single function call (\ref apop_matrix_inverse and \ref apop_dot, respectively). Even more advanced features like Loess smoothing (\ref apop_loess) and the Fisher Exact Test (\ref apop_test_fisher_exact) are not especially Apophenia-specific. But here are some things that are noteworthy. \li It's a C library! You can build applications using Apophenia for the data-processing back-end of your program, and not worry about the overhead associated with scripting languages. For example, it is currently used in production for certain aspects of processing for the U.S. Census Bureau's American Community Survey. 
And the numeric routines in your favorite scripting language typically have a back-end in plain C; perhaps Apophenia can facilitate writing your next one. \li The \ref apop_model object allows for consistent treatment of distributions, regressions, simulations, machine learning models, and who knows what other sorts of models you can dream up. By transforming and combining existing models, it is easy to build complex models from simple sub-models. \li For example, the \ref apop_update function does Bayesian updating on any two well-formed models. If they are on the table of conjugates, that is correctly handled, and if they are not, an appropriate variant of MCMC produces an empirical distribution. The output is yet another model, from which you can make random draws, or which you can use as a prior for another round of Bayesian updating. Outside of Bayesian updating, the \ref apop_model_metropolis function is good for approximating other complex models. \li The maximum likelihood system combines several subsystems into one form: it will do a few flavors of conjugate gradient search, Nelder-Mead Simplex, Newton's Method, or Simulated Annealing. You pick the method by a setting attached to your model. If you want to use a method that requires derivatives and you don't have a closed-form derivative, the ML subsystem will estimate a numerical gradient for you. If you would like to do EM-style maximization (all but the first parameter are fixed, that parameter is optimized, then all but the second parameter are fixed, that parameter is optimized, ..., looping through dimensions until the change in objective across cycles is less than eps), add a settings group specifying the tolerance at which the cycle should stop: Apop_settings_add_group(your_model, apop_mle, .dim_cycle_tolerance=eps). \li The Iterative Proportional Fitting algorithm, \ref apop_rake, is best-in-breed, designed to handle large, sparse matrices. Contribute! \li Develop a new model object. 
\li Contribute your favorite statistical routine. \li Package Apophenia into an RPM, portage, cygwin package. \li Report bugs or suggest features. \li Write bindings for your preferred language. For example, here are a Perl wrapper and early versions of a Julia wrapper and an R wrapper which you could expand upon. If you're interested, write to the maintainer (Ben Klemens), or join the GitHub project. */ /** \page eg Some examples Here are a few pieces of sample code for testing your installation or to give you a sense of what code with Apophenia's tools looks like. Two data streams The sample program here is intended to show how one would integrate Apophenia into an existing program. For example, say that you are running a simulation of two different treatments, or say that two sensors are posting data at regular intervals. The goal is to gather the data in an organized form, and then ask questions of the resulting data set. Below, a thousand draws are made from the two processes and put into a database. Then, the data is pulled out, some simple statistics are compiled, and the data is written to a text file for inspection outside of the program. This program will compile cleanly with the sample \ref makefile. \include draw_to_db.c Run a regression See \ref gentle for an example of loading a data set and running a simple regression. A sequence of t-tests In \ref mapply "The section on map/apply", a new \f$t\f$-test on every row, with all operations acting on entire rows rather than individual data points: \include t_test_by_rows.c In the documentation for \ref apop_query_to_text, a program to list all the tables in an SQLite database. 
\include ls_tables.c

Marginal distribution

A demonstration of fixing parameters to create a marginal distribution, via \ref apop_model_fix_params

\include fix_params.c
*/

/** \page setup Setting up

\section cast The supporting cast

To use Apophenia, you will need to have a working C compiler, the GSL (v1.12 or higher, the minimum that the configure script checks for) and SQLite installed. MySQL/MariaDB is optional.

\li Some readers may be unfamiliar with modern package managers and common methods for setting up a C development environment; see Appendix O of Modeling with Data for an introduction.
\li Other pages in this documentation have a few more notes for \ref windows "Windows" users, including \ref mingw users.
\li Install the basics using your package manager. E.g., try

\code
sudo apt-get install make gcc libgsl0-dev libsqlite3-dev
\endcode

or

\code
sudo yum install make gcc gsl-devel libsqlite3x-devel
\endcode

\li If you use a Debian-based system (including Ubuntu), try the version in Debian's Stretch distribution:

\code
#Add Stretch to your sources list
sudo sed -i '$a deb http://httpredir.debian.org/debian stretch main' /etc/apt/sources.list
sudo sed -i '$a deb-src http://httpredir.debian.org/debian stretch main' /etc/apt/sources.list
sudo apt-get update

#Get Apophenia
sudo apt-get install apophenia-bin

#Optional: remove Stretch from your sources list (deletes the two lines added above).
sudo sed -i '$d' /etc/apt/sources.list
sudo sed -i '$d' /etc/apt/sources.list
\endcode

Thanks to Jerome Benoit for getting the library packaged and adjusted to Debian standards.

\li If the Debian package is not for you, Download Apophenia here. Once you have the library downloaded, compile it using

\code
tar xvzf apop*tgz && cd apophenia-1.0
./configure && make && make check && sudo make install
\endcode

If you decide not to keep the library on your system, run sudo make uninstall from the source directory to remove it.

\li If you need to install packages in your home directory because you don't have root permissions, see the \ref notroot page.
\li A \ref makefile will help immensely when you want to compile your program.
\li You can verify that your setup works by trying some \ref eg "sample programs". \subpage notroot \subpage makefile \subpage windows \li Those who would like to work on a cutting-edge copy of the source code can get the latest version by cutting and pasting the following onto the command line. If you follow this route, be sure to read the development README in the apophenia directory this command will create. \code git clone https://github.com/b-k/apophenia.git \endcode */ /** \page windows Windows \ref mingw users, see that page. If you have a choice, Cygwin is strongly recommended. The setup program is very self-explanatory. As a warning, it will probably take up >300MB on your system. You should install at least the following programs: \li autoconf/automake \li binutils \li gcc \li gdb \li gnuplot -- for plotting data \li groff -- needed for the man program, below \li gsl -- the engine that powers Apophenia \li less -- to read text files \li libtool -- needed for compiling programs \li make \li man -- for reading help files \li more -- not as good as less but still good to have \li sqlite3 -- a simple database engine, a requisite for Apophenia If you are missing anything else, the program will probably tell you. The following are not necessary but are good to have on hand as long as you are going to be using Unix and programming. \li git -- to partake in the versioning system \li emacs -- steep learning curve, but people love it \li ghostscript (for reading .ps/.pdf files) \li openssh -- needed for git \li perl, python, ruby -- these are other languages that you might also be interested in \li tetex -- write up your documentation using the nicest-looking formatter around \li X11 -- a windowing system X-Window will give you a nicer environment in which to work. After you start Cygwin, type startx to bring up a more usable, nice-looking terminal (and the ability to do a few thousand other things which are beyond the scope of this documentation). 
Once you have Cygwin installed and a good terminal running, you can follow along with the remainder of the discussion without modification.

Some older versions of Cygwin have a \c search.h file which doesn't include the function lsearch(). If this is the case on your system, you will have to update your Cygwin installation.

Finally, Windows compilers often spit out lines like:

\code
Info: resolving _gsl_rng_taus by linking to __imp__gsl_rng_taus (auto-import)
\endcode

These lines are indeed just information, and not errors. Feel free to ignore them.

[Thanks to Andrew Felton and Derrick Higgins for their Cygwin debugging efforts.]

\subpage mingw
*/

/** \page notroot Not root?

If you aren't root, then the common procedure for installing a library is to create a subdirectory in your home directory in which to install packages. The key is the --prefix addition to the ./configure command.

\code
export MY_LIBS=myroot   #choose a directory name to be created in your home directory.
mkdir $HOME/$MY_LIBS

# From Apophenia's package directory:
./configure --prefix $HOME/$MY_LIBS
make
make install   #Now you don't have to be root.

# Adjust your paths so the compiler and the OS can find the library.
# These are environment variables, and they are usually set in the
# shell's startup files. I assume you are using bash here.
# Executables are installed under bin, headers under include, and libraries under lib.

echo "export PATH=$HOME/$MY_LIBS/bin:\$PATH" >> ~/.bashrc
echo "export CPATH=$HOME/$MY_LIBS/include:\$CPATH" >> ~/.bashrc
echo "export LIBRARY_PATH=$HOME/$MY_LIBS/lib:\$LIBRARY_PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=$HOME/$MY_LIBS/lib:\$LD_LIBRARY_PATH" >> ~/.bashrc
\endcode

Once you have created this local root directory, you can use it to install as many new libraries as desired, and your paths will already be set up to find them.
*/

/** \page makefile Makefile

Instead of giving lengthy compiler commands at the command prompt, you can use a Makefile to do most of the work. How to:

\li Copy and paste the following into a file named \c makefile.
\li Change the first line to the name of your program (e.g., if you have written census.c, then the first line will read PROGNAME=census).
\li If your program has multiple .c files, add a corresponding .o to the currently blank objects variable, e.g. objects=sample2.o sample3.o
\li Once you have a Makefile in the directory, simply type make at the command prompt to generate the executable.

\code
PROGNAME = your_program_name_here
objects =
CFLAGS = -g -Wall -O3
LDLIBS = -lapophenia -lgsl -lgslcblas -lsqlite3

$(PROGNAME): $(objects)
\endcode

\li If your system has \c pkg-config, then you can use it for a slightly more robust and readable makefile. Replace the above C and link flags with:

\code
CFLAGS = -g -Wall `pkg-config --cflags apophenia` -O3
LDLIBS = `pkg-config --libs apophenia`
\endcode

The \c pkg-config program will then fill in the appropriate directories and libraries. Pkg-config knows Apophenia depends on the GSL and database libraries, so you need only list the most-dependent library.

\li The -O3 flag is optional, asking the compiler to run its highest level of optimization (for speed).
\li GCC users may need the --std=gnu99 or --std=gnu11 flag to use post-1989 C standards.
\li Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, gcc sample.c -lapophenia will work, while gcc -lapophenia sample.c is likely to give you errors. Similarly, list -lapophenia before -lgsl, which comes before -lgslcblas.
*/

/** \page designated Designated initializers

Functions so marked in this documentation use standard C designated initializers and compound literals to allow you to omit, call by name, or change the order of inputs. The following examples are all equivalent.
The standard format:

\code
apop_text_to_db("infile.txt", "intable", 0, 1, NULL);
\endcode

Omitted arguments are left at their default values:

\code
apop_text_to_db("infile.txt", "intable");
\endcode

You can use the variable's name, if you forget its ordering:

\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, .has_row_name=0);
\endcode

If an un-named element follows a named element, then that value is given to the next variable in the standard ordering:

\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, NULL);
\endcode

\li There may be cases where you cannot use this form (it relies on a macro, which may not be available). You can always call the underlying function directly, by adding \c _base to the name and giving all arguments:

\code
apop_text_to_db_base("infile.txt", "intable", 0, 1, NULL);
\endcode

\li If one of the optional elements is an RNG and you do not provide one, I use one from \ref apop_rng_get_thread.
*/

/** \page preliminaries Getting started

If you are entirely new to Apophenia, \ref gentle "have a look at the Gentle Introduction here". As well as the information in this outline, there is a separate page covering the details of \ref setup "setting up a computing environment" and another page with \ref eg "some sample code" for your perusal.

\subpage gentle
\subpage setup
\subpage eg
\subpage refstatusext
*/

/** \page refstatusext References and extensions

\section mwd The book version

Apophenia co-evolved with Modeling with Data: Tools and Techniques for Statistical Computing. You can read about the book, or download a free PDF copy of the full text, at modelingwithdata.org.

As with many computer programs, the preferred manner of citing Apophenia is to cite its related book.
Here is a BibTeX-formatted entry giving the relevant information: \code @book{klemens:modeling, title = "Modeling with Data: Tools and Techniques for Statistical Computing", author="Ben Klemens", year=2008, publisher="Princeton University Press" } \endcode The rationale for the \ref apop_model struct, based on an algebraic system of models, is detailed in a U.S. Census Bureau research report: \code @techreport{klemens:algebra, title = "A Useful Algebraic System of Statistical Models", author="Ben Klemens", month=jul, year=2014, institution="U.S.\ Census Bureau", number="06" } \endcode \section ext How do I write extensions? The system is written to not require a registration or initialization step to add a new model or other such parts. Just write your code and include it like any other C code. A new \ref apop_model has to conform to some rules if it is to play well with \ref apop_estimate, \ref apop_draw, and so forth. See the notes at \ref modeldetails. Once your new model or function is working, please post the code or a link to the code on the Apophenia wiki. \subpage c \section links Further references For your convenience, here are links to some other libraries you are probably using. \li The standard C library \li The GSL documentation, and its index \li SQL understood by SQLite */ /** \page outline An outline of the library The narrative in this section goes into greater detail on how to use the components of Apophenia. You are encouraged to read \ref gentle first. This overview begins with the \ref apop_data set, which is the central data structure used throughout the system. Section \ref dbs covers the use of the database interface, because there are a lot of things that a database will do better than a matrix structure like the \ref apop_data struct. Section \ref modelsec covers statistical models, in the form of the \ref apop_model structure. 
This part of the system is built upon the \ref apop_data set to hold parameters, statistics, data sets, and so on.

\ref Histosec covers probability mass functions, which are statistical models built directly around a data set, where the chance of drawing a given observation is proportional to how often that observation appears in the source data. There are many situations where one would want to treat a data set as a probability distribution, such as using \ref apop_kl_divergence to find the information loss from an observed data set to a theoretical model fit to that data.

Section \ref testpage covers traditional hypothesis testing, beginning with common statistics that take an \ref apop_data set or two as input, and continuing on to generalized hypothesis testing for any \ref apop_model.

Because estimation in the \ref apop_model relies heavily on maximum likelihood estimation, Apophenia's optimizer subsystem is extensive. \ref maxipage offers some additional notes on optimization and how it can be used in non-statistical contexts.

\subpage dataoverview
\subpage dbs
\subpage modelsec
\subpage Histosec
\subpage testpage
\subpage maxipage
\subpage moreasst
*/

/** \page c C, SQL and coding utilities

Learning C

Modeling with Data has a full tutorial for C, oriented toward users of standard stats packages. More nuts-and-bolts tutorials are in abundance. Some people find pointers to be especially difficult; fortunately, there's a claymation cartoon which clarifies everything.

Header aggregation

There is only one header. Put

\code
#include <apop.h>
\endcode

at the top of your file, and you're done. Everything declared in that file starts with \c apop_ or \c Apop_. It also includes \c assert.h, \c math.h, \c signal.h, and \c string.h.

Linking

You will need to link to the Apophenia library, which involves adding the -lapophenia flag to your compiler.
Apophenia depends on SQLite3 and the GNU Scientific Library (which depends on a BLAS), so you will probably need something like:
\code
gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g -Wall -O3
\endcode
Your best bet is to encapsulate this mess in a \ref makefile "Makefile". Even if you are using an IDE and its command-line management tools, see the Makefile page for notes on useful flags.

Standards compliance

To the best of our abilities, Apophenia complies with the C standard (ISO/IEC 9899:2011). As well as relying on the GSL and SQLite, it uses some POSIX function calls, such as \c strcasecmp and \c popen.

\subpage designated

\section debugging Errors, logging, debugging and stopping

The \c error element

The \ref apop_data set and the \ref apop_model both include an element named \c error. It is normally \c 0, indicating no (known) error.

For example, \ref apop_data_copy detects allocation errors and some circular links (when Data->more == Data) and fails in those cases. You could thus use the function with a form like
\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error) {printf("Couldn't copy the input data; failing.\n"); return 1;}
\endcode

There is sometimes (but not always) benefit to handling specific error codes, which are listed in the documentation of those functions that set the \c error element. E.g.,
\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error == 'a') {printf("Couldn't allocate space for the copy; failing.\n"); return 1;}
if (cp->error == 'c') {printf("Circular link in the data set; failing.\n"); return 2;}
\endcode

The end of Appendix O of Modeling with Data offers some GDB macros which can make dealing with Apophenia from the GDB command line much more pleasant. As discussed below, it also helps to set apop_opts.stop_on_warning='v' or 'w' when running under the debugger.
\section verbsec Verbosity level and logging The global variable apop_opts.verbose determines how many notifications and warnings get printed by Apophenia's warning mechanism: -1: turn off logging, print nothing (ill-advised)
0: notify only of failures and clear danger
1: warn of technically correct but odd situations that might indicate, e.g., numeric instability
2: debugging-type information; print queries
3: give me everything, such as the state of the data at each iteration of a loop. These levels are of course subjective, but should give you some idea of where to place the verbosity level. The default is 1. The messages are printed to the \c FILE* handle at apop_opts.log_file. If this is blank (which happens at startup), then this is set to \c stderr. This is the typical behavior for a console program. Use \code apop_opts.log_file = fopen("mylog", "w"); \endcode to write to the \c mylog file instead of \c stderr. As well as the error and warning messages, some functions can also print diagnostics, using the \ref Apop_notify macro. For example, \ref apop_query and friends will print the query sent to the database engine iff apop_opts.verbose >=2 (which is useful when building complex queries). The diagnostics attempt to follow the same verbosity scale as the warning messages. \section Stopping By default, warnings and errors never halt processing. It is up to the calling function to decide whether to stop. When running the program under a debugger, this is an annoyance: we want to stop as soon as a problem turns up. The global variable apop_opts.stop_on_warning changes when the system halts: \c 'n': never halt. If you were using Apophenia to support a user-friendly GUI, for example, you would use this mode.
\c '\0' (the default): halt on severe errors, but continue on all warnings.
\c 'v': If the verbosity level of the warning is such that the warning would print to screen, then halt; if the warning message would be filtered out by your verbosity level, continue.
\c 'w': Halt on all errors or warnings, including those below your verbosity threshold.

See the documentation for individual functions for details on how each reports errors to the caller and the level at which warnings are posted.

\section Legi Legible output

The output routines handle four sinks for your output. There is a global variable that you can use for small projects where all data will go to the same place.
\code
apop_opts.output_type = 'f'; //named file
apop_opts.output_type = 'p'; //a pipe or already-opened file
apop_opts.output_type = 'd'; //the database
\endcode

You can also set the output type, the name of the output file or table, and other options via arguments to individual calls to output functions. See \ref apop_prep_output for the list of options.

C makes minimal distinction between pipes and files, so you can set a pipe or file as output and send all output there until further notice:
\code
apop_opts.output_type = 'p';
apop_opts.output_pipe = popen("gnuplot", "w");
apop_plot_lattice(...); //see https://github.com/b-k/Apophenia/wiki/gnuplot_snippets
pclose(apop_opts.output_pipe); //a stream opened with popen is closed with pclose
apop_opts.output_pipe = fopen("newfile", "w");
apop_data_print(set1);
fprintf(apop_opts.output_pipe, "\nNow set 2:\n");
apop_data_print(set2);
\endcode

Continuing the example, you can always override the global data with a specific request:
\code
apop_vector_print(v, "vectorfile"); //put vectors in a separate file
apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db
apop_matrix_print(m, .output_pipe = stdout); //now show the same matrix on screen
\endcode

I will first look to the input file name, then the input pipe, then the global \c output_pipe, in that order, to determine to where I should write. Some combinations (like output type = \c 'd' and only a pipe) don't make sense, and I'll try to warn you about those.

What if you have too much output and would like to use a pager, like \c less or \c more?
In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is the form:
\code
FILE *lesspipe = popen("less", "w");
assert(lesspipe);
apop_data_print(your_data_set, .output_pipe=lesspipe);
pclose(lesspipe);
\endcode
\c popen will search your usual program path for \c less, so you don't have to give a full path.

\li\ref apop_data_print
\li\ref apop_matrix_print
\li\ref apop_vector_print

\section sqlsec About SQL, the syntax for querying databases

For a reference, your best bet is the Structured Query Language reference for SQLite. For a tutorial, there is an abundance of options online. Here is a nice blog entry about complementarities between SQL and matrix manipulation packages.

Apophenia currently supports two database engines: SQLite and mySQL/mariaDB. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.

The global apop_opts.db_engine is initially \c '\0', indicating no preference for a database engine. You can explicitly set it:
\code
apop_opts.db_engine='s'; //use SQLite
apop_opts.db_engine='m'; //use mySQL/mariaDB
\endcode

If \c apop_opts.db_engine is still \c '\0' on your first database operation, then I will check for an environment variable APOP_DB_ENGINE, and set apop_opts.db_engine='m' if it is found and matches (case insensitive) \c mariadb or \c mysql.
\code
export APOP_DB_ENGINE=mariadb
apop_text_to_db indata mtab db_for_maria

unset APOP_DB_ENGINE
apop_text_to_db indata stab db_for_sqlite.db
\endcode

Write \ref apop_data sets to the database using \ref apop_data_print, with .output_type='d'.

\li Column names are inserted if there are any. If there are, all dots are converted to underscores. Otherwise, the columns will be named \c c1, \c c2, \c c3, &c.
\li If \ref apop_opts_type "apop_opts.db_name_column" is not blank (the default is "row_name"), then a so-named column is created, and the row names are placed there.
\li If there are weights, they will be the last column of the table, and the column will be named \c weights. \li If the table does not exist, create it. Use \ref apop_data_print (data, "tabname", .output_type='d', .output_append='w') to overwrite an existing table or with .output_append='a' to append. Appending is the default. Or, call \ref apop_table_exists ("tabname", 'd') to ensure that the table is removed ahead of time. \li If your data set has zero data (i.e., is just a list of column names or is entirely blank), \ref apop_data_print returns without creating anything in the database. \li Especially if you are using a pre-2007 version of SQLite, there may be a speed gain to wrapping the call to this function in a begin/commit pair: \code apop_query("begin;"); apop_data_print(dataset, .output_name="dbtab", .output_type='d'); apop_query("commit;"); \endcode Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see \ref db_moments. \section threads Threading Apophenia uses OpenMP for threading. You generally do not need to know how OpenMP works to use Apophenia, and many points of work will thread without your doing anything. \li All functions strive to be thread-safe. Part of how this is achieved is that static variables are marked as thread-local or atomic, as per the C standard. There still exist compilers that can't implement thread-local or atomic variables, in which case your safest bet is to set OMP's thread count to one as below (or get a new compiler). \li Some functions modify their inputs. It is up to you to use those functions in a thread-safe manner. The \ref apop_matrix_realloc handles states and global variables correctly in a threaded environment, but if you have two threads resizing the same \c gsl_matrix at the same time, you're going to have problems. \li There are few compilers that don't support OpenMP. When compiling on such a system all work will be single-threaded. 
\li Set the maximum number of threads to \c N with the environment variable
\code
export OMP_NUM_THREADS=N
\endcode
or the C function
\code
#include <omp.h>
omp_set_num_threads(N);
\endcode
Use one of these methods with N=1 if you want a single-threaded program. You can return later to using all available threads via omp_set_num_threads(omp_get_num_procs()).
\li \ref apop_map and friends distribute their \c for loop over the input \ref apop_data set across multiple threads. Therefore, be careful to send thread-unsafe functions to it only after calling \c omp_set_num_threads(1).
\li There are a few functions, like \ref apop_model_draws, that rely on \ref apop_map, and therefore also thread by default.
\li The function \ref apop_rng_get_thread retrieves a statically-stored RNG specific to a given thread. Therefore, if you use that function in the place of a \c gsl_rng, you can parallelize functions that make random draws.
\li \ref apop_rng_get_thread allocates its store of RNGs (one per thread) using apop_opts.rng_seed, incrementing that seed by one for each new RNG. You thus probably have threads with seeds 479901, 479902, 479903, .... [If you have a better way to do it, please feel free to modify the code to implement your improvement and submit a pull request on Github.]

See this tutorial on C threading if you would like to know more, or are unsure about whether your functions are thread-safe or not.
*/

/** \page dataoverview Data sets

The \ref apop_data structure represents a data set. It joins together a \c gsl_vector, a \c gsl_matrix, an \ref apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a \c gsl_matrix or a \c gsl_vector.

Here is a diagram showing a sample data set with all of the elements in place. Together, they represent a data set where each row is an observation, which includes both numeric and text values, and where each row/column may be named.
\htmlinclude apop_data_fig.html \latexinclude apop_data_fig.tex In a regression, the vector would be the dependent variable, and the other columns (after factor-izing the text) the independent variables. Or think of the \ref apop_data set as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some sample code to print the vector and matrix, starting at column -1 (but you can use \ref apop_data_print to do this). \code for (int j = 0; j< data->matrix->size1; j++){ printf("%s\t", apop_name_get(data->names, j, 'r')); for (int i = -1; i< data->matrix->size2; i++) printf("%g\t", apop_data_get(data, j, i)); printf("\n"); } \endcode Most functions assume that each row represents one observation, so the data vector, data matrix, and text have the same row count: \c data->vector->size==data->matrix->size1 and \c data->vector->size==*data->textsize. This means that the \ref apop_name structure doesn't have separate \c vector_names, \c row_names, or \c text_row_names elements: the \c rownames are assumed to apply for all. See below for notes on managing the \c text element and the row/column names. \section pps Pages The \ref apop_data set includes a \c more pointer, which will typically be \c NULL, but may point to another \ref apop_data set. This is intended for a main data set and a second or third page with auxiliary information, such as estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two. The \c more pointer is not intended for a linked list for millions of data points. In such situations, you can often improve efficiency by restructuring your data to use a single table (perhaps via \ref apop_data_pack and \ref apop_data_unpack). Most functions, such as \ref apop_data_copy and \ref apop_data_free, will handle all the pages of information. 
For example, an optimization search over multi-page parameter sets would search the space given by all pages.

Pages may also be appended as output or auxiliary information, such as covariances, and an MLE would not search over these elements. Any page with a name in XML-ish brackets is considered information about the data, not data itself, and therefore ignored by search routines, missing data routines, et cetera. This is achieved by a rule in \ref apop_data_pack and \ref apop_data_unpack.

Here is a toy example that establishes a baseline data set, adds a page, modifies it, and then later retrieves it.
\code
apop_data *d = apop_data_alloc(10, 10, 10); //the base data set, a 10-item vector + 10x10 matrix
apop_data *a_new_page = apop_data_add_page(d, apop_data_alloc(2,2), "new 2 x 2 page");
gsl_matrix_set_all(a_new_page->matrix, 3);

//later:
apop_data *retrieved = apop_data_get_page(d, "new", 'r'); //'r'=search via regex, not literal match.
apop_data_print(retrieved); //print a 2x2 grid of 3s.
\endcode

\section datafns Functions for using apop_data sets

There are a great many functions to collate, copy, merge, sort, prune, and otherwise manipulate the \ref apop_data structure and its components.

\li\ref apop_data_add_named_elmt
\li\ref apop_data_copy
\li\ref apop_data_fill
\li\ref apop_data_memcpy
\li\ref apop_data_pack
\li\ref apop_data_rm_columns
\li\ref apop_data_sort
\li\ref apop_data_split
\li\ref apop_data_stack
\li\ref apop_data_transpose : transpose matrices (square or not) and text grids
\li\ref apop_data_unpack
\li\ref apop_matrix_copy
\li\ref apop_matrix_realloc
\li\ref apop_matrix_stack
\li\ref apop_text_set
\li\ref apop_text_paste
\li\ref apop_text_to_data
\li\ref apop_vector_copy
\li\ref apop_vector_fill
\li\ref apop_vector_stack
\li\ref apop_vector_realloc
\li\ref apop_vector_unique_elements

Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate the GSL's documentation here.
Meanwhile, here are prototypes for a few common functions. The GSL's naming scheme is very consistent, so a simple reminder of the function name may be sufficient to indicate how they are used. \li gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j) \li gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j) \li gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j) \li gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src) \li gsl_matrix_transpose (gsl_matrix * m) : square matrices only \li gsl_matrix_set_all (gsl_matrix * m, double x) \li gsl_matrix_set_zero (gsl_matrix * m) \li gsl_matrix_set_identity (gsl_matrix * m) \li gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src) \li void gsl_vector_set_all (gsl_vector * v, double x) \li void gsl_vector_set_zero (gsl_vector * v) \li int gsl_vector_set_basis (gsl_vector * v, size_t i): set all elements to zero, but set item \f$i\f$ to one. \li gsl_vector_reverse (gsl_vector * v): reverse the order of your vector's elements \li gsl_vector_ptr and gsl_matrix_ptr. To increment an element in a vector use, e.g., *gsl_vector_ptr(v, 7) += 3; or (*gsl_vector_ptr(v, 7))++. \li gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src) \subsection readin Reading from text files The \ref apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See \ref text_format for the full range and details of what can be read in. If you have any columns of text, then you will need to read in via the database: use \ref apop_text_to_db() to convert your text file to a database table, do any database-appropriate cleaning of the input data, then use \ref apop_query_to_data() or \ref apop_query_to_mixed_data() to pull the data to an \ref apop_data set. 
\subpage text_format \section datalloc Alloc/free You may not need to use these functions often, given that \ref apop_query_to_data, \ref apop_text_to_data, and many transformation functions will auto-allocate \ref apop_data sets for you. The \ref apop_data_alloc function allocates a vector, a matrix, or both. After this call, the structure will have blank names, \c NULL \c text element, and \c NULL \c weights. See \ref names for discussion of filling the names. Use \ref apop_text_alloc to allocate the \c text grid. The \c weights are a simple \c gsl_vector, so allocate a 100-unit weights vector via allocated_data_set->weights = gsl_vector_alloc(100). Examples of use can be found throughout the documentation; for example, see \ref gentle. \li\ref apop_data_alloc \li\ref apop_data_calloc \li\ref apop_data_free \li\ref apop_text_alloc : allocate or resize the text part of an \ref apop_data set. \li\ref apop_text_free See also: \li gsl_matrix * gsl_matrix_alloc (size_t n1, size_t n2) \li gsl_matrix * gsl_matrix_calloc (size_t n1, size_t n2) \li void gsl_matrix_free (gsl_matrix * m) \li gsl_vector * gsl_vector_alloc (size_t n) \li gsl_vector * gsl_vector_calloc (size_t n) \li void gsl_vector_free (gsl_vector * v) \section gslviews Using views There are several macros for the common task of viewing a single row or column of a \ref apop_data set. \code apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table"); //Get a column using its name. Note that the generated view, ov, is the //last item named in the call to the macro. 
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);

//Get row zero of the data set's matrix as a vector; get its sum
double row_zero_sum = apop_vector_sum(Apop_rv(d, 0));

//Get a row or rows as a standalone one-row apop_data set
apop_data_print(Apop_r(d, 0));

//ten rows starting at row 3:
apop_data *d10 = Apop_rs(d, 3, 10);
apop_data_print(d10);

//Column zero's sum
gsl_vector *cv = Apop_cv(d, 0);
double col_zero_sum = apop_vector_sum(cv);
//or on one line:
double col_zero_sum2 = apop_vector_sum(Apop_cv(d, 0));

//Pull a 10x5 submatrix, whose origin element is the (2,3)rd
//element of the parent data set's matrix
double sub_sum = apop_matrix_sum(Apop_subm(d, 2,3, 10,5));
\endcode

Because these macros can be used as arguments to a function, they have abbreviated names to save line space.

\li\ref Apop_r : get row as one-observation \ref apop_data set
\li\ref Apop_c : get column as \ref apop_data set
\li\ref Apop_cv : get column as \ref gsl_vector
\li\ref Apop_rv : get row as \ref gsl_vector
\li\ref Apop_cs : get columns as \ref apop_data set
\li\ref Apop_rs : get rows as \ref apop_data set
\li\ref Apop_mcv : matrix column as vector
\li\ref Apop_mrv : matrix row as vector
\li\ref Apop_subm : get submatrix of a \ref gsl_matrix

A second set of macros have a slightly different syntax, taking the name of the object to be declared as the last argument. These cannot be used as expressions such as function arguments.

\li\ref Apop_col_t
\li\ref Apop_row_t
\li\ref Apop_col_tv
\li\ref Apop_row_tv

The view is an automatic variable, not a pointer, and therefore disappears at the end of the scope in which it is declared. If you want to retain the data after the function exits, copy it to another vector:
\code
return apop_vector_copy(Apop_rv(d, 2)); //return a gsl_vector copy of row 2
\endcode

Curly braces always delimit scope, not just at the end of a function. When program evaluation exits a given block, all variables in that block are erased.
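As a minimal plain-C sketch of this scoping rule, using no Apophenia types at all (the function name \c pick_value and its contents are hypothetical, chosen only for illustration): a value that must outlive an inner block is copied out before the closing brace, in the same spirit as the \ref apop_vector_copy call above.

```c
// Plain-C stand-in: no Apophenia types are used here.
// `local` has automatic storage and ceases to exist at the inner closing brace,
// so the needed value is copied into `out` before that brace is reached.
double pick_value(int get_odd){
    double out;
    {
        double local[2] = {10.0, 11.0};
        out = local[get_odd ? 1 : 0];   // copy by value; a pointer into local would dangle
    }
    return out;                          // safe: out belongs to the enclosing scope
}
```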
Here is some sample code that won't work:
\code
apop_data *outdata;
if (get_odd){
    outdata = Apop_r(data, 1);
} else {
    outdata = Apop_r(data, 0);
}
apop_data_print(outdata); //breaks: outdata points to out-of-scope variables.
\endcode

For this if/then statement, there are two sets of local variables generated: one for the \c if block, and one for the \c else block. By the last line, neither exists. You can get around the problem here by making sure to not put the macro declaring new variables in a block. E.g.:
\code
apop_data *outdata = Apop_r(data, get_odd ? 1 : 0);
apop_data_print(outdata);
\endcode

\section data_set_get Set/get

First, some examples:
\code
apop_data *d = apop_data_alloc(10, 10, 10);
apop_name_add(d->names, "Zeroth row", 'r');
apop_name_add(d->names, "Zeroth col", 'c');

//set cell at row=8 col=0 to value=27
apop_data_set(d, 8, 0, .val=27);
assert(apop_data_get(d, 8, .colname="Zeroth") == 27);
double *x = apop_data_ptr(d, .col=7, .rowname="Zeroth");
*x = 270;
assert(apop_data_get(d, 0, 7) == 270);

// This is invalid---the value doesn't follow the colname. Use .val=5.
// apop_data_set(d, .row = 3, .colname="Column 8", 5);

// This is OK, to set (3, 8) to 5:
apop_data_set(d, 3, 8, 5);

//apop_data set holding a scalar:
apop_data *s = apop_data_alloc(1);
apop_data_set(s, .val=12);
assert(apop_data_get(s) == 12);

//apop_data set holding a vector:
apop_data *v = apop_data_alloc(12);
for (int i=0; i< 12; i++) apop_data_set(v, i, .val=i*10);
assert(apop_data_get(v,3) == 30);

//This is a common form from pulling from a list of named scalars,
//produced via apop_data_add_named_elmt
double AIC = apop_data_get(your_model->info, .rowname="AIC");
\endcode

\li The versions that take a column/row name use \ref apop_name_find for the search; see notes there on the name matching rules.
\li For those that take a column number, column -1 is the vector element.
\li For those that take a column name, I will search the vector last---if I don't find the name among the matrix columns, but the name matches the vector name, I return column -1. \li If you give me both a .row and a .rowname, I go with the name; similarly for .col and .colname. \li You can give the name of a page, e.g. \code double AIC = apop_data_get(data, .rowname="AIC", .col=-1, .page=""); \endcode \li Numeric values default to zero, which is how the examples above that treated the \ref apop_data set as a vector or scalar could do so relatively gracefully. So apop_data_get(dataset, 1) gets item (1, 0) from the matrix element of \c dataset. But as a do-what-I-mean exception, if there is no matrix element but there is a vector, then this form will get vector element 1. Relying on this DWIM exception is useful iff you can guarantee that a data set will have only a vector or a matrix but not both. Otherwise, be explicit and use apop_data_get(dataset, 1, -1). The \ref apop_data_ptr function follows the lead of \c gsl_vector_ptr and \c gsl_matrix_ptr, and like those functions, returns a pointer to the appropriate \c double. For example, to increment the (3,7)th element of an \ref apop_data set: \code (*apop_data_ptr(dataset, 3, 7))++; \endcode \li\ref apop_data_get \li\ref apop_data_set \li\ref apop_data_ptr : returns a pointer to the element. \li\ref apop_data_get_page : retrieve a named page from a data set. If you only need a few items, you can specify a page name to \c apop_data_(get|set|ptr). 
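The column-numbering rules listed in this section (column -1 is the vector, columns 0 and up are the matrix, and the do-what-I-mean fallback to the vector when there is no matrix) can be sketched in plain C. This is only an illustration under stated assumptions: \c toy_data and \c toy_get are hypothetical stand-ins, not the \ref apop_data struct or the actual implementation of \ref apop_data_get, and the sketch does no bounds checking or name lookup.

```c
#include <stddef.h>

// Hypothetical stand-in for the vector-plus-matrix layout; this is not the apop_data struct.
typedef struct {
    size_t rows, cols;      // matrix dimensions; the vector, if present, has `rows` elements
    const double *vector;   // NULL if there is no vector
    const double *matrix;   // row-major, rows x cols elements; NULL if there is no matrix
} toy_data;

// Column -1 addresses the vector; columns 0..cols-1 address the matrix.
// Mirroring the do-what-I-mean rule in the text, column 0 falls back to the
// vector when there is a vector but no matrix at all.
double toy_get(const toy_data *d, size_t row, int col){
    if (col == -1 || (col == 0 && !d->matrix && d->vector))
        return d->vector[row];
    return d->matrix[row * d->cols + (size_t)col];
}
```

With a two-element vector and a 2x3 matrix, <tt>toy_get(&d, 1, -1)</tt> reads the vector and <tt>toy_get(&d, 1, 0)</tt> reads the matrix; with a vector only, both forms read the vector, which is why the text advises being explicit with column -1 when a data set may hold both parts.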
See also:

\li double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)
\li double gsl_vector_get (const gsl_vector * v, size_t i)
\li void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)
\li void gsl_vector_set (gsl_vector * v, size_t i, double x)
\li double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)
\li double * gsl_vector_ptr (gsl_vector * v, size_t i)
\li const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)
\li const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)
\li gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)
\li gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)
\li gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)
\li gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)

\section mapply Map/apply

\anchor outline_mapply
These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The \c ..._sum functions return the sum of the mapped output.

There are two types, which were developed at different times. The \ref apop_map and \ref apop_map_sum functions use variadic function inputs to cover a lot of different types of process depending on the inputs. Other functions with types in their names, like \ref apop_matrix_map and \ref apop_vector_apply, may be easier to use in some cases. They use the same routines internally, so use whichever type is convenient.

You can do many things quickly with these functions.

Get the sum of squares of a vector's elements:
\code
//given apop_data *dataset and gsl_vector *v:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2);
double sum_of_squares_v = apop_vector_map_sum(v, gsl_pow_2);
\endcode

Create an index vector [\f$0, 1, 2, ...\f$].
\code double index(double in, int index){return index;} apop_data *d = apop_map(apop_data_alloc(100), .fn_di=index, .inplace='y'); \endcode Given your log likelihood function, which acts on a \ref apop_data set with only one row, and a data set where each row of the matrix is an observation, find the total log likelihood via: \code static double your_log_likelihood_fn(apop_data * in) {[your math goes here]} double total_ll = apop_map_sum(dataset, .fn_r=your_log_likelihood_fn); \endcode How many missing elements are there in your data matrix? \code static double nan_check(const double in){ return isnan(in);} int missing_ct = apop_map_sum(in, nan_check, .part='m'); \endcode Get the mean of the not-NaN elements of a data set: \code static double no_nan_val(const double in){ return isnan(in)? 0 : in;} static double not_nan_check(const double in){ return !isnan(in);} static double apop_mean_no_nans(apop_data *in){ return apop_map_sum(in, no_nan_val)/apop_map_sum(in, not_nan_check); } \endcode The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the \f$t\f$ statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero. Notice how the older \ref apop_vector_apply uses file-global variables to pass information into the functions, while the \ref apop_map uses a pointer to send parameters to the functions. \include t_test_by_rows.c One more toy example demonstrating the use of \ref apop_map and \ref apop_map_sum : \include apop_map_row.c \li If the number of threads is greater than one, then the matrix will be broken into chunks and each sent to a different thread. Notice that the GSL is generally threadsafe, and SQLite is threadsafe conditional on several commonsense caveats that you'll find in the SQLite documentation. See \ref apop_rng_get_thread() to use the GSL's RNGs in a threaded environment. 
\li The \c ...sum functions are convenience functions that call \c ...map and then add up the contents. Thus, you will need to have adequate memory for the allocation of the temp matrix/vector.

\li\ref apop_map
\li\ref apop_map_sum
\li\ref apop_matrix_apply
\li\ref apop_matrix_map
\li\ref apop_matrix_map_all_sum
\li\ref apop_matrix_map_sum
\li\ref apop_vector_apply
\li\ref apop_vector_map
\li\ref apop_vector_map_sum

\section matrixmathtwo Basic Math

\li\ref apop_vector_exp : exponentiate every element of a vector
\li\ref apop_vector_log : take the natural log of every element of a vector
\li\ref apop_vector_log10 : take the log (base 10) of every element of a vector
\li\ref apop_vector_distance : find the distance between two vectors via various metrics
\li\ref apop_vector_normalize : scale/shift a vector to have mean zero, sum to one, have a range of exactly \f$[0, 1]\f$, et cetera
\li\ref apop_vector_entropy : calculate the entropy of a vector of frequencies or probabilities

See also:

\li int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)
\li int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)
\li int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)
\li int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)
\li int gsl_matrix_scale (gsl_matrix * a, const double x)
\li int gsl_matrix_add_constant (gsl_matrix * a, const double x)
\li gsl_vector_add (gsl_vector * a, const gsl_vector * b)
\li gsl_vector_sub (gsl_vector * a, const gsl_vector * b)
\li gsl_vector_mul (gsl_vector * a, const gsl_vector * b)
\li gsl_vector_div (gsl_vector * a, const gsl_vector * b)
\li gsl_vector_scale (gsl_vector * a, const double x)
\li gsl_vector_add_constant (gsl_vector * a, const double x)

\section matrixmath Matrix math

\li\ref apop_dot : matrix \f$\cdot\f$ matrix, matrix \f$\cdot\f$ vector, or vector \f$\cdot\f$ matrix
\li\ref apop_matrix_determinant
\li\ref apop_matrix_inverse
\li\ref apop_det_and_inv : find determinant and inverse at the same
time See the GSL documentation for myriad further options. \section sumstats Summary stats \li\ref apop_data_summarize \li\ref apop_vector_moving_average \li\ref apop_vector_percentiles \li\ref apop_vector_bounded See also: \li double gsl_matrix_max (const gsl_matrix * m) \li double gsl_matrix_min (const gsl_matrix * m) \li void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out) \li void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax) \li void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin) \li void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax) \li gsl_vector_max (const gsl_vector * v) \li gsl_vector_min (const gsl_vector * v) \li gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out) \li gsl_vector_max_index (const gsl_vector * v) \li gsl_vector_min_index (const gsl_vector * v) \li gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax) \section moments Moments For most of these, you can add a weights vector for weighted mean/var/cov/..., such as apop_vector_mean(d->vector, .weights=d->weights) \li\ref apop_mean : the first three with short names operate on a vector. \li\ref apop_sum \li\ref apop_var \li\ref apop_matrix_sum \li\ref apop_data_correlation \li\ref apop_data_covariance \li\ref apop_data_summarize \li\ref apop_matrix_mean \li\ref apop_matrix_mean_and_var \li\ref apop_vector_correlation \li\ref apop_vector_cov \li\ref apop_vector_kurtosis \li\ref apop_vector_kurtosis_pop \li\ref apop_vector_mean \li\ref apop_vector_skew \li\ref apop_vector_skew_pop \li\ref apop_vector_sum \li\ref apop_vector_var \li\ref apop_vector_var_m \section convsec Conversion among types There are no functions provided to convert from \ref apop_data to the constituent elements, because you don't need a function. 
If you need an individual element, you can use its pointer to retrieve it: \code apop_data *d = apop_query_to_mixed_data("vmmw", "select result, age, " "income, replicate_weight from data"); double avg_result = apop_vector_mean(d->vector, .weights=d->weights); \endcode In the other direction, you can use compound literals to wrap an \ref apop_data struct around a loose vector or matrix: \code //Given: gsl_vector *v; gsl_matrix *m; // Then this form wraps the elements into automatically-allocated apop_data structs. apop_data *dv = &(apop_data){.vector=v}; apop_data *dm = &(apop_data){.matrix=m}; apop_data *v_dot_m = apop_dot(dv, dm); //Here is a macro to hide C's ugliness: #define As_data(...) (&(apop_data){__VA_ARGS__}) apop_data *v_dot_m2 = apop_dot(As_data(.vector=v), As_data(.matrix=m)); //The wrapped object is an automatically-allocated structure pointing to the //original data. If it needs to persist or be separate from the original, //make a copy: apop_data *dm_copy = apop_data_copy(As_data(.vector=v, .matrix=m)); \endcode \li\ref apop_array_to_vector : double*\f$\to\f$ gsl_vector \li\ref apop_data_fill : double*\f$\to\f$ \ref apop_data \li\ref apop_data_falloc : macro to allocate and fill a \ref apop_data set \li\ref apop_text_to_data : delimited text file\f$\to\f$ \ref apop_data \li\ref apop_text_to_db : delimited text file\f$\to\f$ database table \li\ref apop_vector_to_matrix \section names Name handling If you generate your data set via \ref apop_text_to_data or from the database via \ref apop_query_to_data (or \ref apop_query_to_text or \ref apop_query_to_mixed_data) then column names appear as expected. Set apop_opts.db_name_column to the name of a column in your query result to use that column name for row names. 
Sample uses, given \ref apop_data set d:

\code
int row_name_count = d->names->rowct;
int col_name_count = d->names->colct;
int text_name_count = d->names->textct;

//Manually add names in sequence:
apop_name_add(d->names, "the vector", 'v');
apop_name_add(d->names, "row 0", 'r');
apop_name_add(d->names, "row 1", 'r');
apop_name_add(d->names, "row 2", 'r');
apop_name_add(d->names, "numeric column 0", 'c');
apop_name_add(d->names, "text column 0", 't');
apop_name_add(d->names, "The name of the data set.", 'h');

//or append several names at once
apop_data_add_names(d, 'c', "numeric column 1", "numeric column 2", "numeric column 3");

//point to element i from the row/col/text names:
char *rowname_i = d->names->row[i];
char *colname_i = d->names->col[i];
char *textname_i = d->names->text[i];

//The vector also has a name:
char *vname = d->names->vector;
\endcode

\li\ref apop_name_add : add one name
\li\ref apop_data_add_names : add a sequence of names at once
\li\ref apop_name_stack : copy the contents of one name list to another
\li\ref apop_name_find : find the row/col number for a given name.
\li\ref apop_name_print : print the \ref apop_name struct, for diagnostic purposes.

\section textsec Text data

The \ref apop_data set includes a grid of strings, named \c text, for holding text data. Text should be encoded in UTF-8. ASCII is a subset of UTF-8, so that's OK too.

There are a few simple forms for handling the \c text element of an \c apop_data set.

\li Use \ref apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks. See the example below.
\li Use \ref apop_text_set to write text elements. It replaces any existing text in the given slot without memory leaks.
\li The number of rows of text data in tdata is tdata->textsize[0]; the number of columns is tdata->textsize[1].
\li Refer to individual elements using the usual 2-D array notation, tdata->text[row][col].
\li x[0] can always be written as *x, which may save some typing. The number of rows is *tdata->textsize. If you have a single column of text data (i.e., all data is in column zero), then item \c i is *tdata->text[i]. If you know you have exactly one cell of text, then its value is **tdata->text.
\li After \ref apop_text_alloc, all elements are the empty string "", which you can check via
\code
if (!strlen(dataset->text[i][j])) printf("");
//or
if (!*dataset->text[i][j]) printf("");
\endcode
For the sake of efficiency when dealing with large, sparse data sets, all blank cells point to the same static empty string, meaning that freeing cells must be done with care. Your best bet is to rely on \ref apop_text_set, \ref apop_text_alloc, and \ref apop_text_free to do the memory management for you.

Here is a sample program that uses these forms, plus a few text-handling functions.

\include eg/text_demo.c

\li\ref apop_data_transpose() : also transposes the text data. Say that you use dataset = apop_query_to_text("select onecolumn from data"); then you have a sequence of strings, dataset->text[0][0], dataset->text[1][0], .... After apop_data *dt = apop_data_transpose(dataset), you will have a single list of strings, dt->text[0], which is often useful as input to list-of-strings handling functions.
\li\ref apop_query_to_text
\li\ref apop_text_alloc : allocate or resize the text part of an \ref apop_data set.
\li\ref apop_text_set : replace a single cell of the text grid with new text.
\li\ref apop_text_paste : convert a table of strings into one long string.
\li\ref apop_text_unique_elements : get a sorted list of unique elements for one column of text.
\li\ref apop_text_free : you may never need this, because \ref apop_data_free calls it.
\li\ref apop_regex : friendlier front-end for POSIX-standard regular expression searching; pulls matches into an \ref apop_data set.

\subsection fact Generating factors

\em Factor is jargon for a numbered category.
Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors. A \em dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable in a regression) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors. The functions that generate factors and dummies will add an informational page to your \ref apop_data set with a name like \ listing the conversion from the artificial numeric factor to the original data. Use \ref apop_data_get_factor_names to get a pointer to that page. You can use the factor table to translate from numeric categories back to text (though you probably have the original text column in your data anyway). Having the factor list in an auxiliary table makes it easy to ensure that multiple \ref apop_data sets use the same single categorization scheme. Generate factors in the first set, then copy the factor list to the second, then run \ref apop_data_to_factors on the second: \code apop_data_to_factors(d1); d2->more = apop_data_copy(apop_data_get_factor_names(d1)); apop_data_to_factors(d2); \endcode See the documentation for \ref apop_logit for a sample linear model using a factor dependent variable and dummy independent variable. \li\ref apop_data_to_dummies \li\ref apop_data_to_factors \li\ref apop_data_get_factor_names */ /** \page dbs Databases These are convenience functions to handle interaction with SQLite or mySQL/mariaDB. They open one and only one database, and handle most of the interaction therewith for you. You will probably first use \ref apop_text_to_db to pull data into the database, then \ref apop_query to clean the data in the database, and finally \ref apop_query_to_data to pull some subset of the data out for analysis. \li In all cases, your query may be in printf form. 
For example: \code char tabname[] = "demographics"; char colname[] = "heights"; int min_height = 175; apop_query("select %s from %s where %s > %i", colname, tabname, colname, min_height); \endcode See the \ref db_moments section below for not-SQL-standard math functions that you can use when sending queries from Apophenia, such as \c pow, \c stddev, or \c sqrt. \li \ref apop_text_to_db : Read a text file on disk into the database. Data analysis projects often start with a call to this. \li \ref apop_data_print : If you include the argument .output_type='d', this prints your \ref apop_data set to the database. \li \ref apop_query : Manipulate the database, return nothing (e.g., insert rows or create table). \li \ref apop_db_open : Optional, for when you want to use a database on disk. \li \ref apop_db_close : A useful (and in some cases, optional) companion to \ref apop_db_open. \li \ref apop_table_exists : Check to make sure you aren't reinventing or destroying data. Also, a clean way to drop a table. \li Apophenia reserves the right to insert temp tables into the opened database. They will all have names beginning with apop_, so the reader is advised to not generate tables with such names, and is free to ignore or delete any such tables that turn up. \li If you need to deal with two databases, use SQL's attach database. By default with SQLite, Apophenia opens an in-memory database handle. It is a sensible workflow to use the faster in-memory database as the primary database, and then attach an on-disk database to read in data and write final output tables. \section edftd Extracting data from the database \li\ref apop_db_to_crosstab : take up to three columns in the database (row, column, value) and produce a table of values. \li\ref apop_query_to_data \li\ref apop_query_to_float \li\ref apop_query_to_mixed_data \li\ref apop_query_to_text \li\ref apop_query_to_vector \section wdttd Writing data to the database See the print functions at \ref Legi. E.g. 
\code
apop_data_print(yourdata, .output_type='d', .output_name="dbtab");
\endcode

\section cmdline Command-line utilities

A few functions have proven to be useful enough to be worth breaking out into their own programs, for use in scripts or other data analysis from the command line:

\li The \c apop_text_to_db command line utility is a wrapper for the \ref apop_text_to_db command.
\li The \c apop_db_to_crosstab command line utility is a wrapper for the \ref apop_db_to_crosstab function.

\section db_moments Database moments (plus pow()!)

SQLite lets users define new functions for use in queries, and Apophenia uses this facility to define a few common functions.

\li select ran() from table will produce a new random number between zero and one for every row of the input table, using \c gsl_rng_uniform.
\li The SQL standard includes the count(x) and avg(x) aggregators, but statisticians are usually interested in higher moments as well---at least the variance. Therefore, SQL queries using the Apophenia library may include any of these moments:
\code
select count(x), stddev(x), avg(x), var(x), variance(x), skew(x), kurt(x), kurtosis(x),
std(x), stddev_samp(x), stddev_pop(x), var_samp(x), var_pop(x)
from table
group by whatever
\endcode
var and variance; kurt and kurtosis do the same thing; choose the one that sounds better to you. Kurtosis is the fourth central moment by itself, not adjusted by subtracting three or dividing by variance squared. var, var_samp, stddev and stddev_samp give sample variance/standard deviation; variance, var_pop, std and stddev_pop give population variance/standard deviation. The plethora of variants is for mySQL compatibility.
\li The var/skew/kurtosis functions calculate sample moments. If you want the second population moment, multiply the variance by \f$(n-1)/n\f$; for the third population moment, multiply the skew by \f$(n-1)(n-2)/n^2\f$.
The equation for the unbiased sample kurtosis as calculated in Appendix M of Modeling with Data is not quite as easy to adjust. \li Also provided: wrapper functions for standard math library functions---sqrt(x), pow(x,y), exp(x), log(x), and trig functions. They call the standard math library function of the same name to calculate \f$\sqrt{x}\f$, \f$x^y\f$, \f$e^x\f$, \f$\ln(x)\f$, \f$\sin(x)\f$, \f$\arcsin(x)\f$, et cetera. For example: \code select sqrt(x), pow(x,0.5), exp(x), log(x), log10(x), sin(x), cos(x), tan(x), asin(x), acos(x), atan(x) from table \endcode \li The ran() function calls gsl_rng_uniform to produce a uniform draw between zero and one. It uses the stock of RNGs from \ref apop_rng_get_thread. Here is a test script using many of the above. \include db_fns.c */ /** \page modelsec Models See \ref gentle_model for an overview of the intent and basic use of the \ref apop_model struct. This segment goes into greater detail on the use of existing \ref apop_model objects. If you need to write a new model, see \ref modeldetails. The \c estimate function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate: \code apop_data *data = apop_query_to_data("select outcome, in1, in2, in3 from dataset"); apop_model *the_estimate = apop_estimate(data, apop_probit); apop_model_print(the_estimate); \endcode Along the way to estimating the parameters, most models also find covariance estimates for the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show. The apop_probit model that ships with Apophenia is unparameterized: apop_probit->parameters==NULL. The output from the estimation, the_estimate, has the same form as apop_probit, but the_estimate->parameters has a meaningful value. Apophenia ships with many well-known models for your immediate use, including probability distributions, such as the \ref apop_normal, \ref apop_poisson, or \ref apop_beta models. 
The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit. For example, given that the data is Normally distributed, find \f$\mu\f$ and \f$\sigma\f$ via apop_estimate(your_data, apop_normal). There are also linear models like \ref apop_ols, \ref apop_probit, and \ref apop_logit. As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model. \li If you send a data set with the \c weights vector filled, \ref apop_ols estimates Weighted OLS. \li If the dependent variable has more than two categories, the \ref apop_probit and \ref apop_logit models estimate a multinomial logit or probit. \li There are separate \ref apop_normal and \ref apop_multivariate_normal functions because the parameter formats are slightly different: the univariate Normal keeps both \f$\mu\f$ and \f$\sigma\f$ in the vector element of the parameters; the multivariate version uses the vector for the vector of means and the matrix for the \f$\Sigma\f$ matrix. The univariate version is so heavily used that it merits a special-case model. See the \ref models page for a list of models shipped with Apophenia, including popular favorites like \ref apop_beta, \ref apop_binomial, \ref apop_iv (instrumental variables), \ref apop_kernel_density, \ref apop_loess, \ref apop_lognormal, \ref apop_pmf (see \ref histosec below), and \ref apop_poisson. Simulation models seem to not fit this form, but you will see below that if you can write an objective function for the \c p method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters. More estimation output In the \ref apop_model returned by \ref apop_estimate, you will find: \li The actual parameter estimates are in an \ref apop_data set at \c your_model->parameters. 
\li A pointer to the \ref apop_data set used for estimation, named \c data. \li Scalar statistics of the model listed in the output model's \c info group, which may include some hypothesis tests, a list of expected values, log likelihood, AIC, AIC_c, BIC, et cetera. These can be retrieved via a form like \code apop_data_get(your_model->info, .rowname="log likelihood"); //or apop_data_get(your_model->info, .rowname="AIC"); \endcode If those are not necessary, adding to your model an \ref apop_parts_wanted_settings group with its default values (see below on settings groups) signals to the model that you want only the parameters and to not waste possibly significant CPU time on covariances, expected values, et cetera. See the \ref apop_parts_wanted_settings documentation for examples and further refinements. \li In many cases, covariances of the parameters as a page appended to the parameters; retrieve via \code apop_data *cov = apop_data_get_page(your_model->parameters, ""); \endcode \li Typically for regression-type models, the table of expected values (typically including expected value, actual value, and residual) is a page stapled to the main info page. Retrieve via: \code apop_data *predict = apop_data_get_page(your_model->info, ""); \endcode See individual model documentation for what is provided by any given model. Post-estimation uses But we expect much more from a model than estimating parameters from data. 
Continuing the above example where we got an estimated Probit model named \c the_estimate, we can interrogate the estimate in various familiar ways:

\code
apop_data *expected_value = apop_predict(NULL, the_estimate);
double density_under = apop_cdf(expected_value, the_estimate);
apop_data *draws = apop_model_draws(the_estimate, .count=1000);
\endcode

\subpage dataones

\section modelparameterization Parameterizing or initializing a model

The models that ship with Apophenia have the requisite procedures for estimation, making draws, and so on, but have parameters==NULL and settings==NULL. The model is thus, for many purposes, incomplete, and you will need to take some action to complete the model. As per the examples to follow, there are several possibilities:

\li Estimate it! Almost all models can be sent with a data set as an argument to the apop_estimate function. The input model is unchanged, but the output model has parameters and settings in place.
\li If your model has a fixed number of numeric parameters, then you can set them with \ref apop_model_set_parameters.
\li If your model has a variable number of parameters, you can directly set the \c parameters element via \ref apop_data_falloc. For most purposes, you will also need to set the \c msize1, \c msize2, \c vsize, and \c dsize elements to the size you want. See the example below.
\li Some models have disparate, non-numeric settings rather than a simple matrix of parameters. For example, a kernel density estimate needs a model as a kernel and a base data set, which can be set via \ref apop_model_set_settings.

Here is an example that shows the options for parameterizing a model. After each parameterization, 20 draws are made and written to a file named draws-[modelname].

\include ../eg/parameterization.c

\section transformsec Filtering & updating

The model structure makes it easy to generate new models that are variants of prior models.
Bayesian updating, for example, takes in one \ref apop_model that we call the prior, one \ref apop_model that we call a likelihood, and outputs an \ref apop_model that we call the posterior. One can produce complex models using simpler transformations as well. For example, \ref apop_model_fix_params will set the free parameters of an input model to a fixed value, thus producing a model with fewer parameters. To transform a Normal(\f$\mu\f$, \f$\sigma\f$) into a one-parameter Normal(\f$\mu\f$, 1): \code apop_model *N_sigma1 = apop_model_fix_params(apop_model_set_parameters(apop_normal, NAN, 1)); \endcode This can be used anywhere the original Normal distribution can be. To give another example, if we need to truncate the distribution in the data space: \code //The constraint function. double over_zero(apop_data *in, apop_model *m){ return apop_data_get(in) > 0; } apop_model *trunc = apop_model_dconstrain(.base_model=N_sigma1, .constraint=over_zero); \endcode Chaining together simpler transformations is an easy method to produce models of arbitrary detail. In the following example: \li Nature generated data using a mixture of three Poisson distributions, with \f$\lambda=2.8\f$, \f$2.0\f$, and \f$1.3\f$. The resulting model is generated using \ref apop_model_mixture. \li Not knowing the true distribution, the analyst models the data with a single Poisson\f$(\lambda)\f$ distribution with a prior on \f$\lambda\f$. The prior selected is a truncated Normal(2, 1), generated by sending the stock \ref apop_normal model to the data-space constraint function \ref apop_dconstrain. \li The \ref apop_update function takes three arguments: the data set, which comes from draws from the mixture, the prior, and the likelihood. It produces an output model which, in this case, is a PMF describing a distribution over \f$\lambda\f$, because a truncated Normal and a Poisson are not conjugate distributions. 
Knowing that it is a PMF, the ->data element holds a set of draws from the posterior. \li The analyst would like to present an approximation to the posterior in a simpler form, and so finds the parameters \f$\mu\f$ and \f$\sigma\f$ of the Normal distribution that is closest to that posterior. Here is a program---almost a single line of code---that builds the final approximation to the posterior model from the subcomponents, including draws from Nature and the analyst's prior and likelihood: \include ../eg/transform.c \section mathmethods Model methods \li\ref apop_estimate : estimate the parameters of the model with data. \li\ref apop_predict : the expected value function. \li\ref apop_draw : random draws from an estimated model. \li\ref apop_p : the probability of a given data set given the model. \li\ref apop_log_likelihood : the log of \ref apop_p \li\ref apop_score : the derivative of \ref apop_log_likelihood \li\ref apop_model_print : write model components to the screen or a file \li\ref apop_model_copy : duplicate a model \li\ref apop_model_set_parameters : Use this to convert a Normal(\f$\mu\f$, \f$\sigma\f$) with unknown \f$\mu\f$ and \f$\sigma\f$ into a Normal(0, 1), for example. \li\ref apop_model_free \li\ref apop_model_clear , \ref apop_prep : remove the parameters from a parameterized model. Used infrequently. \li\ref apop_model_draws : many random draws from an estimated model. \li\ref apop_update : Bayesian updating \li\ref apop_model_coordinate_transform : apply an invertible transformation to the data space \li\ref apop_model_dconstrain : constrain the data space of a model to a subspace. E.g., truncate a Normal distribution so \f$x>0\f$. 
\li\ref apop_model_fix_params : hold some parameters constant \li\ref apop_model_mixture : a linear combination of models \li\ref apop_model_cross : If \f$(d_1)\f$ has a Normal\f$(\mu, \sigma)\f$ distribution and \f$d_2\f$ has an independent Poisson\f$(\lambda)\f$ distribution, then \f$(d_1, d_2)\f$ has an apop_model_cross(apop_normal, apop_poisson) distribution with parameters \f$(\mu, \sigma, \lambda)\f$. \section modelsettings Settings groups Describing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. An MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of bins to be filled with data. So, the \ref apop_model includes a single list which can hold an arbitrary number of settings groups, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type. Settings groups are automatically initialized with default values when needed. If the defaults do no harm, then you don't need to think about these settings groups at all. Here is an example where a settings group is worth tweaking: the \ref apop_parts_wanted_settings group indicates which parts of the auxiliary data you want. \code 1 apop_model *m = apop_model_copy(apop_ols); 2 Apop_settings_add_group(m, apop_parts_wanted, .covariance='y'); 3 apop_model *est = apop_estimate(data, m); \endcode Line one establishes the baseline form of the model. Line two adds a settings group of type \ref apop_parts_wanted_settings to the model. By default other auxiliary items, like the expected values, are set to \c 'n' when using this group, so this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want. Notice that the \c _settings ending to the settings group's name isn't written---macros make it happen. 
The remaining arguments to \c Apop_settings_add_group (if any) follow the \ref designated syntax of the form .setting=value. There is an \ref apop_model_copy_set macro that adds a settings group when it is first copied, joining up lines one and two above: \code apop_model *m = apop_model_copy_set(apop_ols, apop_parts_wanted, .covariance='y'); \endcode Settings groups are copied with the model, which facilitates chaining estimations. Continuing the above example, you could re-estimate to get the predicted values and covariance via: \code Apop_settings_set(est, apop_parts_wanted, predicted, 'y'); apop_model *est2 = apop_estimate(data, est); \endcode Maximum likelihood search has many settings that could be modified, and so provides another common example of using settings groups: \code apop_model *the_estimate = apop_estimate(data, apop_probit); //Redo the Probit's MLE search using Newton's Method: Apop_settings_add_group(the_estimate, apop_mle, .verbose='y', .tolerance=1e-4, .method="Newton"); apop_model *re_est = apop_estimate(data, the_estimate); \endcode To clarify the distinction between parameters and settings, note that parameters are estimated from the data, often via a maximum likelihood search. In an ML search, the method of search, the number of bins in a histogram, or the number of steps in a simulation would be held fixed as the search iterates over possible parameters (and if these settings do change, then that is a meta-model that could be encapsulated into another \ref apop_model). As a consequence, parameters are always numeric, while settings may be any type. \li \ref Apop_settings_set, for modifying a single setting, doesn't use the designated initializers format. \li Because the settings groups are buried within the model, debugging them can be a pain. Here is a documented macro for \c gdb that will help you pull a settings group out of a model for your inspection, to cut and paste into your .gdbinit. 
It shouldn't be too difficult to modify this macro for other debuggers. \code define get_group set $group = ($arg1_settings *) apop_settings_get_grp( $arg0, "$arg1", 0 ) p *$group end document get_group Gets a settings group from a model. Give the model name and the name of the group, like get_group my_model apop_mle and I will set a gdb variable named $group that points to that model, which you can use like any other pointer. For example, print the contents with p *$group The contents of $group are printed to the screen as visible output to this macro. end \endcode For using a model, that's all of what you need to know. For details on writing a new settings group, see \ref settingswriting . \li\ref Apop_settings_add_group \li\ref Apop_settings_set \li\ref Apop_settings_get : get a single element from a settings group. \li\ref Apop_settings_get_group : get the whole settings group. */ /** \page dataones Data format for regression-type models Regression-type estimations typically require a constant column. That is, the 0th column of the data is a constant (one), so the parameter \f$\beta_0\f$ is slightly special in corresponding to a constant rather than a variable. Some stats packages implicitly assume a constant column, which the user never sees. This violates the principle of transparency upon which Apophenia is based. Given a data matrix \f$X\f$ with the estimated parameters \f$\beta\f$, if the model asserts that the product \f$X\beta\f$ has meaning, then you should be able to easily calculate that product. With a ones column, a dot product is one line: apop_dot(x, your_est->parameters); without a ones column, one would basically have to construct one (using \c gsl_matrix_set_all and \c apop_data_stack). Each regression-type estimation has one dependent variable and several independent. In the end, we want the dependent variable to be in the vector element. 
Removing a column from a gsl_matrix and adjusting all subsequent columns is relatively difficult, because (like most structs built with the aim of very efficient processing) the struct depends on an equal spacing in memory between each element. The automatic case We can resolve both the need for a ones column and for having the dependent column in the vector at the same time. Given a data set with no vector element and the dependent variable in the first column of the matrix, we can copy the dependent variable into the vector and then replace the first column of the matrix with ones. The result fits all of the above expectations. You as a user merely have to send in a \c apop_data set with \c NULL vector and a dependent column in the first column. If the data is coming from the database, then the query is natural: \code apop_data *regression_data = apop_query_to_data("select depvar, indyvar1, indyvar2, indyvar3 from dataset"); apop_model_print(apop_estimate(regression_data, apop_ols)); \endcode The already-prepped case If your data has a vector element, then the prep routines won't change anything. If you don't want to use a constant column, or your data has already been prepped by an estimation, then this is what you want. \code apop_data *regression_data = apop_query_to_mixed_data("vmmm", "select depvar, indyvar1, indvar2, indvar3 from dataset"); apop_model_print(apop_estimate(regression_data, apop_logit)); \endcode */ /** \page testpage Tests & diagnostics Here is the model for all hypothesis testing within Apophenia: \li Calculate a statistic. \li Describe the distribution of that statistic. \li Work out how much of the distribution is (above|below|closer to zero than) the statistic. There are a handful of named tests that produce a known statistic and then compare to a known distribution, like \ref apop_test_kolmogorov or \ref apop_test_fisher_exact. For traditional distributions (Normal, \f$t\f$, \f$\chi^2\f$), use the \ref apop_test convenience function. 
In especially common cases, like the parameters from an OLS regression, the commonly-associated \f$t\f$ test is included as part of the estimation output, typically as a row in the \c info element of the output \ref apop_model. \li\ref apop_test \li\ref apop_paired_t_test \li\ref apop_f_test \li\ref apop_t_test \li\ref apop_test_anova_independence \li\ref apop_test_fisher_exact \li\ref apop_test_kolmogorov \li\ref apop_estimate_coefficient_of_determination \li\ref apop_estimate_r_squared See also these Monte Carlo methods: \li\ref apop_bootstrap_cov \li\ref apop_jackknife_cov To give another example of testing, here is a function that was briefly a part of Apophenia, but seemed a bit out of place. Here it is as a sample: \code // Input: any vector, which will be normalized in place. Output: 1 - the p-value // for a chi-squared test to answer the question, "with what confidence can I // reject the hypothesis that the variance of my data is zero?" double apop_test_chi_squared_var_not_zero(gsl_vector *in){ Apop_stopif(!in, return NAN, 0, "input vector is NULL. Doing nothing."); apop_vector_normalize(in, .normalization_type='s'); double sum=apop_vector_map_sum(in, gsl_pow_2); return gsl_cdf_chisq_P(sum, in->size); } \endcode Or, consider the Rao statistic, \f${\partial\over \partial\beta}\log L(\beta)'I^{-1}(\beta){\partial\over \partial\beta}\log L(\beta)\f$ where \f$L\f$ is a model's likelihood function and \f$I\f$ its information matrix. 
In code:

\code
apop_data * infoinv = apop_model_numerical_covariance(data, your_model);
apop_data * score = &(apop_data){.vector=apop_numerical_gradient(data, your_model)};
apop_data * stat = apop_dot(apop_dot(score, infoinv), score);
\endcode

Given the correct assumptions, this is \f$\sim \chi^2_m\f$, where \f$m\f$ is the dimension of \f$\beta\f$, so the odds of a Type I error given the model are:

\code
double p_value = apop_test(stat, "chi squared", beta->size);
\endcode

Generalized parameter tests

But if your model is not from the textbook, then you have the tools to apply the above three-step process to the parameters of any \ref apop_model.

\li Model parameters are a statistic, and you know that apop_estimate(your_data, your_model) will output a model with a parameters element.
\li \ref apop_parameter_model will return an \ref apop_model describing the distribution of these parameters.
\li We now have the two ingredients to send to \ref apop_cdf, which takes in a model and a data point and returns the area under the data point.

Defaults for the parameter models are filled in via bootstrapping or resampling, meaning that if your model's parameters are decidedly off the Normal path, you can still test claims about the parameters.

The introductory example in \ref gentle ran a standard OLS regression, whose output includes some standard hypothesis tests; to conclude, let us go the long way and replicate those results via the general \ref apop_parameter_model mechanism. The results here will of course be identical, but the more general mechanism can be used in situations where the standard models don't apply. The first part of this program is identical to the introductory program, using \c ss08pdc.csv if you have downloaded it as per the instructions in \ref gentle, or a simple sample data set if not.
The second half executes the three steps, using many of the above features: one of the inputs to \ref apop_parameter_model (which row of the parameter set to use) is sent by adding a settings group, we pull that row into a separate data set using \ref Apop_r, and we set its vector value by referring to it as the -1st element. \include ols2.c Note that the procedure did not assume the model parameters had a certain form. It queried the model for the distribution of parameter \c agep, and if the model didn't have a closed-form answer then a distribution via bootstrap would be provided. Then that model was queried for its CDF. [The procedure does assume a symmetric distribution. Fixing this is left as an exercise for the reader.] For a model like OLS, this is entirely overkill, which is why OLS provides the basic hypothesis tests automatically. But for models where the distribution of parameters is unknown or has no closed-form solution, this may be the only recourse. */ /** \page histosec Empirical distributions and PMFs (probability mass functions) The \ref apop_pmf model wraps an \ref apop_data set so it can be read as an empirical model, with a likelihood function (equal to the associated weight for observed values and zero for unobserved values), a random number generator (which simply makes weighted random draws from the data), and so on. Setting it up is a model estimation from data like any other, done via \ref apop_estimate(\c your_data, \ref apop_pmf). You have the option of cleaning up the data before turning it into a PMF. For example... \code apop_data_pmf_compress(your_data); //remove duplicates apop_data_sort(your_data); apop_vector_normalize(your_data->weights); //weights sum to one apop_model *a_pmf = apop_estimate(your_data, apop_pmf); \endcode These are largely optional. \li The CDF is calculated based on the percent of the weights between the zeroth row of the PMF and the row specified. This generally makes more sense after \ref apop_data_sort.
\li Compression produces a corresponding improvement in efficiency when first calculating CDFs, but is otherwise not necessary. \li Sorting or normalizing is not necessary for making draws or getting a likelihood or log likelihood. It is the \c weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no \c weights segment, then I assume that all rows have equal weight. For a PMF model, the \c parameters are \c NULL, and the \c data itself is used for calculation. Therefore, modifying the data post-estimation can break some internal settings set during estimation. If you modify the data, throw away any existing PMFs (via \ref apop_model_free) and re-estimate a new one. \section histocompare Comparing histograms Using \ref apop_data_pmf_compress puts the data into one bin for each unique value in the data set. You may instead want bins of fixed width, in the style of a histogram, which you can get via \ref apop_data_to_bins. It requires a bin specification. If you send a \c NULL binspec, then the offset is zero and the bin size is big enough to ensure that there are \f$\sqrt{N}\f$ bins from minimum to maximum. The binspec will be added as a page to the data set, named "\<binspec\>". See the \ref apop_data_to_bins documentation on how to write a custom bin spec. There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In every case, you will need two distributions based on the same binspec.
For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync: \code apop_data_to_bins(first_set, NULL); apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>")); \endcode You can use \ref apop_test_kolmogorov or \ref apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins. Kernel density estimation will produce a smoothed PDF. See \ref apop_kernel_density for details. Or, use \ref apop_vector_moving_average for a simpler smoothing method. \li\ref apop_data_pmf_compress() : merge together redundant rows in a data set before calling \ref apop_estimate(\c your_data, \ref apop_pmf); optional. \li\ref apop_vector_moving_average() : smooth a vector (e.g., your_pmf->data->weights) via moving average. \li\ref apop_histograms_test_goodness_of_fit() : goodness-of-fit via \f$\chi^2\f$ statistic \li\ref apop_test_kolmogorov() : goodness-of-fit via Kolmogorov-Smirnov statistic \li\ref apop_kl_divergence() : measure the information loss from one (typically empirical) distribution to another distribution. */ /** \page maxipage Optimization This section includes some notes on the maximum likelihood routine. As in the section on writing models above, if a model has a \c p or \c log_likelihood method but no \c estimate method, then calling \c apop_estimate(your_data, your_model) executes the default estimation routine of maximum likelihood. If you are not a statistician, then there are a few things you will need to keep in mind: \li Physicists, pure mathematicians, and the GSL minimize; economists, statisticians, and Apophenia maximize. If you are doing a minimization, be sure that your function returns minus the objective function's value. \li The overall setup is about estimating the parameters of a model with data.
The user provides a data set and an unparameterized model, and the system tries parameterized models until one of them is found to be optimal. The data is fixed. The optimization tries a series of parameterized models, searching for the one that is most likely. In a non-stats setting, you may have \c NULL data. \li Because the unit of analysis is a parameterized model, not just parameters, you need to have an \ref apop_model wrapping your objective function. This example, to be discussed in detail below, optimizes Rosenbrock's banana function, \f$(1-x)^2+ s(y - x^2)^2\f$, where the scaling factor \f$s\f$ is fixed ahead of time, say at 100. \include ../eg/banana.c The \c banana function returns a single number to be minimized. You will need to write an \ref apop_model to send to the optimizer, which is a two step process: write a log likelihood function wrapping the real objective function (\c ll), and a model that uses that log likelihood (\c b). \li The .vsize=2 part of the declaration of \c b on the second line of main() specified that the model's parameters are a vector of size two. That is, the list of doubles to send to \c banana is set in in->parameters->vector->data. \li The \c more element of the \ref apop_model structure is designed to hold any arbitrary structure of size \c more_size, which is useful for models that require additional constants or other settings, like the coeff_struct here. See \ref settingswriting for more on handling model settings. \li Statisticians want the covariance and basic tests about the parameters. This line shuts off all auxiliary calculations: \code Apop_settings_add_group(your_model, apop_parts_wanted); \endcode See the documentation for \ref apop_parts_wanted_settings for details about how this works. 
It can also offer quite the speedup: especially for high-dimensional problems, finding the covariance matrix without any additional information can take dozens of evaluations of the objective function for each evaluation that is part of the search itself. \li MLEs have an especially large number of parameter tweaks that could be made; see the \ref apop_mle_settings page. \li As a useful diagnostic, you can add a \c NULL \ref apop_data set to the MLE settings in the .path slot, and it will be allocated and filled with the sequence of points tried by the optimizer. \li The program has some extras above and beyond the necessary: it uses two methods (notice how easy it is to re-run an estimation with an alternate method, but the syntax for modifying a setting differs from the initialization syntax) and checks that the results are accurate. \section constr Setting Constraints The problem is that the parameters of a function must not take on certain values, either because the function is undefined for those values or because parameters with certain values would not fit the real-world problem. If you give the optimizer an unconstrained likelihood function plus a separate constraint function, \ref apop_maximum_likelihood will combine them to a function that is continuous at the constraint boundary, but which is guaranteed to never have an optimum outside of the constraint. A constraint function must do three things: \li If the constraint does not bind (i.e. the parameter values are OK), then it must return zero. \li If the constraint does bind, it must return a penalty, that indicates how far off the parameter is from meeting the constraint. \li If the constraint does bind, it must set a return vector that the likelihood function can take as a valid input. The penalty at this returned value must be zero. 
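As a library-free illustration of these three requirements, here is a minimal plain-C sketch. The names \c at_least_constraint, \c raw_objective, and \c constrained_objective are hypothetical and are not part of Apophenia; the sketch only mimics the combination described in this section, and it assumes the penalty is the total distance by which the parameters had to be moved.

```c
#include <stddef.h>

// Hypothetical constraint: every parameter must be at least `floor_val`.
// Requirement 1: return zero when the constraint does not bind.
// Requirement 2: otherwise return a penalty, here the total distance
//                by which the offending parameters fall short.
// Requirement 3: overwrite params with a valid point on the boundary,
//                where a second call would return a penalty of zero.
double at_least_constraint(double *params, size_t n, double floor_val){
    double penalty = 0;
    for (size_t i = 0; i < n; i++)
        if (params[i] < floor_val){
            penalty += floor_val - params[i];
            params[i] = floor_val;
        }
    return penalty;
}

// Hypothetical unconstrained objective, peaked at (-1, 3).
double raw_objective(const double *p){
    return -(p[0] + 1)*(p[0] + 1) - (p[1] - 3)*(p[1] - 3);
}

// The combination: the objective at the (possibly moved) valid point,
// minus the penalty. Inside the valid region the penalty is zero, so
// the raw objective is returned unchanged.
double constrained_objective(const double *p_in){
    double p[2] = {p_in[0], p_in[1]};
    double penalty = at_least_constraint(p, 2, 0.0);
    return raw_objective(p) - penalty;
}
```

With a floor of zero, the point (-3, 4) is moved to (0, 4) at a penalty of 3, so the combined objective there is the raw objective at (0, 4), which is -2, minus 3. The penalty shrinks to zero as a point approaches the boundary from outside, which is what keeps the combined function continuous there.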
The idea is that if the constraint returns zero, the log likelihood function will return the log likelihood as usual, and if not, it will return the log likelihood at the constraint's return vector minus the penalty. To give a concrete example, here is a constraint function that will ensure that both parameters of a two-dimensional input are greater than zero, and that their sum is greater than two. As with the constraints for many of the models that ship with Apophenia, it is a wrapper for \ref apop_linear_constraint. \code static long double greater_than_zero_constraint(apop_data *data, apop_model *v){ static apop_data *constraint = NULL; if (!constraint) constraint= apop_data_falloc((3,3,2), 0, 1, 0, //0 < 1x + 0y 0, 0, 1, //0 < 0x + 1y 2, 1, 1); //2 < 1x + 1y return apop_linear_constraint(v->parameters->vector, constraint, 1e-3); } \endcode \li\ref apop_linear_constraint() \section simanneal Notes on simulated annealing For convex optimizations, methods like conjugate gradient search work well, and for relatively smooth optimizations, the Nelder-Mead simplex algorithm is a good choice. For situations where the surface being searched may have several local optima and be otherwise badly behaved, there is simulated annealing. Simulated annealing is a controlled random walk. As with the other methods, the system tries a new point, and if it is better, switches. Initially, the system is allowed to make large jumps, and then with each iteration, the jumps get smaller, eventually converging. Also, there is some decreasing probability that if the new point is less likely, it will still be chosen. Simulated annealing is best for situations where there may be multiple local optima. Early in the random walk, the system can readily jump from one to another; later it will fine-tune its way toward the optimum. The number of points tested is determined by the parameters of the simulated cooling program, not the values returned by the likelihood function.
If you know your function is globally convex (as are most standard probability functions), then this method is overkill. \section mlfns Useful functions \li\ref apop_estimate_restart : Restarting an MLE with different settings can improve results. \li\ref apop_maximum_likelihood : Rarely called directly. If a model has no \c estimate element, call \ref apop_estimate to prep the model and run an MLE. \li\ref apop_model_numerical_covariance \li\ref apop_numerical_gradient */ /** \page moreasst Assorted Some functions for missing data: \li\ref apop_data_listwise_delete \li\ref apop_ml_impute A few more descriptive methods: \li\ref apop_matrix_pca : Principal component analysis \li\ref apop_anova : One-way or two-way ANOVA tables \li\ref apop_rake : Iterative proportional fitting on large, sparse tables General utilities: \li\ref Apop_stopif : Apophenia's error-handling and warning-printing macro. \li\ref apop_opts : the global options \li\ref apop_system : a printf-style wrapper around the standard \c system function. A few more math utilities: \li\ref apop_matrix_is_positive_semidefinite \li\ref apop_matrix_to_positive_semidefinite \li\ref apop_generalized_harmonic \li\ref apop_multivariate_gamma \li\ref apop_multivariate_lngamma \li\ref apop_rng_alloc */ /** \page mingw MinGW Minimalist GNU for Windows is indeed minimalist: it is not a full POSIX subsystem, and provides no package manager. Therefore, you will have to make some adjustments and install the dependencies yourself. Matt P. Dziubinski successfully used Apophenia via MinGW; here are his instructions (with edits by BK): \li get libregex (the ZIP file) from: http://sourceforge.net/project/showfiles.php?group_id=204414&package_id=306189 \li get libintl (three ZIP files) from: http://gnuwin32.sourceforge.net/packages/libintl.htm . 
download "Binaries", "Dependencies", "Developer files" \li follow "libintl" steps from: http://kayalang.org/download/compiling/windows \li Modify \c Makefile, adding -lpthread to AM_CFLAGS (removing -pthread) and -lregex to AM_CFLAGS and LIBS \li Now compile the main library: \code make \endcode \li Finally, put one more expected directory in place and install: \code mkdir -p -- "/usr/local/Lib/site-packages" make install \endcode \li You will get the usual warning about library paths, and may have to take the specified action: \code ---------------------------------------------------------------------- Libraries have been installed in: /usr/local/lib If you ever happen to want to link against installed libraries in a given directory, LIBDIR, you must either use libtool, and specify the full pathname of the library, or use the `-LLIBDIR' flag during linking and do at least one of the following: - add LIBDIR to the `PATH' environment variable during execution - add LIBDIR to the `LD_RUN_PATH' environment variable during linking - use the `-LLIBDIR' linker flag See any operating system documentation about shared libraries for more information, such as the ld(1) and ld.so(8) manual pages. ---------------------------------------------------------------------- \endcode */ /* optionaldetails Implementation of optional arguments [this section ignored by doxygen] Optional and named arguments are among the most commonly commented-on features of Apophenia, so this page goes into full detail about the implementation. To use these features, see the all-you-really-need summary at the \ref designated page. For a background and rationale, see the blog entry at http://modelingwithdata.org/arch/00000022.htm . I'll assume you've read both links before continuing. OK, now that you've read the how-to-use and the discussion of how optional and named arguments can be constructed in C, this page will show how they are done in Apophenia. 
The level of details should be sufficient to implement them in your own code if you so desire. There are three components to the process of generating optional arguments as implemented here: \li Produce a \c struct whose elements match the arguments to the function. \li Write a wrapper function that takes in the struct, unpacks it, and calls the original function. \li Write a macro that makes the user think the wrapper function is the real thing. None of these steps are really rocket science, but there is a huge amount of redundancy. Apophenia includes some macros that reduce the boilerplate redundancy significantly. There are two layers: the C-standard code, and the script that produces the C-standard code. We'll begin with the C-standard header file: \code #ifdef APOP_NO_VARIADIC void apop_vector_increment(gsl_vector * v, int i, double amt); #else void apop_vector_increment_base(gsl_vector * v, int i, double amt); apop_varad_declare(void, apop_vector_increment, gsl_vector * v; int i; double amt); #define apop_vector_increment(...) apop_varad_link(apop_vector_increment, __VA_ARGS__) #endif \endcode First, there is an if/else that allows the system to degrade gracefully if you are sending C code to a parser like swig, whose goals differ too much from straight C compilation for this to work. Set \c APOP_NO_VARIADIC to produce a plain function with no variadic support. Else, we begin the above steps. The \c apop_varad_declare line expands to the following: \code typedef struct { gsl_vector * v; int i; double amt ; } variadic_type_apop_vector_increment; void variadic_apop_vector_increment(variadic_type_apop_vector_increment varad_in); \endcode So there's the ad-hoc struct and the declaration for the wrapper function. Notice how the arguments to the macro had semicolons, like a struct declaration, rather than commas, because the macro does indeed wrap the arguments into a struct. 
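As a standalone illustration of the three components, stripped of Apophenia's macros, here is a minimal sketch in plain C. Every name in it (\c int_power, \c int_power_args, \c int_power_wrapper, \c int_power_base) is hypothetical and chosen only for this example; it assumes a compiler that supports C99 compound literals and designated initializers.

```c
// Component 1: a struct whose elements match the function's arguments.
typedef struct {
    double base;
    int exponent;
    double scale;
} int_power_args;

// The base function, with ordinary positional arguments:
// returns scale * base^exponent for a nonnegative integer exponent.
double int_power_base(double base, int exponent, double scale){
    double out = scale;
    for (int i = 0; i < exponent; i++) out *= base;
    return out;
}

// Component 2: a wrapper that takes in the struct, fills in a default
// for every element left at zero, and calls the base function.
double int_power_wrapper(int_power_args in){
    double base = in.base ? in.base : 2;
    int exponent = in.exponent ? in.exponent : 1;
    double scale = in.scale ? in.scale : 1;
    return int_power_base(base, exponent, scale);
}

// Component 3: a macro that wraps whatever the user types into a
// compound literal of the struct type, so the call looks like a
// function call with optional, named arguments.
#define int_power(...) int_power_wrapper((int_power_args){__VA_ARGS__})
```

With these definitions, int_power(3, 2), int_power(.exponent=3), and int_power(.scale=10, .base=3) are all valid calls, filling in the defaults of base 2, exponent 1, and scale 1 where an argument is omitted. Two caveats of this simple form: an argument explicitly set to zero is indistinguishable from an omitted one, and a call with no arguments at all produces an empty initializer list, which some compilers accept only as an extension.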
Here is what the \c apop_varad_link would expand to: \code #define apop_vector_increment(...) variadic_apop_vector_increment((variadic_type_apop_vector_increment) {__VA_ARGS__}) \endcode That gives us part three: a macro that lets the user think that they are making a typical function call with a set of arguments, but wraps what they type into a struct. Now for the code file where the function is defined. Again, there is an \c APOP_NO_VARIADIC wrapper. Inside the interesting part, we find the wrapper function to unpack the struct that comes in. \code #ifdef APOP_NO_VARIADIC void apop_vector_increment(gsl_vector * v, int i, double amt){ #else apop_varad_head( void , apop_vector_increment){ gsl_vector * apop_varad_var(v, NULL); Apop_stopif(!v, return, 0, "You sent me a NULL vector."); int apop_varad_var(i, 0); double apop_varad_var(amt, 1); apop_vector_increment_base(v, i, amt); } void apop_vector_increment_base(gsl_vector * v, int i, double amt){ #endif v->data[i * v->stride] += amt; } \endcode The \c apop_varad_head macro reduces redundancy, and will expand to \code void variadic_apop_vector_increment (variadic_type_apop_vector_increment varad_in) \endcode The function with this header thus takes in a single struct, and for every variable, there is a line like \code double apop_varad_var(amt, 1); \endcode which simply expands to: \code double amt = varad_in.amt ? varad_in.amt : 1; \endcode Thus, the macro declares each not-in-struct variable, and so there will need to be one such declaration line for each argument. Apart from requiring declarations, you can be creative: include sanity checks, post-vary the variables of the inputs, unpack without the macro, and so on. That is, this parent function does all of the bookkeeping, checking, and introductory shunting, so the base function can do the math. Finally, the introductory section will call the base function.
The setup goes out of its way to leave the \c _base function in the public namespace, so that those who would prefer speed to bounds-checking can simply call that function directly, using standard notation. You could eliminate this feature by merging the two functions. The m4 script The above is all you need to make this work: the varad.h file, and the above structures. But there is still a lot of redundancy, which can't be eliminated by the plain C preprocessor. Thus, in Apophenia's code base (the one you'll get from checking out the git repository, not the gzipped distribution that has already been post-processed) you will find a pre-preprocessing script that converts a few markers to the above form. Here is the code that will expand to the above C-standard code: \code //header file APOP_VAR_DECLARE void apop_vector_increment(gsl_vector * v, int i, double amt); //code file APOP_VAR_HEAD void apop_vector_increment(gsl_vector * v, int i, double amt){ gsl_vector * apop_varad_var(v, NULL); Apop_stopif(!v, return, 0, "You sent me a NULL vector."); int apop_varad_var(i, 0); double apop_varad_var(amt, 1); APOP_VAR_END_HEAD v->data[i * v->stride] += amt; } \endcode It is obviously much shorter. The declaration line is actually a C-standard declaration with the \c APOP_VAR_DECLARE preface, so you don't have to remember when to use semicolons. The function itself looks like a single function, but there is again a marker before the declaration line, and the introductory material is separated from the main matter by the \c APOP_VAR_END_HEAD line. Done right, drawing a line between the introductory checks or initializations and the main function can improve readability. The m4 script inserts a return function_base(...) at the end of the header function, so you don't have to. 
If you want to call the function before the last line, you can do so explicitly, as in the expansion above, and add a bare return; to guarantee that the call to the base function that the m4 script will insert won't ever be reached. One final detail: it is valid to have types with commas in them---function arguments. Because commas get turned to semicolons, and m4 isn't a real parser, there is an exception built in: you will have to replace commas with exclamation marks in the header file (only). E.g., \code APOP_VAR_DECLARE apop_data * f_of_f(apop_data *in, void *param, int n, double (*fn_d)(double ! void * !int)); \endcode m4 is POSIX standard, so even if you can't read the script, you have the program needed to run it. For example, if you name it \c prep_variadics.m4, then run \code m4 prep_variadics.m4 myfile.m4.c > myfile.c \endcode */ /** \page gentle A quick overview This is a "gentle introduction" to the Apophenia library. It is intended to give you some initial bearings on the typical workflow and the concepts and tricks that the manual pages assume you are familiar with. If you want to install Apophenia now so you can try the samples on this page, see the \ref setup page. An outline of this overview: \li Apophenia fills a space between traditional C libraries and stats packages. \li The \ref apop_data structure represents a data set (of course). Data sets are inherently complex, but there are many functions that act on \ref apop_data sets to make life easier. \li The \ref apop_model encapsulates the sort of actions one would take with a model, like estimating model parameters or predicting values based on new inputs. \li Databases are great, and a perfect fit for the sort of paradigm here. Apophenia provides functions to make it easy to jump between database tables and \ref apop_data sets. 
The opening example Setting aside the more advanced applications and model-building tasks, let us begin with the workflow of a typical fitting-a-model project using Apophenia's tools: \li Read the raw data into the database using \ref apop_text_to_db. \li Use SQL queries handled by \ref apop_query to massage the data as needed. \li Use \ref apop_query_to_data to pull some of the data into an in-memory \ref apop_data set. \li Call a model estimation such as \code apop_estimate (data_set, apop_ols)\endcode or \code apop_estimate (data_set, apop_probit)\endcode to fit parameters to the data. This will return an \ref apop_model with parameter estimates. \li Interrogate the returned estimate, by dumping it to the screen with \ref apop_model_print, sending its parameters and variance-covariance matrices to additional tests (the \c estimate step runs a few for you), or sending the model's output as input to another model. Here is an example of most of the above steps which you can compile and run, to be discussed in detail below. The program relies on the U.S. Census's American Community Survey public use microdata for DC 2008, which you can get from the command line via: \code wget https://raw.github.com/rodri363/tea/master/demo/ss08pdc.csv \endcode or by pointing your browser to that address and saving the file. The program: \code #include <apop.h> int main(){ apop_text_to_db(.text_file="ss08pdc.csv", .tabname="dc"); apop_data *data = apop_query_to_data("select log(pincp+10) as log_income, agep, sex " "from dc where agep+ pincp+sex is not null and pincp>=0"); apop_model *est = apop_estimate(data, apop_ols); apop_model_print(est); } \endcode If you saved the code to census.c and don't have a \ref makefile or other build system, then you can compile it with \code gcc census.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o census \endcode or \code clang census.c -lapophenia -lgsl -lgslcblas -lsqlite3 -o census \endcode and then run it with ./census.
This compile line will work on any system with all the requisite tools, but for full-time work with this or any other C library, you will probably want to write a \ref makefile. The results are unremarkable---age has a positive effect on income, and sex (1=male, 2=female) has a negative effect---but it does give us some lines of code to dissect. The first two lines in \c main() make use of a database. I'll discuss the value of the database step more at the end of this page, but for now, note that there are several functions, \ref apop_query and \ref apop_query_to_data being the ones you will most frequently be using, that will allow you to talk to and pull data from either an SQLite or mySQL/mariaDB database. The database is a natural place to do data processing like renaming variables, selecting subsets, and transforming values. Designated initializers Like this line, \code apop_text_to_db(.text_file="data", .tabname="d"); \endcode many Apophenia functions accept named, optional arguments. To give another example, the \ref apop_data set has the usual row and column numbers, but also row and column names.
So you should be able to refer to a cell by any combination of name or number; for the data set you read in above, which has column names, all of the following work: \code x = apop_data_get(data, 2, 3); //observation 2, column 3 x = apop_data_get(data, .row=2, .colname="sex"); // same apop_data_set(data, 2, 3, 1); apop_data_set(data, .colname="sex", .row=2, .val=1); \endcode Default values mean that the \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr functions handle matrices, vectors, and scalars sensibly: \code //Let v be a hundred-element vector: apop_data *v = apop_data_alloc(100); [fill with data here] double x1 = apop_data_get(v, 10); apop_data_set(v, 2, .val=x1); //A 100x1 matrix behaves like a vector apop_data *m = apop_data_alloc(100, 1); [fill with data here] double m1 = apop_data_get(m, 1); //let s be a scalar stored in a 1x1 apop_data set: apop_data *s = apop_data_alloc(1); double *scalar = apop_data_ptr(s); \endcode These conveniences may be new to users of less user-friendly C libraries, but they fully conform to the C standard (ISO/IEC 9899:2011). See the \ref designated page for details. \section apop_data A lot of real-world data processing is about quotidian annoyances about text versus numeric data or dealing with missing values, and the \ref apop_data set and its many support functions are intended to make data processing in C easy. Some users of Apophenia use the library only for its \ref apop_data set and associated functions. See \ref dataoverview for extensive notes on using the structure. The structure includes seven parts: \li a vector, \li a matrix, \li a grid of text elements, \li a vector of weights, \li names for everything: row names, a vector name, matrix column names, text names, \li a link to a second page of data, and \li an error marker This is not a generic and abstract ideal, but is the sort of mess that real-world data sets look like. For example, here is some data for a weighted OLS regression.
It includes an outcome variable in the vector, dependent variables in the matrix and text grid, replicate weights, and column names in bold labeling the variables: \htmlinclude apop_data_fig.html \latexinclude apop_data_fig.tex Apophenia's functions generally assume that one row across all of these elements describes a single observation or data point. See above for some examples of getting and setting individual elements. Also, \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr consider the vector to be the -1st column, so using the data set in the figure, apop_data_get(sample_set, .row=0, .col=-1) == 1. Reading in data As per the example above, use \ref apop_text_to_data or \ref apop_text_to_db and then \ref apop_query_to_data. Subsets There are many macros to get views of subsets of the data. Each generates a disposable wrapper around the base data: once the variable goes out of scope, the wrapper disappears, but modifications made to the data in the view are modifications to the base data itself. \include simple_subsets.c All of these slicing routines are macros, because they generate several background variables in the current scope (something a function can't do). Traditional custom is to put macro names in all caps, like \c APOP_DATA_ROWS, which to modern sensibilities looks like yelling. The custom has a logic: there are ways to hang yourself with macros, so it is worth distinguishing them typographically. Apophenia tones it down by capitalizing only the first letter. Basic manipulations See \ref dataoverview for a list of many other manipulations of data sets, such as \ref apop_data_listwise_delete for quick-and-dirty removal of observations with NaNs, \ref apop_data_split / \ref apop_data_stack, or \ref apop_data_sort to sort all elements by a single column. Apply and map If you have an operation of the form for each element of my data set, call this function, then you can use \ref apop_map to do it. 
You could basically do everything you can do with an apply/map function via a \c for loop, but the apply/map approach is clearer and more fun. Also, if you set OpenMP's omp_set_num_threads(N) for any \c N greater than 1 (the default on most systems is the number of CPU cores), then the work of mapping will be split across multiple CPU threads. See \ref mapply for a number of examples. Text String handling in C usually requires some tedious pointer and memory handling, but the functions to put strings into the text grid in the \ref apop_data structure and get them out again will do the pointer shunting for you. The \ref apop_text_alloc function is really a realloc function: you can use it to resize the text grid as necessary. The \ref apop_text_set function will write a single string to the grid, though you may be using \ref apop_query_to_text or \ref apop_query_to_mixed_data to read in an entire data set at once. Functions that act on entire data sets, like \ref apop_data_rm_rows, handle the text part as well. The text grid for \c your_data has your_data->textsize[0] rows and your_data->textsize[1] columns. If you are using only the functions to this point, then empty elements are a blank string (""), not \c NULL. For reading individual elements, refer to the \f$(i,j)\f$th text element via your_data->text[i][j]. Errors Many functions will set the error element of the \ref apop_data structure being operated on if anything goes wrong. You can use this to halt the program or take corrective action: \code apop_data *the_data = apop_query_to_data("select * from d"); Apop_stopif(!the_data || the_data->error, exit(1), 0, "Trouble querying the data"); \endcode The whole structure Here is a diagram of all of Apophenia's structures and how they relate. It is taken from this cheat sheet on general C and SQL use (2 page PDF). 
\image html http://apophenia.info/structs.png width="100%" \image latex ../structs.png width=18cm All of the elements of the \ref apop_data structure are laid out at middle-left. You have already met the vector, matrix, weights, and text grid. The diagram shows the \ref apop_name structure, which has received little mention so far because names basically take care of themselves. A query will bring in column names (and row names if you set apop_opts.db_name_column), or use \ref apop_data_add_names to add names to your data set and \ref apop_name_stack to copy from one data set to another. The \ref apop_data structure has a \c more element, for when your data is best expressed in more than one page of data. Use \ref apop_data_add_page, \ref apop_data_rm_page, and \ref apop_data_get_page. Output routines will sometimes append an extra page of auxiliary information to a data set, such as pages named \<Covariance\> or \<Info\>. The angle-brackets indicate a page that describes the data set but is not a part of it (so an MLE search would ignore that page, for example). Now let us move up the structure diagram to the \ref apop_model structure. \section gentle_model apop_model Even restricting ourselves to the most basic operations, there are a lot of things that we want to do with our models: use a data set to estimate the parameters of a model (like the mean and variance of a Normal distribution), or draw random numbers, or show the expected value, or show the expected value of one part of the data given fixed values for the rest of it. The \ref apop_model is intended to encapsulate most of these desires into one object, so that models can easily be swapped around, modified to create new models, compared, and so on. From the figure above, you can see that the \ref apop_model structure includes a number of informational items, key being the \c parameters, \c data, and \c info elements; a list of settings to be discussed below; and a set of procedures for many operations.
Its contents are not (entirely) arbitrary: the overall intent and the theoretical basis for what is and is not included in an \ref apop_model are described in this U.S. Census Bureau research report. There are helper functions that will allow you to avoid dealing with the model internals. For example, the \ref apop_estimate helper function means you never have to look at the model's \c estimate method (if it even has one), and you will simply pass the model to a function, as with the above form: \code apop_model *est = apop_estimate(data, apop_ols); \endcode \li Apophenia ships with a broad set of models, like \ref apop_ols, \ref apop_dirichlet, \ref apop_loess, and \ref apop_pmf (probability mass function); see the full list on the models documentation page. You would fit any of them using \ref apop_estimate call, with the appropriate model as the second input. \li The models that ship with Apophenia, like \ref apop_ols, include the procedures and some metadata, but are of course not yet estimated using a data set (i.e., data == NULL, parameters == NULL). The line above generated a new model, \c est, which is identical to the base OLS model but has estimated parameters (and covariances, and basic hypothesis tests, a log likelihood, \f$AIC_c\f$, \f$BIC\f$, et cetera), and a \c data pointer to the \ref apop_data set used for estimation. \li You will mostly use the models by passing them as inputs to functions like \ref apop_estimate, \ref apop_draw, or \ref apop_predict; more examples below. Other than \ref apop_estimate, most require a parameterized model like \c est. After all, it doesn't make sense to draw from a Normal distribution until its mean and standard deviation are specified. \li If you know what the parameters should be, for most models use \ref apop_model_set_parameters. E.g. 
\code
apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1);
apop_data *a_thousand_normals = apop_model_draws(std_normal, 1000);

apop_model *poisson = apop_model_set_parameters(apop_poisson, 1.5);
apop_data *a_thousand_waits = apop_model_draws(poisson, 1000);
\endcode
\li You can use \ref apop_model_print to print the various elements to screen.
\li You can combine and transform models with functions such as \ref apop_model_fix_params, \ref apop_model_coordinate_transform, or \ref apop_model_mixture. Each of these functions produces a new model, which can be estimated, re-combined, or otherwise used like any other model.
\code
//A helper function to check whether a data point is strictly positive
double over_zero(apop_data *in, apop_model *m){ return apop_data_get(in) > 0; }

//Generate a truncated Normal distribution by adding a data constraint:
apop_model *truncated_normal= apop_model_dconstrain(.base_model=apop_normal, .constraint=over_zero);

//Get the cross product of that and a free Normal.
apop_model *cross = apop_model_cross(apop_normal, truncated_normal);

//Given assumed data, estimate the parameters of the cross product.
apop_model *xest = apop_estimate(assumed_data, cross);

//Assuming more data, use the fitted cross product as the prior for a Normal distribution.
apop_model *posterior = apop_update(moredata, .prior=xest, .likelihood=apop_normal);

//Assuming more data, use the posterior as the prior for another updating round.
apop_model *post2 = apop_update(moredata2, .prior=posterior, .likelihood=apop_normal);
\endcode
\li Writing your own models won't be covered in this introduction, but it can be easy to copy and modify the procedures of an existing model to fit your needs. When in doubt, delete a procedure, because any procedures that are missing will have defaults filled in when used by functions like \ref apop_estimate (which uses \ref apop_maximum_likelihood) or \ref apop_cdf (which uses integration via random draws).
See \ref modeldetails for details. \li There's a simple rule of thumb for remembering the order of the arguments to most of Apophenia's functions, including \ref apop_estimate : the data always comes first. Settings How many bins are in a histogram? At what tolerance does the maximum likelihood search end? What are the models being combined in an \ref apop_mixture distribution? Apophenia organizes settings in settings groups, which are then attached to models. In the following snippet demonstrating Bayesian updating, we specify a Beta distribution prior. If the likelihood function were a Binomial distribution, \ref apop_update knows the closed-form posterior for a Beta-Binomial pair, but in this case, with a PMF as a likelihood, it will have to run Markov chain Monte Carlo. The \ref apop_mcmc_settings group attached to the prior specifies details of how the run should work. For a likelihood, we generate an empirical distribution---a PMF---from an input data set, via apop_estimate(your_data, apop_pmf). When we call \ref apop_update on the last line, it already has all of the above info on hand. \code apop_model *beta = apop_model_set_parameters(apop_beta, 0.5, 0.25); Apop_settings_add_group(beta, apop_mcmc, .burnin = 0.2, .periods =1e5); apop_model *my_pmf = apop_estimate(your_data, apop_pmf); apop_model *posterior = apop_update(.prior= beta, .likelihood = my_pmf); \endcode Databases and models Returning to the introductory example, you saw that (1) the library expects you to keep your data in a database, pulling out the data as needed, and (2) that the workflow is built around \ref apop_model structures. Starting with (2), if a stats package has something called a model, then it is probably of the form Y = [an additive function of X], such as \f$y = x_1 + \log(x_2) + x_3^2\f$. Trying new models means trying different functional forms for the right-hand side, such as including \f$x_1\f$ in some cases and excluding it in others. 
Conversely, Apophenia is designed to facilitate trying new models in the broader sense of switching out a linear model for a hierarchical, or a Bayesian model for a simulation. A formula syntax makes little sense over such a broad range of models. As a result, the right-hand side is not part of the \ref apop_model. Instead, the data is assumed to be correctly formatted, scaled, or logged before being passed to the model. This is where part (1), the database, comes in, because it provides a proxy for the sort of formula specification language above: \code apop_data *testme= apop_query_to_data("select y, x1, log(x2), pow(x3, 2) from data"); apop_model *est = apop_estimate(testme, apop_ols); \endcode Generating factors and dummies is also considered data prep, not model internals. See \ref apop_data_to_dummies and \ref apop_data_to_factors. Now that you have \c est, an estimated model, you can interrogate it. This is where Apophenia and its encapsulated model objects shine, because you can do more than just admire the parameter estimates on the screen: you can take your estimated data set and fill in or generate new data, use it as an input to the parent distribution of a hierarchical model, et cetera. Some simple examples: \code //If you have a new data set with missing elements (represented by NaN), you can fill in predicted values: apop_predict(new_data_set, est); apop_data_print(new_data_set); //Fill a matrix with random draws. apop_data *d = apop_model_draws(est, .count=1000); //How does the AIC_c for this model compare to that of est2? printf("ΔAIC_c=%g\n", apop_data_get(est->info, .rowname="AIC_c") - apop_data_get(est2->info, .rowname="AIC_c")); \endcode \section gentle_end Conclusion This introduction has shown you the \ref apop_data set and some of the functions associated, which might be useful even if you aren't formally doing statistical work but do have to deal with data with real-world elements like column names and mixed numeric/text values. 
You've seen how Apophenia encapsulates many of a model's characteristics into a single \ref apop_model object, which you can send with data to functions like \ref apop_estimate, \ref apop_predict, or \ref apop_draw. Once you've got your data in the right form, you can use this to simply estimate model parameters, or as an input to later analysis. What's next? \li Check out the system for hypothesis testing, both with traditional known distributions (using \ref apop_test for dealing with Normal-, \f$t\f$-, \f$\chi^2\f$-distributed statistics); and for the parameters of any model; in \ref testpage. \li Try your own hand at putting new models into the \ref apop_model framework, as discussed in \ref modeldetails. \li For example, have a look at this blog and its subsequent posts, which wrap a microsimulation into an \ref apop_model, so that its parameters can be estimated and confidence intervals set around them. \li See the \ref maxipage page for discussion of the many features the optimization system has. It allows you to use a diverse set of search types on constrained or unconstrained models. \li Skim through the full list of macros and functions---there are hundreds---to get a sense of what else Apophenia offers. */ /** \page modeldetails Writing new models The \ref apop_model is intended to provide a consistent expression of any model that (implicitly or explicitly) expresses a likelihood of data given parameters, including traditional linear models, textbook distributions, Bayesian hierarchies, microsimulations, and any combination of the above. The unifying feature is that all of the models act over some data space and some parameter space (in some cases one or both is the empty set), and can assign a likelihood for a fixed pair of parameters and data given the model. This is a very broad requirement, often used in the statistical literature. For discussion of the theoretical structures, see A Useful Algebraic System of Statistical Models (PDF). 
This page is about writing new models from scratch, beginning with basic models and on up to models with arbitrary internal settings, specific methods of Bayesian updating using your model as a prior or likelihood, and so on. I assume you have already read \ref modelsec on using models and have tried a few things with the canned models that come with Apophenia, so you already know how a user handles basic estimation, adding a settings group, and so on. This page includes: \li \ref write_likelihoods of writing a new model from scratch. \li \ref settingswriting, covering the writing of ad hoc structures to hold model- or method-specific details, like the number of periods for burning in an MCMC run or the number of bins in a histogram. \li \ref vtables, covering the means of writing special-case routines for functions that are not part of the \ref apop_model itself, including the score or conjugate prior/likelihood pairs for \ref apop_update. \li \ref modeldataparts, a detailed list of the requirements for the non-function elements of an \ref apop_model. \li \ref methodsection, a detailed list of requirements for the function elements of an \ref apop_model. \section write_likelihoods A walkthrough Users are encouraged to always use models via the helper functions, like \ref apop_estimate or \ref apop_cdf. The helper functions do some boilerplate error checking, and call defaults as needed. For example, if your model has a \c log_likelihood method but no \c p method, then \ref apop_p will use exp(\c log_likelihood). If you don't give an \c estimate method, then \c apop_estimate will call \ref apop_maximum_likelihood. So the game in writing a new model is to write just enough internal methods to give the helper functions what they need. In the not-uncommon best case, all you need to do is write a log likelihood function or an RNG. Here is how one would set up a model that could be estimated using maximum likelihood: \li Write a likelihood function. 
Its header will look like this: \code long double new_log_likelihood(apop_data *data, apop_model *m); \endcode where \c data is the input data, and \c m is the parametrized model (i.e. your model with a \c parameters element already filled in by the caller). This function will return the value of the log likelihood function at the given parameters. \li Write the object: \code apop_model *your_new_model = &(apop_model){"The Me distribution", .vsize=n0, .msize1=n1, .msize2=n2, .dsize=nd, .log_likelihood = new_log_likelihood }; \endcode \li The first element is the .name, a human-language name for your model. \li The \c vsize, \c msize1, and \c msize2 elements specify the shape of the parameter set. For example, if there are three numbers in the vector, then set .vsize=3 and omit the matrix sizes. The default model prep routine will call new_est->parameters = apop_data_alloc(vsize, msize1, msize2). \li The \c dsize element is the size of one random draw from your model. \li It's common to have [the number of columns in your data set] parameters; this count will be filled in if you specify \c -1 for \c vsize, msize(1|2), or dsize. If the allocation is exceptional in a different way, then you will need to allocate parameters by writing a custom \c prep method for the model. \li Is this a constrained optimization? Add a .constraint element for those too. See \ref constr for more. You already have more than enough that something like this will work (the \c dsize is used for random draws): \code apop_model *estimated = apop_estimate(your_data, your_new_model); \endcode Once that baseline works, you can fill in other elements of the \ref apop_model as needed. 
For example, if you are using a maximum likelihood method to estimate parameters, you can get much faster estimates and better covariance estimates by specifying the dlog likelihood function (aka the score):

\code
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
    //do algebra here to find the partial derivatives df/dp0, df/dp1, df/dp2, ...
    //and store them in d_0, d_1, ...
    gsl_vector_set(gradient, 0, d_0);
    gsl_vector_set(gradient, 1, d_1);
}
\endcode

The score is not part of the model object, but is registered (see below) using

\code
apop_score_insert(apop_new_dlog_likelihood, your_new_model);
\endcode

\subsection On Threading

Many procedures in Apophenia use OpenMP to thread operations, so assume your functions are running in a threaded environment. If a method cannot be threaded, wrap it in an OpenMP critical region. E.g.,

\code
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
    #pragma omp critical (newdlog)
    {
        //un-threadable algebra here
    }
    gsl_vector_set(gradient, 0, d_0);
    gsl_vector_set(gradient, 1, d_1);
}
\endcode

\section settingswriting Writing new settings groups

Your model may need additional settings or auxiliary information to function, which would require associating a model-specific struct with the model. A method associated with a model that uses such a struct usually begins with calls like

\code
long double ysg_ll(apop_data *d, apop_model *m){
    ysg_settings *sets = apop_settings_get_group(m, ysg);
    ...
}
\endcode

These model-specific structs are copied and freed along with the model by \ref apop_model_copy and \ref apop_model_free, and many functions that modify or transform an \ref apop_model try to carry settings groups along as well. This section describes how to build a settings group so all these automatic steps happen as expected, and your methods can reliably retrieve settings as needed.
But before getting into the detail of how to make model-specific groups of settings work, note that there's a lightweight method of storing sundry settings, so in many cases you can bypass all of the following. The \ref apop_model structure has a \c void pointer named \c more which you can use to point to a model-specific struct. If \c more_size is larger than zero (i.e., you set your_model.more_size=sizeof(your_struct)), then it will be copied via \c memcpy by \ref apop_model_copy, and freed by \ref apop_model_free. Apophenia's routines will never impinge on this item, so do what you wish with it.

The remainder of this section describes the information you'll have to provide to make use of the conveniences described to this point: initialization of defaults, smarter copying and freeing, and adding to an arbitrarily long list of settings groups attached to a model. You will need four items: a typedef for the structure itself, plus init, copy, and free functions. This is the sort of boilerplate that will be familiar to users of object-oriented languages in the style of C++ or Java, but it's really a list of arbitrarily-typed elements, which makes this feel more like LISP. [And being a reimplementation of an existing feature of LISP, this section will be macro-heavy.]

The settings struct will likely go into a header file, so here is a sample header for a new settings group named \c ysg_settings, with a dataset, two sizes, and a reference counter. ysg stands for Your Settings Group; replace that substring with your preferred name in every instance to follow.

\code
typedef struct {
    int size1, size2;
    int *refs;
    apop_data *dataset;
} ysg_settings;

Apop_settings_declarations(ysg)
\endcode

The first item is a familiar structure definition. The last line is a macro that declares the init, copy, and free functions discussed below. This is everything you would need in a header file, should you need one.
These are just declarations; we'll write the actual init/copy/free functions below.

The structure itself gets the full name, \c ysg_settings. Everything else is a macro keyed on \c ysg, without the \c _settings part. Because of these macros, your struct name must end in \c _settings.

If you have an especially simple structure, then you can generate the three functions with these three macros in your .c file:

\code
Apop_settings_init(ysg, )
Apop_settings_copy(ysg, )
Apop_settings_free(ysg, )
\endcode

These macros generate appropriate functions to do what you'd expect: allocating the main structure, copying one struct to another, freeing the main structure. The spaces after the commas indicate that in these cases no special code gets added to the functions that these macros expand into.

You'll never call the generated functions directly; they are called by \ref Apop_settings_add_group, \ref apop_model_free, and other model or settings-group handling functions.

Now that initializing/copying/freeing of the structure itself is handled, the remainder of this section will be about how to add instructions for the structure internals, like data that is pointed to by the structure elements.

\li For the allocate function, use the above form if everything in your code defaults to zero/\c NULL. Otherwise, you will need a new line declaring a default for every element in your structure. There is a macro to help with this too. The init macro defines for your use a structure named \c in, and an output pointer-to-struct named \c out. Continuing the above example:
\code
Apop_settings_init (ysg,
    Apop_stopif(!in.size1, return NULL, 0, "I need you to give me a value for size1.");
    Apop_varad_set(size2, 10);
    Apop_varad_set(dataset, apop_data_alloc(out->size1, out->size2));
    Apop_varad_set(refs, malloc(sizeof(int)));
    *out->refs = 1;
)
\endcode
Now, Apop_settings_add_group(a_model, ysg, .size1=100) would set up a group with a 100-by-10 data set, allocate the reference counter, and set it to one.
\li Some functions do extensive internal copying, so you will need a copy function even if your code has no explicit calls to \ref apop_model_copy. The default above simply copies every element in the structure. Pointers are copied, giving you two pointers pointing to the same data. We have to be careful to prevent double-freeing later.
\code
//The elements of the set to copy are all copied by the function's boilerplate;
//the one additional step is to increment the shared reference counter:
Apop_settings_copy (ysg,
    #pragma omp critical (ysg_refs)
    (*in->refs)++;
)
\endcode
\li The struct itself is freed by boilerplate code, but add code in the free function to free data pointed to by pointers in the main structure. The macro defines a pointer-to-struct named \c in for your use. Continuing the example:
\code
Apop_settings_free (ysg,
    #pragma omp critical (ysg_refs)
    if (!(--(*in->refs))) {
        apop_data_free(in->dataset);
        free(in->refs);
    }
)
\endcode

With those three macros in place and the header as above, Apophenia will treat your settings group like any other, and users can use \ref Apop_settings_add_group to populate it and attach it to any model.

\section vtables Registering new methods in vtables

The settings groups are for adding arbitrary model-specific nouns; vtables are for adding arbitrary model-specific verbs.

Many functions (e.g., entropy, the dlog likelihood, Bayesian updating) have special cases for well-known models like the Normal distribution. Any function may maintain a registry of models and associated special-case procedures, aka a vtable.

Lookups happen based on a hash that takes into account the elements of the model that will be used in the calculation. For example, the \c apop_update_hash takes in two models and calculates the hash based on the address of the prior's \c draw method and the likelihood's \c log_likelihood or \c p method. Thus, a vtable lookup for new models that re-use the same methods (at the same addresses in memory) will still find the same special-case function.
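
The idea of keying a registry on method addresses can be sketched in a few lines of plain C. Everything named below (\c registry_add, \c registry_get, \c registry_drop, and the placeholder methods) is invented for this illustration and is not Apophenia's implementation; in particular, the sketch compares the two addresses directly where the library combines them into a hash.

```c
#include <stddef.h>

typedef void (*method_fn)(void);        // stand-in for a model method
typedef double (*special_fn)(double);   // stand-in for a special-case routine

// One registry row: the two method addresses are the lookup key.
typedef struct { method_fn draw, ll; special_fn fn; int used; } registry_row;

#define REGISTRY_MAX 16
static registry_row registry[REGISTRY_MAX];

// Return the row for this pair of methods, or NULL if none is registered.
static registry_row *registry_find(method_fn draw, method_fn ll){
    for (size_t i = 0; i < REGISTRY_MAX; i++)
        if (registry[i].used && registry[i].draw == draw && registry[i].ll == ll)
            return &registry[i];
    return NULL;
}

// Register fn for the pair; re-adding overwrites. Returns 0 on success, -1 if full.
int registry_add(special_fn fn, method_fn draw, method_fn ll){
    registry_row *row = registry_find(draw, ll);
    for (size_t i = 0; !row && i < REGISTRY_MAX; i++)
        if (!registry[i].used) row = &registry[i];
    if (!row) return -1;
    *row = (registry_row){.draw = draw, .ll = ll, .fn = fn, .used = 1};
    return 0;
}

// Look up the special-case routine; NULL means "fall back to the default".
special_fn registry_get(method_fn draw, method_fn ll){
    registry_row *row = registry_find(draw, ll);
    return row ? row->fn : NULL;
}

// Remove the registration, if any.
void registry_drop(method_fn draw, method_fn ll){
    registry_row *row = registry_find(draw, ll);
    if (row) row->used = 0;
}

// Placeholder methods and a placeholder closed-form routine for the example.
void beta_draw(void){}
void binomial_ll(void){}
double closed_form(double x){ return 2*x; }
```

After registry_add(closed_form, beta_draw, binomial_ll), any model whose method pointers hold those same two addresses gets \c closed_form back from \c registry_get, no matter how its other elements differ. The sketch is not thread-safe; a shared registry in threaded code would need a critical region around each access.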
If you need to deregister the function, use the associated deregister function, e.g. apop_update_vtable_drop(apop_beta, apop_binomial). You can guarantee that a method will not be re-added by following up the _drop with, e.g., apop_update_vtable_add(NULL, apop_beta, apop_binomial).

The steps for adding a function to an existing vtable:

\li See \ref apop_update, \ref apop_score, \ref apop_predict, \ref apop_model_print, and \ref apop_parameter_model for examples and procedure-specific details.
\li Write a function following the given type definition, as listed in the function's documentation.
\li Use the associated _vtable_add function to add the function and associate it with the given model. For example, to add a Beta-binomial routine named \c betabinom to the registry of Bayesian updating routines, use apop_update_vtable_add(betabinom, apop_beta, apop_binomial).
\li Place a call to ..._vtable_add in the \c prep method of the given model, thus ensuring that the auxiliary functions are registered after the first time the model is sent to \ref apop_estimate.

The easiest way to set up a new vtable is to copy/paste/modify an existing one. Briefly:

\li See the existing setups in the vtables portion of apop.h.
\li Copy/paste one and do a search and replace to change the name to match your desired use.
\li Set the typedef to describe the functions that get added to the vtable.
\li Rewrite the hash function to check the part of the inputs that interests you. For example, the update vtable associates functions with the \c draw, \c log_likelihood, and \c p methods of the model. A model where these elements are identical will still match even if other elements are different.

\section modeldataparts The data elements

The remainder of this section covers the detailed expectations regarding the elements of the \ref apop_model structure. I begin with the data (non-function) elements, and then cover the method (function) elements.
Some of the following will be requirements for all models and some will be advice to authors; I use the accepted definitions of "must", "shall", "may" and related words. \subsection datasubsec Data \li Each row of the \c data element is treated as a single observation by many functions. For example, \ref apop_bootstrap_cov depends on each row being an iid observation to function correctly. Calculating the Bayesian Information Criterion (BIC) requires knowing the number of observations in the data, and assumes that row count==observation count. For complex data, the \ref apop_data_pack and \ref apop_data_unpack functions can help with this. \li Some functions (bootstrap again, or many uses of \ref apop_kl_divergence) use \ref apop_draw to use your model's RNG (or a default) to draw a value, write it to the matrix element of the data set, and then move on to an estimation or other step. In this case, the data sent in will be entirely in the \c ->matrix element of the \ref apop_data set sent to model methods. Your \c likelihood, \c p, \c cdf, and \c estimate routines must accept data as a single row of the matrix of the \ref apop_data set for such functions to work. They may accept other formats. Tip: you can use \ref apop_data_pack and \ref apop_data_unpack to convert a structured set to a single row and back again. \li Your routines may accept other data formats, as per contract with the user. For example, regression-type functions use a function named \c ols_shuffle to convert a matrix where the first column is the dependent variable to a data set with dependent variable in the vector and a column of ones in the first matrix column. \subsection paramsubsec Parameters, vsize, msize1, msize2 \li The sizes will be used by the \c prep method of the model; see below. Given the model \c m and its elements \c m.vsize, \c m.msize1, \c m.msize2, functions that need to allocate a parameter set will do so via apop_data_alloc(m.vsize, m.msize1, m.msize2). 
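
The size elements, together with the \c -1 convention described in the walkthrough and under \ref prepsubsection, can be summarized in a short sketch. The \c model_shape struct and both functions are hypothetical stand-ins written for this illustration; they are not part of Apophenia.

```c
#include <stddef.h>

// The four size elements of a model, gathered in one struct for the example.
typedef struct { int vsize, msize1, msize2, dsize; } model_shape;

// Replace each -1 with the number of columns in the input data.
model_shape shape_resolve(model_shape in, int data_columns){
    model_shape out = in;
    if (out.vsize  == -1) out.vsize  = data_columns;
    if (out.msize1 == -1) out.msize1 = data_columns;
    if (out.msize2 == -1) out.msize2 = data_columns;
    if (out.dsize  == -1) out.dsize  = data_columns;
    return out;
}

// Count the scalars in a parameter set holding a vector of length vsize
// and an msize1-by-msize2 matrix. Call this on a resolved shape.
size_t shape_param_count(model_shape s){
    return (size_t)s.vsize + (size_t)s.msize1 * (size_t)s.msize2;
}
```

For a data set with four columns, a model declared with .vsize=-1 and no matrix sizes resolves to a four-element parameter vector.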
\subsection infosubsec Info

\li The first page, which should be named \c \<Info\>, is typically a list of scalars. Nothing is guaranteed, but the elements may include:
    \li AIC: Akaike Information Criterion
    \li AIC_c: AIC with a finite sample correction. ``Generally, we advocate the use of AIC_c when the ratio \f$n/K\f$ is small (say \f$<\f$ 40)'' [Kenneth P. Burnham, David R. Anderson: Model Selection and Multi-Model Inference, p 66, emphasis in original.]
    \li BIC: Bayesian Information Criterion
    \li R squared
    \li R squared adj
    \li log likelihood
    \li status [0=OK, nonzero=other].

For those elements that require a count of input data, the calculations assume each row in the input \ref apop_data set is a single datum.

Get these via, e.g., apop_data_get(your_model->info, .rowname="log likelihood"). When writing for any arbitrary function, be prepared to handle \c NaN, indicating that the element is not calculated or saved in the info page by the given model.

For OLS-type estimations, each row corresponds to the row in the original data. For filling in of missing data, the elements may appear anywhere, so the row/col indices are essential.

\subsection settingsgroupmention settings, more

In object-oriented jargon, settings groups are the private elements of the data set, to be pulled out in certain contexts, and ignored in all others. Therefore, there are no rules about internal use. The \c more element of the \ref apop_model provides a lightweight means of attaching an arbitrary struct to a model. See \ref settingswriting above for details.

\li As many settings groups of different types as desired can be added to a single \ref apop_model.
\li One \ref apop_model cannot hold two settings groups of the same type. Re-additions cause the removal of the previous version of the group.
\li If the \c more pointer points to a structure or value (let it be \c ss), then \c more_size must be set to sizeof(ss).
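
The contract for \c more and \c more_size can be sketched in plain C. The \c toy_model type, its copy and free functions, and the \c my_extras struct are stand-ins invented for this illustration, not Apophenia's own code; they only mirror the behavior described in \ref settingswriting, where a positive \c more_size means the attached struct is copied with \c memcpy and freed along with the model.

```c
#include <stdlib.h>
#include <string.h>

// A toy stand-in for a model that carries an arbitrary attached struct.
typedef struct {
    const char *name;
    void *more;        // model-specific struct
    size_t more_size;  // size of what more points to; 0 means "do not copy or free"
} toy_model;

// Copy the model; the attached struct is duplicated byte for byte.
toy_model *toy_model_copy(const toy_model *in){
    toy_model *out = malloc(sizeof *out);
    if (!out) return NULL;
    *out = *in;
    if (in->more && in->more_size > 0){
        out->more = malloc(in->more_size);
        if (!out->more){ free(out); return NULL; }
        memcpy(out->more, in->more, in->more_size);
    }
    return out;
}

// Free the model, and the attached struct only if more_size says the model owns it.
void toy_model_free(toy_model *m){
    if (!m) return;
    if (m->more_size > 0) free(m->more);
    free(m);
}

// An example of a small model-specific struct.
typedef struct { int bins; double tolerance; } my_extras;
```

Because the copy is a flat \c memcpy, any pointers inside the attached struct end up shared between the original and the copy; that is the situation where a full settings group, with its own copy and free functions, is the better fit.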
\section methodsection Methods

\subsection psubsection p, log_likelihood

\li Function headers look like long double your_p_or_ll(apop_data *d, apop_model *params).
\li The inputs are an \ref apop_data set and an \ref apop_model, which should include the elements needed to fully estimate the probability/likelihood (probably a filled ->parameters element, possibly a settings group added by the user).
\li Assume that the parameters have been set, by users via \ref apop_estimate or \ref apop_model_set_parameters, or by the search algorithms of \ref apop_maximum_likelihood. If the parameters are necessary, the function shall check that the parameters are not \c NULL and set the model's \c error element to \c 'p' if they are missing.
\li Return \c NaN on errors. If an error in the input model is found, the function may set the input model's \c error element to an appropriate \c char value.
\li If your model includes both \c log_likelihood and \c p methods, it must be the case that log(p(d, m)) equals log_likelihood(d, m) for all \c d and \c m. This implies that \c p must return a value \f$\geq 0\f$. Note that \ref apop_maximum_likelihood will accept functions where \c p returns a negative value, but diagnostics that depend on log likelihood like AIC will return NaN.
\li If observations are assumed to be iid, you may be able to use \ref apop_map_sum to write the core of the log likelihood function.

\subsection prepsubsection prep

\li Function header looks like void your_prep(apop_data *data, apop_model *params).
\li Re-prepping a model after it has already been prepped shall have no effect. Where there is ambiguity with the other requirements, this takes precedence.
\li The model's data pointer shall be set to point to the input data.
\li The \c info element shall be allocated and its title set to \c \<Info\>.
\li If \c vsize, \c msize1, or \c msize2 are -1, then the prep function shall set them to the width of the input data.
\li If \c dsize is -1, then the prep function shall set it to the width of the input data.
\li If the \c parameters element is not allocated, the function shall allocate it via apop_data_alloc(vsize, msize1, msize2) (or equivalent).
\li The default is \ref apop_model_clear. It does all of the above.
\li The input data may be modified by the prep routine. For example, the \ref apop_ols prep routine shuffles a single input matrix as described above under \c data, and the \ref apop_pmf prep routine calls \ref apop_data_pmf_compress on the input data.
\li The prep routine may initialize any desired settings groups. Unless otherwise stated, these should not be removed if they are already there, so that users can override defaults by adding a settings group before starting an estimation.
\li If any functions associated with the model need to be added to a vtable (see above), the registration shall happen here. Registration may also happen elsewhere.

\subsection estimatesubsection estimate

\li Function header looks like void your_estimate(apop_data *data, apop_model *params). It modifies the input model, and returns nothing. Note that this is different from the wrapper function, \ref apop_estimate, which makes a copy of its input model, preps it, and then calls the \c estimate function with the prepped copy.
\li Assume that the prep routine has already been run. Notably, this means that parameters have been allocated.
\li Assume that the \c parameters hold garbage (as in a \c malloc without a subsequent assignment to the malloc-ed space).
\li The function shall set the \c parameters of the input model. For consistency with other models, the estimate should be the maximum likelihood estimate, unless otherwise documented.
\li Additional settings may be set.
\li The model's \c \<Info\> page may be filled with statistics, as discussed at \ref infosubsec. For scalars like log likelihood and AIC, use \ref apop_data_add_named_elmt.
\li Data should not be modified by the \c estimate routine; any changes to the data made by \c estimate must be documented.
\li The default called by \ref apop_estimate is \ref apop_maximum_likelihood.
\li If errors occur during processing, set the model's \c error element to a single character. Documentation should include the list of error characters and their meaning.

\subsection drawsubsection draw

\li Function header looks like void your_draw(double *out, gsl_rng* r, apop_model *params).
\li Assume that model \c parameters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the draw method should check that \c parameters are not \c NULL if needed and fill the output with NaNs if necessary parameters are not set.
\li Caller inputs a pointer-to-double of length \c dsize; user is expected to make sure that there is adequate space. Caller also inputs a \c gsl_rng, already allocated (probably via \ref apop_rng_alloc, possibly from \ref apop_rng_get_thread).
\li The function shall fill the space pointed to by the input pointer with a random draw from the data space, where the likelihood of any given observation is proportional to its likelihood as given by the \c p method. Data shall be reduced to a single vector via \ref apop_data_pack if it is not already a single vector.

\subsection cdfsubsection cdf

\li Function header looks like long double your_cdf(apop_data *d, apop_model *params).
\li Assume that \c parameters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the CDF method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li The CDF method must accept data as a single row of data in the \c matrix of the input \ref apop_data set (as per a draw produced using the \c draw method). It may accept other formats.
\li Returns the percentage of the likelihood function \f$\leq\f$ the first row of the input data.
The definition of \f$\leq\f$ is chosen by the model author.
\li If one is not already present, an \c apop_cdf_settings group may be added to the model to store temp data. See the \ref apop_cdf function for details.

\subsection constraintsubsection constraint

\li Function header looks like long double your_constraint(apop_data *data, apop_model *params).
\li Assume that \c parameters are set, via \ref apop_estimate, \ref apop_model_set_parameters, or the internals of an MLE search. The author of the constraint method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li See \ref apop_linear_constraint for a useful basis and/or example. Many constraints can be written as wrappers for this function.
\li If the constraint is met, then return zero.
\li If the constraint fails, then (1) move the \c parameters in the input model to a constraint-satisfying value, and (2) return the distance between the input parameters and what you've moved the parameters to. The choice of within-bounds parameters and distance function is left to the author of the constraint function.
*/

/**\defgroup models

This section is a detailed description of the stock models that ship with Apophenia. It is a reference. For an explanation of what to do with an \ref apop_model, see \ref modelsec.

The primary questions one has about a model in practice are what format the input data should take and what to expect of an estimated output.

Generally, the input data consists of an \ref apop_data set where each row is a single observation. Details beyond that are listed below.

The output after running \ref apop_estimate to produce a fitted model is generally found in one of three places: the vector of the output parameter set, its matrix, or a new settings group.
The basic intuition is that if the parameters are always a short list of scalars, they are in the vector; if there exists a situation where they could take matrix form, the parameters will be in the matrix; if they require more structure than that, they will be in a settings group.

If the basic structure of the \ref apop_data set is unfamiliar to you, see \ref dataoverview, which will discuss the basic means of getting data out of a struct. For example, the estimated \ref apop_normal distribution has the mean in position zero of the vector and the standard deviation in position one, so they could be extracted as follows:

\code
apop_data *d = apop_text_to_data("sample data from before");
apop_model *out = apop_estimate(d, apop_normal);
double mu = apop_data_get(out->parameters, 0);
double sigma = apop_data_get(out->parameters, 1);

//What is the p-value of a test whose null hypothesis is that μ=3.3?
printf("pval=%g\n", apop_test(3.3, "normal", mu, sigma));
\endcode

See \ref modelsec for discussion of how to pull settings groups using \ref Apop_settings_get (for one item) or \ref apop_settings_get_group (for a full settings group).
*/
apophenia-1.0+ds/docs/down.png000066400000000000000000000005671262736346100163470ustar00rootroot00000000000000PNG  IHDR  &sRGBbKGD̿ pHYs  tIME2dGIDAT}J`OX"vܜA鬫+ ,Ί,}Q,J6ZbiZgq0""&OF3" L*BAF 6xͯ*At 'IׯwgbT-2 1K`N܆M!# ᩳRID% fh̞YBXxjW^#p8St,{q cr IENDB`apophenia-1.0+ds/docs/doxygen.conf.in000066400000000000000000003114021262736346100176140ustar00rootroot00000000000000# Doxyfile 1.8.9.1 # This file describes the settings to be used by the documentation system # doxygen (www.doxygen.org) for a project. # # All text after a double hash (##) is considered a comment and is placed in # front of the TAG it is preceding. # # All text after a single hash (#) is considered a comment and will be ignored. # The format is: # TAG = value [value, ...] # For lists, items can also be appended using: # TAG += value [value, ...]
# Values that contain spaces should be placed between quotes (\" \"). #--------------------------------------------------------------------------- # Project related configuration options #--------------------------------------------------------------------------- # This tag specifies the encoding used for all characters in the config file # that follow. The default is UTF-8 which is also the encoding used for all text # before the first occurrence of this tag. Doxygen uses libiconv (or the iconv # built into libc) for the transcoding. See http://www.gnu.org/software/libiconv # for the list of possible encodings. # The default value is: UTF-8. DOXYFILE_ENCODING = UTF-8 # The PROJECT_NAME tag is a single word (or a sequence of words surrounded by # double-quotes, unless you are using Doxywizard) that should identify the # project for which the documentation is generated. This name is used in the # title of most generated pages and in a few other places. # The default value is: My Project. PROJECT_NAME = Apophenia # The PROJECT_NUMBER tag can be used to enter a project or revision number. This # could be handy for archiving the generated documentation or if some version # control system is used. PROJECT_NUMBER = # Using the PROJECT_BRIEF tag one can provide an optional one line description # for a project that appears at the top of each page and should give viewer a # quick idea about the purpose of the project. Keep the description short. PROJECT_BRIEF = # With the PROJECT_LOGO tag one can specify a logo or an icon that is included # in the documentation. The maximum height of the logo should not exceed 55 # pixels and the maximum width should not exceed 200 pixels. Doxygen will copy # the logo to the output directory. PROJECT_LOGO = # The OUTPUT_DIRECTORY tag is used to specify the (relative or absolute) path # into which the generated documentation will be written. If a relative path is # entered, it will be relative to the location where doxygen was started. 
If # left blank the current directory will be used. OUTPUT_DIRECTORY = # If the CREATE_SUBDIRS tag is set to YES then doxygen will create 4096 sub- # directories (in 2 levels) under the output directory of each output format and # will distribute the generated files over these directories. Enabling this # option can be useful when feeding doxygen a huge amount of source files, where # putting all generated files in the same directory would otherwise causes # performance problems for the file system. # The default value is: NO. CREATE_SUBDIRS = NO # If the ALLOW_UNICODE_NAMES tag is set to YES, doxygen will allow non-ASCII # characters to appear in the names of generated files. If set to NO, non-ASCII # characters will be escaped, for example _xE3_x81_x84 will be used for Unicode # U+3044. # The default value is: NO. ALLOW_UNICODE_NAMES = YES # The OUTPUT_LANGUAGE tag is used to specify the language in which all # documentation generated by doxygen is written. Doxygen will use this # information to generate all constant output in the proper language. # Possible values are: Afrikaans, Arabic, Armenian, Brazilian, Catalan, Chinese, # Chinese-Traditional, Croatian, Czech, Danish, Dutch, English (United States), # Esperanto, Farsi (Persian), Finnish, French, German, Greek, Hungarian, # Indonesian, Italian, Japanese, Japanese-en (Japanese with English messages), # Korean, Korean-en (Korean with English messages), Latvian, Lithuanian, # Macedonian, Norwegian, Persian (Farsi), Polish, Portuguese, Romanian, Russian, # Serbian, Serbian-Cyrillic, Slovak, Slovene, Spanish, Swedish, Turkish, # Ukrainian and Vietnamese. # The default value is: English. OUTPUT_LANGUAGE = English # If the BRIEF_MEMBER_DESC tag is set to YES, doxygen will include brief member # descriptions after the members that are listed in the file and class # documentation (similar to Javadoc). Set to NO to disable this. # The default value is: YES. 
BRIEF_MEMBER_DESC = YES # If the REPEAT_BRIEF tag is set to YES, doxygen will prepend the brief # description of a member or function before the detailed description # # Note: If both HIDE_UNDOC_MEMBERS and BRIEF_MEMBER_DESC are set to NO, the # brief descriptions will be completely suppressed. # The default value is: YES. REPEAT_BRIEF = YES # This tag implements a quasi-intelligent brief description abbreviator that is # used to form the text in various listings. Each string in this list, if found # as the leading text of the brief description, will be stripped from the text # and the result, after processing the whole list, is used as the annotated # text. Otherwise, the brief description is used as-is. If left blank, the # following values are used ($name is automatically replaced with the name of # the entity):The $name class, The $name widget, The $name file, is, provides, # specifies, contains, represents, a, an and the. ABBREVIATE_BRIEF = # If the ALWAYS_DETAILED_SEC and REPEAT_BRIEF tags are both set to YES then # doxygen will generate a detailed section even if there is only a brief # description. # The default value is: NO. ALWAYS_DETAILED_SEC = NO # If the INLINE_INHERITED_MEMB tag is set to YES, doxygen will show all # inherited members of a class in the documentation of that class as if those # members were ordinary class members. Constructors, destructors and assignment # operators of the base classes will not be shown. # The default value is: NO. INLINE_INHERITED_MEMB = NO # If the FULL_PATH_NAMES tag is set to YES, doxygen will prepend the full path # before files name in the file list and in the header files. If set to NO the # shortest path that makes the file name unique will be used # The default value is: YES. FULL_PATH_NAMES = NO # The STRIP_FROM_PATH tag can be used to strip a user-defined part of the path. # Stripping is only done if one of the specified strings matches the left-hand # part of the path. 
The tag can be used to show relative paths in the file list. # If left blank the directory from which doxygen is run is used as the path to # strip. # # Note that you can specify absolute paths here, but also relative paths, which # will be relative from the directory where doxygen is started. # This tag requires that the tag FULL_PATH_NAMES is set to YES. STRIP_FROM_PATH = # The STRIP_FROM_INC_PATH tag can be used to strip a user-defined part of the # path mentioned in the documentation of a class, which tells the reader which # header file to include in order to use a class. If left blank only the name of # the header file containing the class definition is used. Otherwise one should # specify the list of include paths that are normally passed to the compiler # using the -I flag. STRIP_FROM_INC_PATH = # If the SHORT_NAMES tag is set to YES, doxygen will generate much shorter (but # less readable) file names. This can be useful is your file systems doesn't # support long names like on DOS, Mac, or CD-ROM. # The default value is: NO. SHORT_NAMES = NO # If the JAVADOC_AUTOBRIEF tag is set to YES then doxygen will interpret the # first line (until the first dot) of a Javadoc-style comment as the brief # description. If set to NO, the Javadoc-style will behave just like regular Qt- # style comments (thus requiring an explicit @brief command for a brief # description.) # The default value is: NO. JAVADOC_AUTOBRIEF = NO # If the QT_AUTOBRIEF tag is set to YES then doxygen will interpret the first # line (until the first dot) of a Qt-style comment as the brief description. If # set to NO, the Qt-style will behave just like regular Qt-style comments (thus # requiring an explicit \brief command for a brief description.) # The default value is: NO. QT_AUTOBRIEF = NO # The MULTILINE_CPP_IS_BRIEF tag can be set to YES to make doxygen treat a # multi-line C++ special comment block (i.e. a block of //! or /// comments) as # a brief description. 
This used to be the default behavior. The new default is # to treat a multi-line C++ comment block as a detailed description. Set this # tag to YES if you prefer the old behavior instead. # # Note that setting this tag to YES also means that rational rose comments are # not recognized any more. # The default value is: NO. MULTILINE_CPP_IS_BRIEF = NO # If the INHERIT_DOCS tag is set to YES then an undocumented member inherits the # documentation from any documented member that it re-implements. # The default value is: YES. INHERIT_DOCS = YES # If the SEPARATE_MEMBER_PAGES tag is set to YES then doxygen will produce a new # page for each member. If set to NO, the documentation of a member will be part # of the file/class/namespace that contains it. # The default value is: NO. SEPARATE_MEMBER_PAGES = NO # The TAB_SIZE tag can be used to set the number of spaces in a tab. Doxygen # uses this value to replace tabs by spaces in code fragments. # Minimum value: 1, maximum value: 16, default value: 4. TAB_SIZE = 5 # This tag can be used to specify a number of aliases that act as commands in # the documentation. An alias has the form: # name=value # For example adding # "sideeffect=@par Side Effects:\n" # will allow you to put the command \sideeffect (or @sideeffect) in the # documentation, which will result in a user-defined paragraph with heading # "Side Effects:". You can put \n's in the value part of an alias to insert # newlines. ALIASES = # This tag can be used to specify a number of word-keyword mappings (TCL only). # A mapping has the form "name=value". For example adding "class=itcl::class" # will allow you to use the command class in the itcl::class meaning. TCL_SUBST = # Set the OPTIMIZE_OUTPUT_FOR_C tag to YES if your project consists of C sources # only. Doxygen will then generate output that is more tailored for C. For # instance, some of the names that are used will be different. The list of all # members will be omitted, etc. # The default value is: NO. 
OPTIMIZE_OUTPUT_FOR_C = YES # Set the OPTIMIZE_OUTPUT_JAVA tag to YES if your project consists of Java or # Python sources only. Doxygen will then generate output that is more tailored # for that language. For instance, namespaces will be presented as packages, # qualified scopes will look different, etc. # The default value is: NO. OPTIMIZE_OUTPUT_JAVA = NO # Set the OPTIMIZE_FOR_FORTRAN tag to YES if your project consists of Fortran # sources. Doxygen will then generate output that is tailored for Fortran. # The default value is: NO. OPTIMIZE_FOR_FORTRAN = NO # Set the OPTIMIZE_OUTPUT_VHDL tag to YES if your project consists of VHDL # sources. Doxygen will then generate output that is tailored for VHDL. # The default value is: NO. OPTIMIZE_OUTPUT_VHDL = NO # Doxygen selects the parser to use depending on the extension of the files it # parses. With this tag you can assign which parser to use for a given # extension. Doxygen has a built-in mapping, but you can override or extend it # using this tag. The format is ext=language, where ext is a file extension, and # language is one of the parsers supported by doxygen: IDL, Java, Javascript, # C#, C, C++, D, PHP, Objective-C, Python, Fortran (fixed format Fortran: # FortranFixed, free formatted Fortran: FortranFree, unknown formatted Fortran: # Fortran. In the later case the parser tries to guess whether the code is fixed # or free formatted code, this is the default for Fortran type files), VHDL. For # instance to make doxygen treat .inc files as Fortran files (default is PHP), # and .f files as C (default is Fortran), use: inc=Fortran f=C. # # Note: For files without extension you can use no_extension as a placeholder. # # Note that for custom extensions you also need to set FILE_PATTERNS otherwise # the files are not read by doxygen. EXTENSION_MAPPING = # If the MARKDOWN_SUPPORT tag is enabled then doxygen pre-processes all comments # according to the Markdown format, which allows for more readable # documentation. 
See http://daringfireball.net/projects/markdown/ for details. # The output of markdown processing is further processed by doxygen, so you can # mix doxygen, HTML, and XML commands with Markdown formatting. Disable only in # case of backward compatibilities issues. # The default value is: YES. MARKDOWN_SUPPORT = YES # When enabled doxygen tries to link words that correspond to documented # classes, or namespaces to their corresponding documentation. Such a link can # be prevented in individual cases by putting a % sign in front of the word or # globally by setting AUTOLINK_SUPPORT to NO. # The default value is: YES. AUTOLINK_SUPPORT = YES # If you use STL classes (i.e. std::string, std::vector, etc.) but do not want # to include (a tag file for) the STL sources as input, then you should set this # tag to YES in order to let doxygen match functions declarations and # definitions whose arguments contain STL classes (e.g. func(std::string); # versus func(std::string) {}). This also make the inheritance and collaboration # diagrams that involve STL classes more complete and accurate. # The default value is: NO. BUILTIN_STL_SUPPORT = NO # If you use Microsoft's C++/CLI language, you should set this option to YES to # enable parsing support. # The default value is: NO. CPP_CLI_SUPPORT = NO # Set the SIP_SUPPORT tag to YES if your project consists of sip (see: # http://www.riverbankcomputing.co.uk/software/sip/intro) sources only. Doxygen # will parse them like normal C++ but will assume all classes use public instead # of private inheritance when no explicit protection keyword is present. # The default value is: NO. SIP_SUPPORT = NO # For Microsoft's IDL there are propget and propput attributes to indicate # getter and setter methods for a property. Setting this option to YES will make # doxygen to replace the get and set methods by a property in the documentation. # This will only work if the methods are indeed getting or setting a simple # type. 
If this is not the case, or you want to show the methods anyway, you # should set this option to NO. # The default value is: YES. IDL_PROPERTY_SUPPORT = YES # If member grouping is used in the documentation and the DISTRIBUTE_GROUP_DOC # tag is set to YES then doxygen will reuse the documentation of the first # member in the group (if any) for the other members of the group. By default # all members of a group must be documented explicitly. # The default value is: NO. DISTRIBUTE_GROUP_DOC = NO # Set the SUBGROUPING tag to YES to allow class member groups of the same type # (for instance a group of public functions) to be put as a subgroup of that # type (e.g. under the Public Functions section). Set it to NO to prevent # subgrouping. Alternatively, this can be done per class using the # \nosubgrouping command. # The default value is: YES. SUBGROUPING = YES # When the INLINE_GROUPED_CLASSES tag is set to YES, classes, structs and unions # are shown inside the group in which they are included (e.g. using \ingroup) # instead of on a separate page (for HTML and Man pages) or section (for LaTeX # and RTF). # # Note that this feature does not work in combination with # SEPARATE_MEMBER_PAGES. # The default value is: NO. INLINE_GROUPED_CLASSES = NO # When the INLINE_SIMPLE_STRUCTS tag is set to YES, structs, classes, and unions # with only public data fields or simple typedef fields will be shown inline in # the documentation of the scope in which they are defined (i.e. file, # namespace, or group documentation), provided this scope is documented. If set # to NO, structs, classes, and unions are shown on a separate page (for HTML and # Man pages) or section (for LaTeX and RTF). # The default value is: NO. INLINE_SIMPLE_STRUCTS = NO # When TYPEDEF_HIDES_STRUCT tag is enabled, a typedef of a struct, union, or # enum is documented as struct, union, or enum with the name of the typedef. 
So # typedef struct TypeS {} TypeT, will appear in the documentation as a struct # with name TypeT. When disabled the typedef will appear as a member of a file, # namespace, or class. And the struct will be named TypeS. This can typically be # useful for C code in case the coding convention dictates that all compound # types are typedef'ed and only the typedef is referenced, never the tag name. # The default value is: NO. TYPEDEF_HIDES_STRUCT = YEs # The size of the symbol lookup cache can be set using LOOKUP_CACHE_SIZE. This # cache is used to resolve symbols given their name and scope. Since this can be # an expensive process and often the same symbol appears multiple times in the # code, doxygen keeps a cache of pre-resolved symbols. If the cache is too small # doxygen will become slower. If the cache is too large, memory is wasted. The # cache size is given by this formula: 2^(16+LOOKUP_CACHE_SIZE). The valid range # is 0..9, the default is 0, corresponding to a cache size of 2^16=65536 # symbols. At the end of a run doxygen will report the cache usage and suggest # the optimal cache size from a speed point of view. # Minimum value: 0, maximum value: 9, default value: 0. LOOKUP_CACHE_SIZE = 0 #--------------------------------------------------------------------------- # Build related configuration options #--------------------------------------------------------------------------- # If the EXTRACT_ALL tag is set to YES, doxygen will assume all entities in # documentation are documented, even if no documentation was available. Private # class members and static file members will be hidden unless the # EXTRACT_PRIVATE respectively EXTRACT_STATIC tags are set to YES. # Note: This will also disable the warnings about undocumented members that are # normally produced when WARNINGS is set to YES. # The default value is: NO. EXTRACT_ALL = NO # If the EXTRACT_PRIVATE tag is set to YES, all private members of a class will # be included in the documentation. 
# The default value is: NO. EXTRACT_PRIVATE = NO # If the EXTRACT_PACKAGE tag is set to YES, all members with package or internal # scope will be included in the documentation. # The default value is: NO. EXTRACT_PACKAGE = NO # If the EXTRACT_STATIC tag is set to YES, all static members of a file will be # included in the documentation. # The default value is: NO. EXTRACT_STATIC = NO # If the EXTRACT_LOCAL_CLASSES tag is set to YES, classes (and structs) defined # locally in source files will be included in the documentation. If set to NO, # only classes defined in header files are included. Does not have any effect # for Java sources. # The default value is: YES. EXTRACT_LOCAL_CLASSES = YES # This flag is only useful for Objective-C code. If set to YES, local methods, # which are defined in the implementation section but not in the interface are # included in the documentation. If set to NO, only methods in the interface are # included. # The default value is: NO. EXTRACT_LOCAL_METHODS = NO # If this flag is set to YES, the members of anonymous namespaces will be # extracted and appear in the documentation as a namespace called # 'anonymous_namespace{file}', where file will be replaced with the base name of # the file that contains the anonymous namespace. By default anonymous namespace # are hidden. # The default value is: NO. EXTRACT_ANON_NSPACES = NO # If the HIDE_UNDOC_MEMBERS tag is set to YES, doxygen will hide all # undocumented members inside documented classes or files. If set to NO these # members will be included in the various overviews, but no documentation # section is generated. This option has no effect if EXTRACT_ALL is enabled. # The default value is: NO. HIDE_UNDOC_MEMBERS = NO # If the HIDE_UNDOC_CLASSES tag is set to YES, doxygen will hide all # undocumented classes that are normally visible in the class hierarchy. If set # to NO, these classes will be included in the various overviews. This option # has no effect if EXTRACT_ALL is enabled. 
# The default value is: NO. HIDE_UNDOC_CLASSES = NO # If the HIDE_FRIEND_COMPOUNDS tag is set to YES, doxygen will hide all friend # (class|struct|union) declarations. If set to NO, these declarations will be # included in the documentation. # The default value is: NO. HIDE_FRIEND_COMPOUNDS = NO # If the HIDE_IN_BODY_DOCS tag is set to YES, doxygen will hide any # documentation blocks found inside the body of a function. If set to NO, these # blocks will be appended to the function's detailed documentation block. # The default value is: NO. HIDE_IN_BODY_DOCS = NO # The INTERNAL_DOCS tag determines if documentation that is typed after a # \internal command is included. If the tag is set to NO then the documentation # will be excluded. Set it to YES to include the internal documentation. # The default value is: NO. INTERNAL_DOCS = NO # If the CASE_SENSE_NAMES tag is set to NO then doxygen will only generate file # names in lower-case letters. If set to YES, upper-case letters are also # allowed. This is useful if you have classes or files whose names only differ # in case and if your file system supports case sensitive file names. Windows # and Mac users are advised to set this option to NO. # The default value is: system dependent. CASE_SENSE_NAMES = YES # If the HIDE_SCOPE_NAMES tag is set to NO then doxygen will show members with # their full class and namespace scopes in the documentation. If set to YES, the # scope will be hidden. # The default value is: NO. HIDE_SCOPE_NAMES = NO # If the HIDE_COMPOUND_REFERENCE tag is set to NO (default) then doxygen will # append additional text to a page's title, such as Class Reference. If set to # YES the compound reference will be hidden. # The default value is: NO. HIDE_COMPOUND_REFERENCE= NO # If the SHOW_INCLUDE_FILES tag is set to YES then doxygen will put a list of # the files that are included by a file in the documentation of that file. # The default value is: YES. 
SHOW_INCLUDE_FILES = NO # If the SHOW_GROUPED_MEMB_INC tag is set to YES then Doxygen will add for each # grouped member an include statement to the documentation, telling the reader # which file to include in order to use the member. # The default value is: NO. SHOW_GROUPED_MEMB_INC = NO # If the FORCE_LOCAL_INCLUDES tag is set to YES then doxygen will list include # files with double quotes in the documentation rather than with sharp brackets. # The default value is: NO. FORCE_LOCAL_INCLUDES = NO # If the INLINE_INFO tag is set to YES then a tag [inline] is inserted in the # documentation for inline members. # The default value is: YES. INLINE_INFO = YES # If the SORT_MEMBER_DOCS tag is set to YES then doxygen will sort the # (detailed) documentation of file and class members alphabetically by member # name. If set to NO, the members will appear in declaration order. # The default value is: YES. SORT_MEMBER_DOCS = YES # If the SORT_BRIEF_DOCS tag is set to YES then doxygen will sort the brief # descriptions of file, namespace and class members alphabetically by member # name. If set to NO, the members will appear in declaration order. Note that # this will also influence the order of the classes in the class list. # The default value is: NO. SORT_BRIEF_DOCS = YES # If the SORT_MEMBERS_CTORS_1ST tag is set to YES then doxygen will sort the # (brief and detailed) documentation of class members so that constructors and # destructors are listed first. If set to NO the constructors will appear in the # respective orders defined by SORT_BRIEF_DOCS and SORT_MEMBER_DOCS. # Note: If SORT_BRIEF_DOCS is set to NO this option is ignored for sorting brief # member documentation. # Note: If SORT_MEMBER_DOCS is set to NO this option is ignored for sorting # detailed member documentation. # The default value is: NO. SORT_MEMBERS_CTORS_1ST = NO # If the SORT_GROUP_NAMES tag is set to YES then doxygen will sort the hierarchy # of group names into alphabetical order. 
If set to NO the group names will # appear in their defined order. # The default value is: NO. SORT_GROUP_NAMES = NO # If the SORT_BY_SCOPE_NAME tag is set to YES, the class list will be sorted by # fully-qualified names, including namespaces. If set to NO, the class list will # be sorted only by class name, not including the namespace part. # Note: This option is not very useful if HIDE_SCOPE_NAMES is set to YES. # Note: This option applies only to the class list, not to the alphabetical # list. # The default value is: NO. SORT_BY_SCOPE_NAME = NO # If the STRICT_PROTO_MATCHING option is enabled and doxygen fails to do proper # type resolution of all parameters of a function it will reject a match between # the prototype and the implementation of a member function even if there is # only one candidate or it is obvious which candidate to choose by doing a # simple string match. By disabling STRICT_PROTO_MATCHING doxygen will still # accept a match between prototype and implementation in such cases. # The default value is: NO. STRICT_PROTO_MATCHING = NO # The GENERATE_TODOLIST tag can be used to enable (YES) or disable (NO) the todo # list. This list is created by putting \todo commands in the documentation. # The default value is: YES. GENERATE_TODOLIST = NO # The GENERATE_TESTLIST tag can be used to enable (YES) or disable (NO) the test # list. This list is created by putting \test commands in the documentation. # The default value is: YES. GENERATE_TESTLIST = YES # The GENERATE_BUGLIST tag can be used to enable (YES) or disable (NO) the bug # list. This list is created by putting \bug commands in the documentation. # The default value is: YES. GENERATE_BUGLIST = YES # The GENERATE_DEPRECATEDLIST tag can be used to enable (YES) or disable (NO) # the deprecated list. This list is created by putting \deprecated commands in # the documentation. # The default value is: YES. 
GENERATE_DEPRECATEDLIST= NO # The ENABLED_SECTIONS tag can be used to enable conditional documentation # sections, marked by \if ... \endif and \cond # ... \endcond blocks. ENABLED_SECTIONS = # The MAX_INITIALIZER_LINES tag determines the maximum number of lines that the # initial value of a variable or macro / define can have for it to appear in the # documentation. If the initializer consists of more lines than specified here # it will be hidden. Use a value of 0 to hide initializers completely. The # appearance of the value of individual variables and macros / defines can be # controlled using \showinitializer or \hideinitializer command in the # documentation regardless of this setting. # Minimum value: 0, maximum value: 10000, default value: 30. MAX_INITIALIZER_LINES = 0 # Set the SHOW_USED_FILES tag to NO to disable the list of files generated at # the bottom of the documentation of classes and structs. If set to YES, the # list will mention the files that were used to generate the documentation. # The default value is: YES. SHOW_USED_FILES = NO # Set the SHOW_FILES tag to NO to disable the generation of the Files page. This # will remove the Files entry from the Quick Index and from the Folder Tree View # (if specified). # The default value is: YES. SHOW_FILES = NO # Set the SHOW_NAMESPACES tag to NO to disable the generation of the Namespaces # page. This will remove the Namespaces entry from the Quick Index and from the # Folder Tree View (if specified). # The default value is: YES. SHOW_NAMESPACES = NO # The FILE_VERSION_FILTER tag can be used to specify a program or script that # doxygen should invoke to get the current version for each file (typically from # the version control system). Doxygen will invoke the program by executing (via # popen()) the command command input-file, where command is the value of the # FILE_VERSION_FILTER tag, and input-file is the name of an input file provided # by doxygen. 
Whatever the program writes to standard output is used as the file # version. For an example see the documentation. FILE_VERSION_FILTER = # The LAYOUT_FILE tag can be used to specify a layout file which will be parsed # by doxygen. The layout file controls the global structure of the generated # output files in an output format independent way. To create the layout file # that represents doxygen's defaults, run doxygen with the -l option. You can # optionally specify a file name after the option, if omitted DoxygenLayout.xml # will be used as the name of the layout file. # # Note that if you run doxygen from a directory containing a file called # DoxygenLayout.xml, doxygen will parse it automatically even if the LAYOUT_FILE # tag is left empty. LAYOUT_FILE = # The CITE_BIB_FILES tag can be used to specify one or more bib files containing # the reference definitions. This must be a list of .bib files. The .bib # extension is automatically appended if omitted. This requires the bibtex tool # to be installed. See also http://en.wikipedia.org/wiki/BibTeX for more info. # For LaTeX the style of the bibliography can be controlled using # LATEX_BIB_STYLE. To use this feature you need bibtex and perl available in the # search path. See also \cite for info how to create references. CITE_BIB_FILES = #--------------------------------------------------------------------------- # Configuration options related to warning and progress messages #--------------------------------------------------------------------------- # The QUIET tag can be used to turn on/off the messages that are generated to # standard output by doxygen. If QUIET is set to YES this implies that the # messages are off. # The default value is: NO. QUIET = YES # The WARNINGS tag can be used to turn on/off the warning messages that are # generated to standard error (stderr) by doxygen. If WARNINGS is set to YES # this implies that the warnings are on. # # Tip: Turn warnings on while writing the documentation. 
# The default value is: YES. WARNINGS = YES # If the WARN_IF_UNDOCUMENTED tag is set to YES then doxygen will generate # warnings for undocumented members. If EXTRACT_ALL is set to YES then this flag # will automatically be disabled. # The default value is: YES. WARN_IF_UNDOCUMENTED = NO # If the WARN_IF_DOC_ERROR tag is set to YES, doxygen will generate warnings for # potential errors in the documentation, such as not documenting some parameters # in a documented function, or documenting parameters that don't exist or using # markup commands wrongly. # The default value is: YES. WARN_IF_DOC_ERROR = YES # This WARN_NO_PARAMDOC option can be enabled to get warnings for functions that # are documented, but have no documentation for their parameters or return # value. If set to NO, doxygen will only warn about wrong or incomplete # parameter documentation, but not about the absence of documentation. # The default value is: NO. WARN_NO_PARAMDOC = NO # The WARN_FORMAT tag determines the format of the warning messages that doxygen # can produce. The string should contain the $file, $line, and $text tags, which # will be replaced by the file and line number from which the warning originated # and the warning text. Optionally the format may contain $version, which will # be replaced by the version of the file (if it could be obtained via # FILE_VERSION_FILTER) # The default value is: $file:$line: $text. WARN_FORMAT = "$file:$line: $text" # The WARN_LOGFILE tag can be used to specify a file to which warning and error # messages should be written. If left blank the output is written to standard # error (stderr). WARN_LOGFILE = doxygen.log #--------------------------------------------------------------------------- # Configuration options related to the input files #--------------------------------------------------------------------------- # The INPUT tag is used to specify the files and/or directories that contain # documented source files. 
You may enter file names like myfile.cpp or # directories like /usr/src/myproject. Separate the files or directories with # spaces. # Note: If this tag is empty the current directory is searched. INPUT = @abs_top_builddir@/docs/include @abs_top_srcdir@ # This tag can be used to specify the character encoding of the source files # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses # libiconv (or the iconv built into libc) for the transcoding. See the libiconv # documentation (see: http://www.gnu.org/software/libiconv) for the list of # possible encodings. # The default value is: UTF-8. INPUT_ENCODING = UTF-8 # If the value of the INPUT tag contains directories, you can use the # FILE_PATTERNS tag to specify one or more wildcard patterns (like *.cpp and # *.h) to filter out the source-files in the directories. If left blank the # following patterns are tested:*.c, *.cc, *.cxx, *.cpp, *.c++, *.java, *.ii, # *.ixx, *.ipp, *.i++, *.inl, *.idl, *.ddl, *.odl, *.h, *.hh, *.hxx, *.hpp, # *.h++, *.cs, *.d, *.php, *.php4, *.php5, *.phtml, *.inc, *.m, *.markdown, # *.md, *.mm, *.dox, *.py, *.f90, *.f, *.for, *.tcl, *.vhd, *.vhdl, *.ucf, # *.qsf, *.as and *.js. FILE_PATTERNS = # The RECURSIVE tag can be used to specify whether or not subdirectories should # be searched for input files as well. # The default value is: NO. RECURSIVE = YES # The EXCLUDE tag can be used to specify files and/or directories that should be # excluded from the INPUT source files. This way you can easily exclude a # subdirectory from a directory tree whose root is specified with the INPUT tag. # # Note that relative paths are relative to the directory from which doxygen is # run. EXCLUDE = # The EXCLUDE_SYMLINKS tag can be used to select whether or not files or # directories that are symbolic links (a Unix file system feature) are excluded # from the input. # The default value is: NO. 
EXCLUDE_SYMLINKS = NO # If the value of the INPUT tag contains directories, you can use the # EXCLUDE_PATTERNS tag to specify one or more wildcard patterns to exclude # certain files from those directories. # # Note that the wildcards are matched against the file with absolute path, so to # exclude all test directories for example use the pattern */test/* EXCLUDE_PATTERNS = # The EXCLUDE_SYMBOLS tag can be used to specify one or more symbol names # (namespaces, classes, functions, etc.) that should be excluded from the # output. The symbol name can be a fully qualified name, a word, or if the # wildcard * is used, a substring. Examples: ANamespace, AClass, # AClass::ANamespace, ANamespace::*Test # # Note that the wildcards are matched against the file with absolute path, so to # exclude all test directories use the pattern */test/* EXCLUDE_SYMBOLS = # The EXAMPLE_PATH tag can be used to specify one or more files or directories # that contain example code fragments that are included (see the \include # command). EXAMPLE_PATH = @abs_top_srcdir@ # If the value of the EXAMPLE_PATH tag contains directories, you can use the # EXAMPLE_PATTERNS tag to specify one or more wildcard pattern (like *.cpp and # *.h) to filter out the source-files in the directories. If left blank all # files are included. EXAMPLE_PATTERNS = # If the EXAMPLE_RECURSIVE tag is set to YES then subdirectories will be # searched for input files to be used with the \include or \dontinclude commands # irrespective of the value of the RECURSIVE tag. # The default value is: NO. EXAMPLE_RECURSIVE = YES # The IMAGE_PATH tag can be used to specify one or more files or directories # that contain images that are to be included in the documentation (see the # \image command). IMAGE_PATH = . # The INPUT_FILTER tag can be used to specify a program that doxygen should # invoke to filter for each input file. 
# Doxygen will invoke the filter program
# by executing (via popen()) the command:
#
# <filter> <input-file>
#
# where <filter> is the value of the INPUT_FILTER tag, and <input-file> is the
# name of an input file. Doxygen will then use the output that the filter
# program writes to standard output. If FILTER_PATTERNS is specified, this tag
# will be ignored.
#
# Note that the filter must not add or remove lines; it is applied before the
# code is scanned, but not when the output code is generated. If lines are added
# or removed, the anchors will not be placed correctly.

INPUT_FILTER =

# The FILTER_PATTERNS tag can be used to specify filters on a per file pattern
# basis. Doxygen will compare the file name with each pattern and apply the
# filter if there is a match. The filters are a list of the form: pattern=filter
# (like *.cpp=my_cpp_filter). See INPUT_FILTER for further information on how
# filters are used. If the FILTER_PATTERNS tag is empty or if none of the
# patterns match the file name, INPUT_FILTER is applied.

FILTER_PATTERNS =

# If the FILTER_SOURCE_FILES tag is set to YES, the input filter (if set using
# INPUT_FILTER) will also be used to filter the input files that are used for
# producing the source files to browse (i.e. when SOURCE_BROWSER is set to YES).
# The default value is: NO.

FILTER_SOURCE_FILES = NO

# The FILTER_SOURCE_PATTERNS tag can be used to specify source filters per file
# pattern. A pattern will override the setting for FILTER_PATTERNS (if any) and
# it is also possible to disable source filtering for a specific pattern using
# *.ext= (so without naming a filter).
# This tag requires that the tag FILTER_SOURCE_FILES is set to YES.

FILTER_SOURCE_PATTERNS =

# If the USE_MDFILE_AS_MAINPAGE tag refers to the name of a markdown file that
# is part of the input, its contents will be placed on the main page
# (index.html). This can be useful if you have a project on for instance GitHub
# and want to reuse the introduction page also for the doxygen output.
USE_MDFILE_AS_MAINPAGE = #--------------------------------------------------------------------------- # Configuration options related to source browsing #--------------------------------------------------------------------------- # If the SOURCE_BROWSER tag is set to YES then a list of source files will be # generated. Documented entities will be cross-referenced with these sources. # # Note: To get rid of all source code in the generated output, make sure that # also VERBATIM_HEADERS is set to NO. # The default value is: NO. SOURCE_BROWSER = NO # Setting the INLINE_SOURCES tag to YES will include the body of functions, # classes and enums directly into the documentation. # The default value is: NO. INLINE_SOURCES = NO # Setting the STRIP_CODE_COMMENTS tag to YES will instruct doxygen to hide any # special comment blocks from generated source code fragments. Normal C, C++ and # Fortran comments will always remain visible. # The default value is: YES. STRIP_CODE_COMMENTS = YES # If the REFERENCED_BY_RELATION tag is set to YES then for each documented # function all documented functions referencing it will be listed. # The default value is: NO. REFERENCED_BY_RELATION = NO # If the REFERENCES_RELATION tag is set to YES then for each documented function # all documented entities called/used by that function will be listed. # The default value is: NO. REFERENCES_RELATION = NO # If the REFERENCES_LINK_SOURCE tag is set to YES and SOURCE_BROWSER tag is set # to YES then the hyperlinks from functions in REFERENCES_RELATION and # REFERENCED_BY_RELATION lists will link to the source code. Otherwise they will # link to the documentation. # The default value is: YES. REFERENCES_LINK_SOURCE = YES # If SOURCE_TOOLTIPS is enabled (the default) then hovering a hyperlink in the # source code will show a tooltip with additional information such as prototype, # brief description and links to the definition and documentation. 
# Since this
# will make the HTML file larger and loading of large files a bit slower, you
# can opt to disable this feature.
# The default value is: YES.
# This tag requires that the tag SOURCE_BROWSER is set to YES.

SOURCE_TOOLTIPS = YES

# If the USE_HTAGS tag is set to YES then the references to source code will
# point to the HTML generated by the htags(1) tool instead of doxygen built-in
# source browser. The htags tool is part of GNU's global source tagging system
# (see http://www.gnu.org/software/global/global.html). You will need version
# 4.8.6 or higher.
#
# To use it do the following:
# - Install the latest version of global
# - Enable SOURCE_BROWSER and USE_HTAGS in the config file
# - Make sure the INPUT points to the root of the source tree
# - Run doxygen as normal
#
# Doxygen will invoke htags (and that will in turn invoke gtags), so these
# tools must be available from the command line (i.e. in the search path).
#
# The result: instead of the source browser generated by doxygen, the links to
# source code will now point to the output of htags.
# The default value is: NO.
# This tag requires that the tag SOURCE_BROWSER is set to YES.

USE_HTAGS = NO

# If the VERBATIM_HEADERS tag is set to YES then doxygen will generate a
# verbatim copy of the header file for each class for which an include is
# specified. Set to NO to disable this.
# See also: Section \class.
# The default value is: YES.

VERBATIM_HEADERS = YES

#---------------------------------------------------------------------------
# Configuration options related to the alphabetical class index
#---------------------------------------------------------------------------

# If the ALPHABETICAL_INDEX tag is set to YES, an alphabetical index of all
# compounds will be generated. Enable this if the project contains a lot of
# classes, structs, unions or interfaces.
# The default value is: YES.
ALPHABETICAL_INDEX = YES # The COLS_IN_ALPHA_INDEX tag can be used to specify the number of columns in # which the alphabetical index list will be split. # Minimum value: 1, maximum value: 20, default value: 5. # This tag requires that the tag ALPHABETICAL_INDEX is set to YES. COLS_IN_ALPHA_INDEX = 3 # In case all classes in a project start with a common prefix, all classes will # be put under the same header in the alphabetical index. The IGNORE_PREFIX tag # can be used to specify a prefix (or a list of prefixes) that should be ignored # while generating the index headers. # This tag requires that the tag ALPHABETICAL_INDEX is set to YES. IGNORE_PREFIX = #--------------------------------------------------------------------------- # Configuration options related to the HTML output #--------------------------------------------------------------------------- # If the GENERATE_HTML tag is set to YES, doxygen will generate HTML output # The default value is: YES. GENERATE_HTML = YES # The HTML_OUTPUT tag is used to specify where the HTML docs will be put. If a # relative path is entered the value of OUTPUT_DIRECTORY will be put in front of # it. # The default directory is: html. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_OUTPUT = html # The HTML_FILE_EXTENSION tag can be used to specify the file extension for each # generated HTML page (for example: .htm, .php, .asp). # The default value is: .html. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_FILE_EXTENSION = .html # The HTML_HEADER tag can be used to specify a user-defined HTML header file for # each generated HTML page. If the tag is left blank doxygen will generate a # standard header. # # To get valid HTML the header file that includes any scripts and style sheets # that doxygen needs, which is dependent on the configuration options used (e.g. # the setting GENERATE_TREEVIEW). 
It is highly recommended to start with a # default header using # doxygen -w html new_header.html new_footer.html new_stylesheet.css # YourConfigFile # and then modify the file new_header.html. See also section "Doxygen usage" # for information on how to generate the default header that doxygen normally # uses. # Note: The header is subject to change so you typically have to regenerate the # default header when upgrading to a newer version of doxygen. For a description # of the possible markers and block names see the documentation. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_HEADER = @abs_top_srcdir@/docs/head.html # The HTML_FOOTER tag can be used to specify a user-defined HTML footer for each # generated HTML page. If the tag is left blank doxygen will generate a standard # footer. See HTML_HEADER for more information on how to generate a default # footer and what special commands can be used inside the footer. See also # section "Doxygen usage" for information on how to generate the default footer # that doxygen normally uses. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_FOOTER = @abs_top_srcdir@/docs/foot.html # The HTML_STYLESHEET tag can be used to specify a user-defined cascading style # sheet that is used by each HTML page. It can be used to fine-tune the look of # the HTML output. If left blank doxygen will generate a default style sheet. # See also section "Doxygen usage" for information on how to generate the style # sheet that doxygen normally uses. # Note: It is recommended to use HTML_EXTRA_STYLESHEET instead of this tag, as # it is more robust and this tag (HTML_STYLESHEET) will in the future become # obsolete. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_STYLESHEET = @abs_top_srcdir@/docs/typical.css # The HTML_EXTRA_STYLESHEET tag can be used to specify additional user-defined # cascading style sheets that are included after the standard style sheets # created by doxygen. 
Using this option one can overrule certain style aspects. # This is preferred over using HTML_STYLESHEET since it does not replace the # standard style sheet and is therefore more robust against future updates. # Doxygen will copy the style sheet files to the output directory. # Note: The order of the extra style sheet files is of importance (e.g. the last # style sheet in the list overrules the setting of the previous ones in the # list). For an example see the documentation. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_EXTRA_STYLESHEET = # The HTML_EXTRA_FILES tag can be used to specify one or more extra images or # other source files which should be copied to the HTML output directory. Note # that these files will be copied to the base HTML output directory. Use the # $relpath^ marker in the HTML_HEADER and/or HTML_FOOTER files to load these # files. In the HTML_STYLESHEET file, use the file name only. Also note that the # files will be copied as-is; there are no commands or markers available. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_EXTRA_FILES = # The HTML_COLORSTYLE_HUE tag controls the color of the HTML output. Doxygen # will adjust the colors in the style sheet and background images according to # this color. Hue is specified as an angle on a colorwheel, see # http://en.wikipedia.org/wiki/Hue for more information. For instance the value # 0 represents red, 60 is yellow, 120 is green, 180 is cyan, 240 is blue, 300 # purple, and 360 is red again. # Minimum value: 0, maximum value: 359, default value: 220. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_COLORSTYLE_HUE = 60 # The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors # in the HTML output. For a value of 0 the output will use grayscales only. A # value of 255 will produce the most vivid colors. # Minimum value: 0, maximum value: 255, default value: 100. 
# This tag requires that the tag GENERATE_HTML is set to YES. HTML_COLORSTYLE_SAT = 13 # The HTML_COLORSTYLE_GAMMA tag controls the gamma correction applied to the # luminance component of the colors in the HTML output. Values below 100 # gradually make the output lighter, whereas values above 100 make the output # darker. The value divided by 100 is the actual gamma applied, so 80 represents # a gamma of 0.8, The value 220 represents a gamma of 2.2, and 100 does not # change the gamma. # Minimum value: 40, maximum value: 240, default value: 80. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_COLORSTYLE_GAMMA = 100 # If the HTML_TIMESTAMP tag is set to YES then the footer of each generated HTML # page will contain the date and time when the page was generated. Setting this # to NO can help when comparing the output of multiple runs. # The default value is: YES. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_TIMESTAMP = YES # If the HTML_DYNAMIC_SECTIONS tag is set to YES then the generated HTML # documentation will contain sections that can be hidden and shown after the # page has loaded. # The default value is: NO. # This tag requires that the tag GENERATE_HTML is set to YES. HTML_DYNAMIC_SECTIONS = NO # With HTML_INDEX_NUM_ENTRIES one can control the preferred number of entries # shown in the various tree structured indices initially; the user can expand # and collapse entries dynamically later on. Doxygen will expand the tree to # such a level that at most the specified number of entries are visible (unless # a fully collapsed tree already exceeds this amount). So setting the number of # entries 1 will produce a full collapsed tree by default. 0 is a special value # representing an infinite number of entries and will result in a full expanded # tree by default. # Minimum value: 0, maximum value: 9999, default value: 100. # This tag requires that the tag GENERATE_HTML is set to YES. 
HTML_INDEX_NUM_ENTRIES = 100 # If the GENERATE_DOCSET tag is set to YES, additional index files will be # generated that can be used as input for Apple's Xcode 3 integrated development # environment (see: http://developer.apple.com/tools/xcode/), introduced with # OSX 10.5 (Leopard). To create a documentation set, doxygen will generate a # Makefile in the HTML output directory. Running make will produce the docset in # that directory and running make install will install the docset in # ~/Library/Developer/Shared/Documentation/DocSets so that Xcode will find it at # startup. See http://developer.apple.com/tools/creatingdocsetswithdoxygen.html # for more information. # The default value is: NO. # This tag requires that the tag GENERATE_HTML is set to YES. GENERATE_DOCSET = NO # This tag determines the name of the docset feed. A documentation feed provides # an umbrella under which multiple documentation sets from a single provider # (such as a company or product suite) can be grouped. # The default value is: Doxygen generated docs. # This tag requires that the tag GENERATE_DOCSET is set to YES. DOCSET_FEEDNAME = "Doxygen generated docs" # This tag specifies a string that should uniquely identify the documentation # set bundle. This should be a reverse domain-name style string, e.g. # com.mycompany.MyDocSet. Doxygen will append .docset to the name. # The default value is: org.doxygen.Project. # This tag requires that the tag GENERATE_DOCSET is set to YES. DOCSET_BUNDLE_ID = org.doxygen.Project # The DOCSET_PUBLISHER_ID tag specifies a string that should uniquely identify # the documentation publisher. This should be a reverse domain-name style # string, e.g. com.mycompany.MyDocSet.documentation. # The default value is: org.doxygen.Publisher. # This tag requires that the tag GENERATE_DOCSET is set to YES. DOCSET_PUBLISHER_ID = org.doxygen.Publisher # The DOCSET_PUBLISHER_NAME tag identifies the documentation publisher. # The default value is: Publisher. 
# This tag requires that the tag GENERATE_DOCSET is set to YES. DOCSET_PUBLISHER_NAME = Publisher # If the GENERATE_HTMLHELP tag is set to YES then doxygen generates three # additional HTML index files: index.hhp, index.hhc, and index.hhk. The # index.hhp is a project file that can be read by Microsoft's HTML Help Workshop # (see: http://www.microsoft.com/en-us/download/details.aspx?id=21138) on # Windows. # # The HTML Help Workshop contains a compiler that can convert all HTML output # generated by doxygen into a single compiled HTML file (.chm). Compiled HTML # files are now used as the Windows 98 help format, and will replace the old # Windows help format (.hlp) on all Windows platforms in the future. Compressed # HTML files also contain an index, a table of contents, and you can search for # words in the documentation. The HTML workshop also contains a viewer for # compressed HTML files. # The default value is: NO. # This tag requires that the tag GENERATE_HTML is set to YES. GENERATE_HTMLHELP = NO # The CHM_FILE tag can be used to specify the file name of the resulting .chm # file. You can add a path in front of the file if the result should not be # written to the html output directory. # This tag requires that the tag GENERATE_HTMLHELP is set to YES. CHM_FILE = # The HHC_LOCATION tag can be used to specify the location (absolute path # including file name) of the HTML help compiler (hhc.exe). If non-empty, # doxygen will try to run the HTML help compiler on the generated index.hhp. # The file has to be specified with full path. # This tag requires that the tag GENERATE_HTMLHELP is set to YES. HHC_LOCATION = # The GENERATE_CHI flag controls if a separate .chi index file is generated # (YES) or that it should be included in the master .chm file (NO). # The default value is: NO. # This tag requires that the tag GENERATE_HTMLHELP is set to YES. 
GENERATE_CHI = NO # The CHM_INDEX_ENCODING is used to encode HtmlHelp index (hhk), content (hhc) # and project file content. # This tag requires that the tag GENERATE_HTMLHELP is set to YES. CHM_INDEX_ENCODING = # The BINARY_TOC flag controls whether a binary table of contents is generated # (YES) or a normal table of contents (NO) in the .chm file. Furthermore it # enables the Previous and Next buttons. # The default value is: NO. # This tag requires that the tag GENERATE_HTMLHELP is set to YES. BINARY_TOC = NO # The TOC_EXPAND flag can be set to YES to add extra items for group members to # the table of contents of the HTML help documentation and to the tree view. # The default value is: NO. # This tag requires that the tag GENERATE_HTMLHELP is set to YES. TOC_EXPAND = NO # If the GENERATE_QHP tag is set to YES and both QHP_NAMESPACE and # QHP_VIRTUAL_FOLDER are set, an additional index file will be generated that # can be used as input for Qt's qhelpgenerator to generate a Qt Compressed Help # (.qch) of the generated HTML documentation. # The default value is: NO. # This tag requires that the tag GENERATE_HTML is set to YES. GENERATE_QHP = NO # If the QHG_LOCATION tag is specified, the QCH_FILE tag can be used to specify # the file name of the resulting .qch file. The path specified is relative to # the HTML output folder. # This tag requires that the tag GENERATE_QHP is set to YES. QCH_FILE = # The QHP_NAMESPACE tag specifies the namespace to use when generating Qt Help # Project output. For more information please see Qt Help Project / Namespace # (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#namespace). # The default value is: org.doxygen.Project. # This tag requires that the tag GENERATE_QHP is set to YES. QHP_NAMESPACE = org.doxygen.Project # The QHP_VIRTUAL_FOLDER tag specifies the namespace to use when generating Qt # Help Project output. 
For more information please see Qt Help Project / Virtual # Folders (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#virtual- # folders). # The default value is: doc. # This tag requires that the tag GENERATE_QHP is set to YES. QHP_VIRTUAL_FOLDER = doc # If the QHP_CUST_FILTER_NAME tag is set, it specifies the name of a custom # filter to add. For more information please see Qt Help Project / Custom # Filters (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#custom- # filters). # This tag requires that the tag GENERATE_QHP is set to YES. QHP_CUST_FILTER_NAME = # The QHP_CUST_FILTER_ATTRS tag specifies the list of the attributes of the # custom filter to add. For more information please see Qt Help Project / Custom # Filters (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#custom- # filters). # This tag requires that the tag GENERATE_QHP is set to YES. QHP_CUST_FILTER_ATTRS = # The QHP_SECT_FILTER_ATTRS tag specifies the list of the attributes this # project's filter section matches. Qt Help Project / Filter Attributes (see: # http://qt-project.org/doc/qt-4.8/qthelpproject.html#filter-attributes). # This tag requires that the tag GENERATE_QHP is set to YES. QHP_SECT_FILTER_ATTRS = # The QHG_LOCATION tag can be used to specify the location of Qt's # qhelpgenerator. If non-empty doxygen will try to run qhelpgenerator on the # generated .qhp file. # This tag requires that the tag GENERATE_QHP is set to YES. QHG_LOCATION = # If the GENERATE_ECLIPSEHELP tag is set to YES, additional index files will be # generated, together with the HTML files, they form an Eclipse help plugin. To # install this plugin and make it available under the help contents menu in # Eclipse, the contents of the directory containing the HTML and XML files needs # to be copied into the plugins directory of eclipse. The name of the directory # within the plugins directory should be the same as the ECLIPSE_DOC_ID value. 
# After copying Eclipse needs to be restarted before the help appears. # The default value is: NO. # This tag requires that the tag GENERATE_HTML is set to YES. GENERATE_ECLIPSEHELP = NO # A unique identifier for the Eclipse help plugin. When installing the plugin # the directory name containing the HTML and XML files should also have this # name. Each documentation set should have its own identifier. # The default value is: org.doxygen.Project. # This tag requires that the tag GENERATE_ECLIPSEHELP is set to YES. ECLIPSE_DOC_ID = org.doxygen.Project # If you want full control over the layout of the generated HTML pages it might # be necessary to disable the index and replace it with your own. The # DISABLE_INDEX tag can be used to turn on/off the condensed index (tabs) at top # of each HTML page. A value of NO enables the index and the value YES disables # it. Since the tabs in the index contain the same information as the navigation # tree, you can set this option to YES if you also set GENERATE_TREEVIEW to YES. # The default value is: NO. # This tag requires that the tag GENERATE_HTML is set to YES. DISABLE_INDEX = YES # The GENERATE_TREEVIEW tag is used to specify whether a tree-like index # structure should be generated to display hierarchical information. If the tag # value is set to YES, a side panel will be generated containing a tree-like # index structure (just like the one that is generated for HTML Help). For this # to work a browser that supports JavaScript, DHTML, CSS and frames is required # (i.e. any modern browser). Windows users are probably better off using the # HTML help feature. Via custom style sheets (see HTML_EXTRA_STYLESHEET) one can # further fine-tune the look of the index. As an example, the default style # sheet generated by doxygen has an example that shows how to put an image at # the root of the tree instead of the PROJECT_NAME. 
# Since the tree basically has
# the same information as the tab index, you could consider setting
# DISABLE_INDEX to YES when enabling this option.
# The default value is: NO.
# This tag requires that the tag GENERATE_HTML is set to YES.

GENERATE_TREEVIEW = YES

# The ENUM_VALUES_PER_LINE tag can be used to set the number of enum values that
# doxygen will group on one line in the generated HTML documentation.
#
# Note that a value of 0 will completely suppress the enum values from appearing
# in the overview section.
# Minimum value: 0, maximum value: 20, default value: 4.
# This tag requires that the tag GENERATE_HTML is set to YES.

ENUM_VALUES_PER_LINE = 4

# If the treeview is enabled (see GENERATE_TREEVIEW) then this tag can be used
# to set the initial width (in pixels) of the frame in which the tree is shown.
# Minimum value: 0, maximum value: 1500, default value: 250.
# This tag requires that the tag GENERATE_HTML is set to YES.

TREEVIEW_WIDTH = 250

# If the EXT_LINKS_IN_WINDOW option is set to YES, doxygen will open links to
# external symbols imported via tag files in a separate window.
# The default value is: NO.
# This tag requires that the tag GENERATE_HTML is set to YES.

EXT_LINKS_IN_WINDOW = NO

# Use this tag to change the font size of LaTeX formulas included as images in
# the HTML documentation. When you change the font size after a successful
# doxygen run you need to manually remove any form_*.png images from the HTML
# output directory to force them to be regenerated.
# Minimum value: 8, maximum value: 50, default value: 10.
# This tag requires that the tag GENERATE_HTML is set to YES.

FORMULA_FONTSIZE = 10

# Use the FORMULA_TRANSPARENT tag to determine whether or not the images
# generated for formulas are transparent PNGs. Transparent PNGs are not
# supported properly for IE 6.0, but are supported on all modern browsers.
#
# Note that when changing this option you need to delete any form_*.png files in
# the HTML output directory before the changes have effect.
# The default value is: YES.
# This tag requires that the tag GENERATE_HTML is set to YES.

FORMULA_TRANSPARENT = YES

# Enable the USE_MATHJAX option to render LaTeX formulas using MathJax (see
# http://www.mathjax.org) which uses client side Javascript for the rendering
# instead of using pre-rendered bitmaps. Use this if you do not have LaTeX
# installed or if you want the formulas to look prettier in the HTML output.
# When enabled you may also need to install MathJax separately and configure the
# path to it using the MATHJAX_RELPATH option.
# The default value is: NO.
# This tag requires that the tag GENERATE_HTML is set to YES.

# Slows page loading too much
USE_MATHJAX = NO

# When MathJax is enabled you can set the default output format to be used for
# the MathJax output. See the MathJax site (see:
# http://docs.mathjax.org/en/latest/output.html) for more details.
# Possible values are: HTML-CSS (which is slower, but has the best
# compatibility), NativeMML (i.e. MathML) and SVG.
# The default value is: HTML-CSS.
# This tag requires that the tag USE_MATHJAX is set to YES.

MATHJAX_FORMAT = HTML-CSS

# When MathJax is enabled you need to specify the location relative to the HTML
# output directory using the MATHJAX_RELPATH option. The destination directory
# should contain the MathJax.js script. For instance, if the mathjax directory
# is located at the same level as the HTML output directory, then
# MATHJAX_RELPATH should be ../mathjax. The default value points to the MathJax
# Content Delivery Network so you can quickly see the result without installing
# MathJax. However, it is strongly recommended to install a local copy of
# MathJax from http://www.mathjax.org before deployment.
# The default value is: http://cdn.mathjax.org/mathjax/latest.
# This tag requires that the tag USE_MATHJAX is set to YES.
MATHJAX_RELPATH = http://cdn.mathjax.org/mathjax/latest

# The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax
# extension names that should be enabled during MathJax rendering. For example
# MATHJAX_EXTENSIONS = TeX/AMSmath TeX/AMSsymbols
# This tag requires that the tag USE_MATHJAX is set to YES.

MATHJAX_EXTENSIONS =

# The MATHJAX_CODEFILE tag can be used to specify a file with javascript pieces
# of code that will be used on startup of the MathJax code. See the MathJax site
# (see: http://docs.mathjax.org/en/latest/output.html) for more details. For an
# example see the documentation.
# This tag requires that the tag USE_MATHJAX is set to YES.

MATHJAX_CODEFILE =

# When the SEARCHENGINE tag is enabled doxygen will generate a search box for
# the HTML output. The underlying search engine uses javascript and DHTML and
# should work on any modern browser. Note that when using HTML help
# (GENERATE_HTMLHELP), Qt help (GENERATE_QHP), or docsets (GENERATE_DOCSET)
# there is already a search function so this one should typically be disabled.
# For large projects the javascript based search engine can be slow, then
# enabling SERVER_BASED_SEARCH may provide a better solution. It is possible to
# search using the keyboard; to jump to the search box use <access key> + S
# (what the <access key> is depends on the OS and browser, but it is typically
# , /