pax_global_header00006660000000000000000000000064135006512660014516gustar00rootroot0000000000000052 comment=214d8be0a3f9cad512636246c967f97249f60996 herisvm-0.9.0/000077500000000000000000000000001350065126600132015ustar00rootroot00000000000000herisvm-0.9.0/Makefile000066400000000000000000000002341350065126600146400ustar00rootroot00000000000000PROJECTNAME = herisvm SUBPRJ = doc scripts:tests MKC_REQD = 0.28.0 NODEPS = *:test-tests test : all-tests test-tests @: .include herisvm-0.9.0/README000066400000000000000000000006051350065126600140620ustar00rootroot00000000000000herisvm project is a collection of simple tools implementing evaluation algorithms for classification (machine learning). In particular heri-eval implements N-fold cross-validation where training and testing is run in parallel. This may be useful if you use multi-CPU computer. Run heri-eval -h, heri-stat -h and heri-split -h for documentation and examples. Also see doc/ subdirectory. herisvm-0.9.0/doc/000077500000000000000000000000001350065126600137465ustar00rootroot00000000000000herisvm-0.9.0/doc/INSTALL000066400000000000000000000007641350065126600150060ustar00rootroot00000000000000Build time dependencies: -- mk-configure (https://github.com/cheusov/mk-configure) is needed for building and installing the project -- pod2man script Runtime dependencies: -- bash -- ruby>=1.9.3 -- modern awk (gnu awk and nawk are good enough) Examples of how to build # cd herisvm-x.y.z # mkcmake all # mkcmake install or $ cd herisvm-x.y.z $ export PREFIX=/usr MANDIR=/usr/share/man SYSCONFDIR=/etc $ mkcmake all $ mkcmake install DESTDIR=/tmp/destdir herisvm-0.9.0/doc/LICENSE000066400000000000000000000021551350065126600147560ustar00rootroot00000000000000Copyright (c) 2015 Alexandra Figlovskaya Copyright (c) 2015 Aleksey Cheusov Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, 
including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. herisvm-0.9.0/doc/Makefile000066400000000000000000000001701350065126600154040ustar00rootroot00000000000000FILES = LICENSE NEWS ../README TODO FILESDIR = ${DOCDIR} DOCDIR ?= ${DATADIR}/doc/herisvm .include herisvm-0.9.0/doc/NEWS000066400000000000000000000002331350065126600144430ustar00rootroot00000000000000====================================================================== Version 0.1.0, Sat, 13 Jun 2015 12:53:02 +0300 initial publicly available release herisvm-0.9.0/doc/TODO000066400000000000000000000003321350065126600144340ustar00rootroot00000000000000* heri-eval: - heri-eval -T: target class - Repeated random sub-sampling heri-eval -t 10 -r 60 ... 
- Alternative formats (crfsuite) for heri-split - Support for IE (no classes, just information extraction) herisvm-0.9.0/scripts/000077500000000000000000000000001350065126600146705ustar00rootroot00000000000000herisvm-0.9.0/scripts/Makefile000066400000000000000000000002441350065126600163300ustar00rootroot00000000000000SCRIPTS = heri-eval heri-split heri-stat heri-stat-addons MAN = heri-eval.1 heri-split.1 heri-stat.1 heri-stat-addons.1 CLEANFILES = ${MAN} .include herisvm-0.9.0/scripts/heri-eval000077500000000000000000000254701350065126600165020ustar00rootroot00000000000000#!/usr/bin/env bash # Copyright (c) 2015 Alexandra Figlovskaya # Copyright (c) 2015-2019 Aleksey Cheusov # # Permission is hereby granted, free of charge, to any person obtaining # a copy of this software and associated documentation files (the # "Software"), to deal in the Software without restriction, including # without limitation the rights to use, copy, modify, merge, publish, # distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so, subject to # the following conditions: # # The above copyright notice and this permission notice shall be # included in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
# variables settable by user : ${SVM_TRAIN_CMD:=svm-train} : ${SVM_PREDICT_CMD:=svm-predict} : ${SVM_HERI_STAT_CMD:=heri-stat} : ${SVM_HERI_STAT_ADDONS_CMD:=heri-stat-addons} : ${SVM_HERI_SPLIT_CMD:=heri-split} : ${TMPDIR:=/tmp} ############################################################ set -e export LC_ALL=C indent2 (){ sed '/./ s/^/ /' "$@" } sig_handler (){ on_exit trap - "$1" kill -"$1" $$ } on_exit(){ show_stderr if test -z "$keep_tmp"; then if test -n "$tmp_dir"; then rm -rf "$tmp_dir" fi else echo "Temporary files are here $tmp_dir" 1>&2 fi } calculate_feature_count (){ awk '{ for (i=2; i <= NF; ++i) { if ($i + 0 > m) m = $i + 0 } } END { print m+1 }' "$@" } calculate_feature_count (){ awk '{ for (i=2; i <= NF; ++i) { if ($i + 0 > m) m = $i + 0 } } END { print m+1 }' "$@" } predictions_from_testing_sets (){ if ! test -s "$tmp_dir/testing_fold.txt"; then cat "$tmp_dir/outcome_and_prediction1.txt" return fi awk ' FNR == NR { # reading testing_fold.txt ++obj_num[$1] testobj[$1,obj_num[$1]] = NR next } # reading predictions on testing folds FNR == 1 { ++fold_num } { idx = testobj[fold_num, FNR] prediction [idx] = $0 } END { if ((NR % 2) != 0){ print "internal error!" > "/dev/stderr" exit 12 } count = NR/2 for (i=1; i <= count; ++i){ print prediction [i] } }' "$tmp_dir/testing_fold.txt" $prediction_all } show_stderr (){ if test -z "$last"; then return fi for i in `seq $last`; do # fn="$tmp_dir/train_stderr${i}.txt" if test -s "$fn"; then echo "---- train stderr $i ----" 1>&2 cat -- "$fn" 1>&2 fi # fn="$tmp_dir/predict_stderr${i}.txt" if test -s "$fn"; then echo "---- predict stderr $i ----" 1>&2 cat -- "$fn" 1>&2 fi done } wait_all (){ local i local ex ex=0 for i in `seq $last`; do if wait ${pid[$i]}; then : else ex=$? 
    fi
  done
  return "$ex"
}

# heri-eval -t10 -n 5 dataset.libsvm # 10*5-fold cross-validation
usage(){
    cat 1>&2 <<'EOF'
usage: heri-eval [OPTIONS] training_set [-- SVM_TRAIN_OPTIONS]
OPTIONS:
  -h Help message
  -n The number of folds for T*N-fold cross-validation
  -r The ratio (in percents) for training set for hold-out
  -e testing_set Testing set for hold-out
  -t The number of runs for T*N-fold cross-validation or holdouts
  -T Threshold for score
  -o Save predictions from testing sets to the specified file
     (outcome_tag prediction_tag [score])
  -O Save incorrectly classified objects to the specified file
     (#object_number: outcome_tag prediction_tag [score])
  -m Save confusion matrix to the specified file
     (frequency : outcome_tag prediction_tag)
  -f Enable output of per-fold statistics (see -Mf)
  -M Output mode:
       t -- output total statistics,
       f -- output per-fold statistics,
       c -- output cross-fold statistics.
  -s Options passed to heri-split(1)
  -p Options passed to heri-stat(1)
  -S Seed value passed to heri-split(1). If it is not specified, the
     dataset is split into training and testing datasets randomly.
-K Keep temporary directory after exiting -D Debugging mode, implies -K SVM_TRAIN_OPTIONS: options passed to svm-train(1) and alike Environment variables: SVM_TRAIN_CMD -- training utility, e.g., liblinear-train (the default is svm-train) SVM_PREDICT_CMD -- predicting utility, e.g., liblinear-predict (the default is svm-predict) TMPDIR -- temporary directory (the default is /tmp) Examples: Ex1: heri-eval -e testing_set.libsvm training_set.libsvm -- -s 0 -t 0 Ex2: export SVM_TRAIN_CMD='liblinear-train' export SVM_PREDICT_CMD='liblinear-predict' heri-eval -p '-mr' -n 5 training_set.libsvm -- -s 4 -q heri-eval -p '-mr' -n 5 training_set.libsvm -- -s 4 -q Ex3: export SVM_TRAIN_CMD='scikit_rf-train --estimators=400' export SVM_PREDICT_CMD='scikit_rf-predict' heri-eval -p '-c' -Mt -t 50 -r 70 dataset.libsvm EOF } seed=$RANDOM runs=1 output_mode=tc times=1 while getopts De:fhKm:M:n:o:O:p:r:s:S:t:T: f; do case "$f" in '?') usage exit 1;; h) usage exit 0;; n) number_of_folds="$OPTARG";; e) testing_set="$OPTARG";; t) times="$OPTARG";; T) heristat_args="$heristat_args -t$OPTARG";; r) ratio="$OPTARG";; m) confusion_matrix="$OPTARG";; o) predictions="$OPTARG";; O) incorrect_predictions="$OPTARG";; s) herisplit_args="$herisplit_args $OPTARG";; p) heristat_args="$heristat_args $OPTARG";; f) output_mode="f$output_mode";; M) output_mode="$OPTARG";; S) seed="$OPTARG";; K) keep_tmp=1;; D) keep_tmp=1 debug=1;; esac done shift `expr $OPTIND - 1` while test "$#" -gt 0; do case "$1" in --) shift break;; *) print_sh=`printf '%q' "$1"` files="$files $print_sh" shift;; esac done trap "sig_handler INT" INT trap "on_exit" 0 if test -z "$number_of_folds" -a -z "$testing_set" -a -z "$ratio"; then echo 'Either -n or -r or -e must be specified, run heri-eval -h for details' 1>&2 exit 1 fi if test -z "$files"; then echo 'Training set is mandatory, run heri-eval -h for details' 1>&2 exit 1 fi tmp_dir=`mktemp -d "$TMPDIR"/svm.XXXXXX` training_testing (){ if test -n "$number_of_folds"; then 
${SVM_HERI_SPLIT_CMD} $herisplit_args -c "$number_of_folds" -d "$tmp_dir" -s "$seed" $files seed="$((seed+1))" last="$number_of_folds" for i in `seq $number_of_folds`; do mv "$tmp_dir/test$i.txt" "$tmp_dir/test$i.libsvm" mv "$tmp_dir/train$i.txt" "$tmp_dir/train$i.libsvm" done elif test -n "$ratio"; then ${SVM_HERI_SPLIT_CMD} $herisplit_args -R "$ratio" -d "$tmp_dir" -s "$seed" $files mv "$tmp_dir/test.txt" "$tmp_dir/test1.libsvm" mv "$tmp_dir/train.txt" "$tmp_dir/train1.libsvm" rm "$tmp_dir/testing_fold.txt" seed="$((seed+1))" last=1 else eval "cat -- $files" > "$tmp_dir/train1.libsvm" cp "$testing_set" "$tmp_dir/test1.libsvm" last=1 fi for i in `seq $last`; do ${SVM_TRAIN_CMD} "$@" "$tmp_dir/train$i.libsvm" "$tmp_dir/model$i.bin" \ 2> "$tmp_dir/train_stderr${i}.txt" \ > "$tmp_dir/train_stdout${i}.txt" & pid[$i]=$! done wait_all for i in `seq $last`; do ${SVM_PREDICT_CMD} "$tmp_dir/test$i.libsvm" "$tmp_dir/model$i.bin" \ "$tmp_dir/prediction${i}.txt" \ 2> "$tmp_dir/predict_stderr${i}.txt" \ > "$tmp_dir/predict_stdout${i}.txt" & pid[$i]=$! 
done wait_all rm -f "$tmp_dir/outcome.txt" "$tmp_dir/prediction.txt" } show_stat (){ for t in `seq $times`; do prediction_all='' for i in `seq $last`; do awk '{print $1}' "$tmp_dir/test${t}_$i.libsvm" > "$tmp_dir/outcome${t}_${i}.txt" paste "$tmp_dir/outcome${t}_${i}.txt" "$tmp_dir/prediction${t}_${i}.txt" | \ tr ' ' ' ' > "$tmp_dir/outcome_and_prediction${t}_${i}.txt" ${SVM_HERI_STAT_CMD} -1R $heristat_args \ "$tmp_dir/outcome_and_prediction${t}_${i}.txt" > "$tmp_dir/stats${t}_${i}.txt" if [[ "_$output_mode" =~ f ]]; then echo "Fold ${t}x$i statistics" ${SVM_HERI_STAT_CMD} -1 $heristat_args "$tmp_dir/outcome_and_prediction${t}_${i}.txt" | indent2 echo '' fi ln -f "$tmp_dir/outcome_and_prediction${t}_${i}.txt" "$tmp_dir/outcome_and_prediction${i}.txt" prediction_all="$prediction_all $tmp_dir/outcome_and_prediction${t}_${i}.txt" done done } export HERISVM_FC=`calculate_feature_count $files $testing_set` for t in `seq $times`; do training_testing "$@" # ls -l "$tmp_dir/" for i in `seq $last`; do mv "$tmp_dir/train${i}.libsvm" "$tmp_dir/train${t}_$i.libsvm" mv "$tmp_dir/test${i}.libsvm" "$tmp_dir/test${t}_$i.libsvm" mv "$tmp_dir/prediction${i}.txt" "$tmp_dir/prediction${t}_${i}.txt" mv "$tmp_dir/train_stderr${i}.txt" "$tmp_dir/train_stderr${t}_$i.txt" mv "$tmp_dir/train_stdout${i}.txt" "$tmp_dir/train_stdout${t}_$i.txt" mv "$tmp_dir/predict_stderr${i}.txt" "$tmp_dir/predict_stderr${t}_$i.txt" mv "$tmp_dir/predict_stdout${i}.txt" "$tmp_dir/predict_stdout${t}_$i.txt" if test -f "$tmp_dir/model${i}.bin"; then mv "$tmp_dir/model${i}.bin" "$tmp_dir/model${t}_$i.bin" fi done # rm "$tmp_dir/test${i}.txt" "$tmp_dir/prediction${i}.txt" done #echo before test #ls -l "$tmp_dir" show_stat #echo after test predictions_from_testing_sets > "$tmp_dir/prediction.txt" # -o if test -n "$predictions"; then cp "$tmp_dir/prediction.txt" "$predictions" fi # -O if test -n "$incorrect_predictions"; then awk '$1 != $2 {print "#" NR, $0}' "$tmp_dir/prediction.txt" \ > "$incorrect_predictions" 
fi # -m if test -n "$confusion_matrix"; then awk '$1 != $2 {print $1, $2}' "$tmp_dir/prediction.txt" | sort | uniq -c | sort -rn | awk '{print $1, ":", $2, $3}' > "$confusion_matrix" fi rm "$tmp_dir/prediction.txt" # if [[ "_$output_mode" =~ t ]]; then echo 'Total statistics' ${SVM_HERI_STAT_CMD} -1 $heristat_args "$tmp_dir"/outcome_and_prediction*_*.txt | indent2 echo '' fi if test -n "$number_of_folds" && [[ "_$output_mode" =~ c ]]; then echo 'Total cross-folds statistics' ${SVM_HERI_STAT_ADDONS_CMD} "$tmp_dir"/stats*.txt | indent2 fi herisvm-0.9.0/scripts/heri-eval.pod000066400000000000000000000065111350065126600172530ustar00rootroot00000000000000=head1 NAME heri-eval - evaluate classification algorithm =head1 SYNOPSIS B [OPTIONS] I [-- SVM_TRAIN_OPTIONS] =head1 DESCRIPTION B runs training algorithm on I and then evaluate it using testing set, specified by option I<-e>. If option I<-n> was applied, cross-validation is used for evaluation, training and testing on different folds are run in parallel, thus utilizing available CPUs. If I<-r> is used, the dataset is splitted into training and testing datasets randomly with the specified ratio, and then holdout is run. =head1 OPTIONS =over 6 =item B<-h, --help> Display help information. =item B<-f> Enable output of per-fold statistics. See B<-M>I. =item B<-n> I Enable T*I-fold cross-validation mode and set the number of folds to I. =item B<-r> I Split the dataset into training and testing parts with the specified ratio of their sizes (in percents). =item B<-t> I Enable I*N-fold cross-validation mode and set the number of runs to I which 1 by default. =item B<-e> I Enable hold-out mode and set the testing dataset. =item B<-T> I Set the minimum threshold for making a classification decision. If this flag is applied, micro-average precision, recall, and F1 are calculated instead of accuracy. =item B<-o> I Save predictions from testing sets to the specified file. 
Format: outcome_class prediction_class [score]

=item B<-O> I<filename>

Save incorrectly classified objects to the specified file.
Format: #object_number: outcome_class prediction_class [score]

=item B<-m> I<filename>

Save confusion matrix to the specified file.
Format: frequency : outcome_class prediction_class

=item B<-p> I<options>

Pass the specified I<options> to B<heri-stat>.

=item B<-s> I<options>

Pass the specified I<options> to B<heri-split>.

=item B<-M> I<chars>

Set the output mode, where I<chars> are:
t -- output total statistics,
f -- output per-fold statistics,
c -- output cross-fold statistics.
The default is "-M tc".

=item B<-S> I<seed>

Pass the specified I<seed> to B<heri-split>.

=item B<-K>

Keep temporary directory after exiting.

=item B<-D>

Turn on the debugging mode, implies -K.

=back

=head1 EXAMPLES

=over 1

  heri-eval -e testing_set.libsvm training_set.libsvm -- -s 0 -t 0

  export SVM_TRAIN_CMD='liblinear-train'
  export SVM_PREDICT_CMD='liblinear-predict'
  heri-eval -p '-mr' -n 5 training_set.libsvm -- -s 4 -q

  export SVM_TRAIN_CMD='scikit_rf-train --estimators=400'
  export SVM_PREDICT_CMD='scikit_rf-predict'
  heri-eval -p '-c' -Mt -t 50 -r 70 dataset.libsvm

=back

=head1 ENVIRONMENT

=over 6

=item I<SVM_TRAIN_CMD>

Training utility, e.g., liblinear-train (the default is svm-train).

=item I<SVM_PREDICT_CMD>

Predicting utility, e.g., liblinear-predict (the default is svm-predict).

=item I<SVM_HERI_STAT_CMD>

Utility for calculating statistics (the default is B<heri-stat>).

=item I<SVM_HERI_STAT_ADDONS_CMD>

Utility for calculating additional statistics (the default is B<heri-stat-addons>).

=item I<SVM_HERI_SPLIT_CMD>

Utility for splitting the dataset (the default is B<heri-split>).

=item I<TMPDIR>

Temporary directory (the default is /tmp).
=back =head1 HOME L =head1 SEE ALSO L L herisvm-0.9.0/scripts/heri-split000077500000000000000000000151141350065126600167000ustar00rootroot00000000000000#!/usr/bin/env ruby # Copyright (c) 2015 Alexandra Figlovskaya # Copyright (c) 2015-2019 Aleksey Cheusov # # Permission is hereby granted, free of charge, to any person obtaining # a copy of this software and associated documentation files (the # "Software"), to deal in the Software without restriction, including # without limitation the rights to use, copy, modify, merge, publish, # distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so, subject to # the following conditions: # # The above copyright notice and this permission notice shall be # included in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. require 'optparse' $options = {} $fold_cnt = nil $ratio = nil $tmp_dir = nil $seed = Random.new_seed $stratified = true OptionParser.new do |opts| opts.banner = <.txt and train.txt files (also in svmlight format) are created, where N is the number of fold. If option -R is specified, test.txt and train.txt files are created for the same purposes. Also testing_fold.txt file is created, where for each object (one per line) its testing fold number is specified if oprion -c is applied. The file testing_fold.txt contain either 1 for testing set and 0 for training set, if option -R is applied. Usage: heri-split [OPTIONS] dataset1 [dataset2...] 
OPTIONS: EOF opts.on('-h', '--help','display this message and exit') do puts opts exit 0 end opts.on("-cFOLD_CNT", "--folds=FOLD_CNT", "A number if folds (mandatory option)") do |c| $fold_cnt = c.to_i end opts.on("-dDIR", "--output-dir=DIR", "Output directory (mandatory option)") do |d| $tmp_dir = d end opts.on("-sSEED", "--seed=SEED", "Seed for pseudo-random number generator") do |s| if s != "" then $seed = s.to_i end end opts.on("-r", "--random", "Use random split instead of stratified") do $stratified = false end opts.on("-R", "--ratio=RATIO", "Split input dataset into training and testing sets with the specified RATIO (in percents)") do |r| $ratio = r.to_i $fold_cnt = 0 end opts.separator " " end.parse! if $tmp_dir == nil or ($fold_cnt == nil and $ratio == nil) then STDERR.puts("Options -c/-R and -d are mandatory, see heri-split -h for details") exit(1) end $rnd = Random.new($seed) # same as in StratifiedSplitter $files_test = [] $files_train = [] $testing_fold = File.open($tmp_dir+"/testing_fold.txt", 'w:ASCII-8BIT') (1..$fold_cnt).each do |i| name_train = "train" + "#{i.to_i}" name_test = "test" + "#{i.to_i}" $files_test << File.open($tmp_dir+"/"+name_test+".txt", 'w:ASCII-8BIT') $files_train << File.open($tmp_dir+ "/"+ name_train+".txt", 'w:ASCII-8BIT') end if $ratio != nil $files_test << File.open($tmp_dir+"/test.txt", 'w:ASCII-8BIT') $files_train << File.open($tmp_dir+ "/train.txt", 'w:ASCII-8BIT') end def random_split_cv() nums = [] curr_number = 0 ARGV.each do |fn| File.open(fn, "r:ASCII-8BIT").each_line do |line| if line =~ /^([^\s]+)\s/ nums << curr_number % $fold_cnt curr_number += 1 end end end nums.shuffle!(random: $rnd) curr_number = 0 ARGV.each do |fn| File.open(fn, "r:ASCII-8BIT").each_line do |line| if line =~ /^([^\s]+)\s/ fold_num = nums[curr_number] $fold_cnt.times do |n| if fold_num == n $files_test[n].puts line $testing_fold.puts n+1 else $files_train[n].puts line end end curr_number += 1 end end end end def random_split_holdout() nums = [] 
threshold = 1 line_count = 0 ARGV.each do |fn| File.open(fn, "r:ASCII-8BIT").each_line do |line| if line =~ /^([^\s]+)\s/ line_count += 1 nums << $rnd.rand() end end end threshold = (line_count.to_f * $ratio / 100).to_i if threshold == 0 threshold = 1 end threshold = nums.sort[threshold] curr_number = 0 ARGV.each do |fn| File.open(fn, "r:ASCII-8BIT").each_line do |line| if line =~ /^([^\s]+)\s/ if nums[curr_number] < threshold $files_train[0].puts line $testing_fold.puts '0' else $files_test[0].puts line $testing_fold.puts '1' end curr_number += 1 end end end end def random_split() if $fold_cnt != nil and $fold_cnt > 0 random_split_cv() else random_split_holdout() end end def stratified_split() classes = Hash.new(0) ARGV.each do |fn| File.open(fn, "r:ASCII-8BIT").each_line do |line| if line =~ /^([^\s]+)\s/ classes[$1] += 1 end end end classes_arr = {} classes.each do |x, y| arr = [] y.times do |i| arr << i end arr.shuffle!(random: $rnd) classes_arr [x] = {} arr.each_index do |i| if $ratio != nil cnt = (y.to_f * $ratio / 100).to_i fold_train = (arr[i] < cnt) classes_arr[x][i] = fold_train else fold_train = (i * $fold_cnt.to_f / arr.size).to_i classes_arr[x][arr[i]] = fold_train end end end num_line = Hash.new(0) ARGV.each do |fn| File.open(fn, "r:ASCII-8BIT").each_line do |line| if line =~ /^([^\s]+)\s/ curr_number = num_line[$1] if $ratio != nil if classes_arr[$1][curr_number] $files_train[0].puts line $testing_fold.puts "0" else $files_test[0].puts line $testing_fold.puts "1" end else $fold_cnt.times do |n| if classes_arr[$1][curr_number] == n $files_test[n].puts line $testing_fold.puts n+1 else $files_train[n].puts line end end end num_line[$1] += 1 end end end end if $stratified stratified_split() else random_split() end $files_test.each { |x| x.close } $files_train.each { |x| x.close } $testing_fold.close herisvm-0.9.0/scripts/heri-split.pod000066400000000000000000000030271350065126600174560ustar00rootroot00000000000000=head1 NAME heri-split - splits the 
dataset into training and testing sets

=head1 SYNOPSIS

B<heri-split> [OPTIONS] I<dataset1> [I<dataset2>...]

=head1 DESCRIPTION

B<heri-split> splits the dataset into several training and testing
sets, as required for N-fold cross-validation. The dataset contains
one object per line, as in svmlight format. By default stratified
sampling is used, that is, all folds contain approximately the same
number of objects for each label. If option B<-c> is specified, files
testI<N>.txt and trainI<N>.txt (also in svmlight format) are created,
where I<N> is the fold number. If option B<-R> is specified, files
test.txt and train.txt are created for the same purpose. The file
testing_fold.txt is also created, with one line per object. If option
B<-c> is applied, each line contains the object's testing fold number.
If option B<-R> is applied, each line contains 1 for an object in the
testing set and 0 for an object in the training set.

=head1 OPTIONS

=over 6

=item B<-h, --help>

Display help information.

=item B<-c, --folds> I<FOLD_CNT>

Set the number of folds. Either this option or B<-R> is mandatory.

=item B<-d, --output-dir> I<DIR>

Set the output directory. This is a mandatory option.

=item B<-r,--random>

Use random sampling instead of stratified sampling.

=item B<-R,--ratio>

Split the input dataset into training and testing sets in the
specified ratio (in percents).

=item B<-s, --seed> I<SEED>

Set the seed value for the pseudorandom number generator.
=back =head1 HOME L =head1 SEE ALSO L L herisvm-0.9.0/scripts/heri-stat000077500000000000000000000240061350065126600165200ustar00rootroot00000000000000#!/usr/bin/env ruby # Copyright (c) 2015 Alexandra Figlovskaya # Copyright (c) 2015-2019 Aleksey Cheusov # # Permission is hereby granted, free of charge, to any person obtaining # a copy of this software and associated documentation files (the # "Software"), to deal in the Software without restriction, including # without limitation the rights to use, copy, modify, merge, publish, # distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so, subject to # the following conditions: # # The above copyright notice and this permission notice shall be # included in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
require 'optparse' require 'set' @options = {} @err = nil @unspecified_class="__strelka_i_raketa__" def print_pretty(class_name, p, p_comment, r, r_comment, f1, f1_comment) puts "%13s P, R, F1: %-6.4g %-13s, %-6.4g %-13s, %-6.4g" \ % [class_name, p, p_comment, r, r_comment, f1, f1_comment] end def print_accuracy_pretty(a, a_comment) puts "Accuracy : %-6.4g %-13s" % [a, a_comment] end def print_raw(class_name, p, p_comment, r, r_comment, f1, f1_comment) puts "#{class_name}\tP\t#{p}\t#{p_comment.strip}" puts "#{class_name}\tR\t#{r}\t#{r_comment.strip}" puts "#{class_name}\tF1\t#{f1}\t#{f1_comment.strip}" end def print_accuracy_raw(a, a_comment) puts "\tA\t#{a}\t#{a_comment.strip}" end def print_stat(class_name, p, p_comment, r, r_comment, f1, f1_comment) if @options[:raw] print_raw(class_name, p, p_comment, r, r_comment, f1, f1_comment) else print_pretty(class_name, p, p_comment, r, r_comment, f1, f1_comment) end end def print_accuracy(a, a_comment) if @options[:raw] print_accuracy_raw(a, a_comment) else print_accuracy_pretty(a, a_comment) end end def pretty_div(a, b) "%5s/%-5s" % [a, b] end def normalize_tag(tag) tag = tag.to_s.sub(/^[+]/, "") # +1 => 1 if tag =~ /^-?[0-9]+[.][0-9]+$/ tag = tag.sub(/[.]0+$/, "") # -1.0000 => -1 end return tag end def split_into_3(line, fn) line = line.gsub(/\s+/, " ").strip() ret = ["", "", Float::MAX] tokens = line.split(/ /) case tokens.size when 2 ret = [normalize_tag(tokens[0]), normalize_tag(tokens[1]), Float::MAX] when 3 ret = [normalize_tag(tokens[0]), normalize_tag(tokens[1]), tokens[2].to_f] else ret = [normalize_tag(tokens[0]), normalize_tag(tokens[1]), Float::MAX] line.sub!(/^fake ?/, "") STDERR.puts("Bad line '#{line}' in file '#{fn}'") @err = 1 end if ret [2] < @options[:treshold] ret [1] = @unspecified_class end return ret end def mse(outcomes, predictions) sum = 0.0 outcomes.each_index do |idx| sum += (outcomes[idx] - predictions[idx]) ** 2 end return sum / outcomes.length.to_f end def rmse(outcomes, predictions) 
return Math.sqrt(mse(outcomes, predictions)) end def mae(outcomes, predictions) sum = 0.0 outcomes.each_index do |idx| sum += (outcomes[idx] - predictions[idx]).abs end return sum / outcomes.length.to_f end @options[:excluded] = Set.new() OptionParser.new do |opts| opts.banner = < heri-stat -1 [-g mode] [-R] [OPTIONS] [files...] OPTIONS: EOF opts.on('-h', '--help','display this message and exit') do puts opts exit 0 end @options[:raw] = false opts.on('-R', '--raw','raw tab-separated output') do @options[:raw] = true end @options[:micro_avg] = false opts.on('-m', '--micro-avg','disable micro averaged P/R/F1 output') do @options[:micro_avg] = true end @options[:macro_avg] = false opts.on('-r', '--macro-avg','disable macro averaged P/R/F1 output') do @options[:macro_avg] = true end @options[:statistics] = false opts.on('-c', '--per-class','disable output of per-class statistics') do @options[:statistics] = true end @options[:accuracy] = false opts.on('-a', '--accuracy','disable output of accuracy') do @options[:accuracy] = true end @options[:single] = false opts.on('-1', '--single','obtain both outcomes and predicted classes from single source. 
If this option is specified, the first token on input represents the outcome class and second one -- predicted class') do @options[:single] = true end @options[:unclassified] = false opts.on("-u", "--unclassified=UNCLASSIFIED", 'set the label for "unclassified" object') do |u| @options[:unclassified] = u.to_s end @options[:treshold] = -Float::MAX opts.on("-t", "--treshold=TRESHOLD", 'Minimal treshold for score') do |u| @options[:treshold] = u.to_f @options[:unclassified] = @unspecified_class end @options[:excluded_labels] = false opts.on("-x", "--exclude=LABELS", 'exclude the specified labels from statistics') do |x| @options[:excluded] = Set.new(x.split(",")) end opts.on("-g", "--regression=mode", 'Calculate MAE, MSE or RMSE') do |g| @options[:regression] = true @options[:mse] = g.include?("s") @options[:rmse] = g.include?("r") @options[:mae] = g.include?("a") if /[^sra]/ =~ g STDERR.puts "Invalid mode '#{g}' for option -g" exit 1 end end opts.separator " " end.parse! if @options[:unclassified] @options[:accuracy]=true else @options[:micro_avg]=true end def get_up_to_two_tokens(s) s = s.strip s = s.sub(/^([^\s]+\s([^\s]+))\s.*$/, "\\1") return s end def get_up_to_three_tokens(s) s = s.strip s = s.sub(/^([^\s]+\s([^\s]+)\s([^\s]+))\s.*$/, "\\1") return s end if @options[:single] outcome_tags = [] prediction_tags = [] while line = gets do gt, rt, fake = split_into_3(get_up_to_three_tokens(line), "") outcome_tags << gt prediction_tags << rt end else outcome_tags = IO.read(ARGV[0]).split("\n").map! do |x| get_up_to_two_tokens(x) end prediction_tags = IO.read(ARGV[1]).split("\n").map! 
do |x| get_up_to_two_tokens(x) end if outcome_tags.length != prediction_tags.length STDERR.puts("Dataset and prediction files should contain the same amount of lines"); exit 1 end outcome_tags.each_index do |i| fake1, outcome_tags[i], fake = split_into_3("fake " + outcome_tags[i], ARGV[0]) fake1, prediction_tags[i], fake = split_into_3("fake " + prediction_tags[i], ARGV[1]) end end exit 1 if @err if @options[:regression] outcome_tags.map! {|x| x.to_f} prediction_tags.map! {|x| x.to_f} mae_value = mae(outcome_tags, prediction_tags) mse_value = mse(outcome_tags, prediction_tags) rmse_value = rmse(outcome_tags, prediction_tags) if @options[:mse] if @options[:raw] puts "\tMSE\t#{mse_value}" else puts "MSE: #{mse_value}" end end if @options[:rmse] if @options[:raw] puts "\tRMSE\t#{rmse_value}" else puts "RMSE: #{rmse_value}" end end if @options[:mae] if @options[:raw] puts "\tMAE\t#{mae_value}" else puts "MAE: #{mae_value}" end end exit 0 end tag2outcome_cnt = Hash.new(0) tag2prediction_cnt = Hash.new(0) tag2TP_cnt = Hash.new(0) all_precision = 0 all_recall = 0 outcome_tags.each_index do |i| gt = outcome_tags[i] rt = prediction_tags[i] # make sure hash cell exists tag2TP_cnt[gt] += 0 tag2prediction_cnt[gt] += 0 tag2outcome_cnt[gt] += 1 if gt != @options[:unclassified] if @options[:excluded].include?(rt) next end if rt != @options[:unclassified] tag2prediction_cnt[rt] += 1 tag2TP_cnt[rt] += (gt == rt ? 1 : 0) end end @options[:excluded].each do |excluded| tag2TP_cnt.delete(excluded) tag2prediction_cnt.delete(excluded) end all_tp = 0 all_f1 = 0 res_tag2TP_cnt = tag2TP_cnt.sort_by { |key, value| key } res_tag2TP_cnt.each do |t, tp| if @options[:excluded].include?(t) next end p = (tag2prediction_cnt[t] > 0.0 ? tp.to_f / tag2prediction_cnt[t] : 0.0) r = (tag2outcome_cnt[t] > 0.0 ? tp.to_f / tag2outcome_cnt[t] : 0.0) f1 = (p+r > 0.0 ? 
2*p*r / (p+r) : 0.0) if !@options[:statistics] print_stat("Class %-6s" % [t], p, pretty_div(tp, tag2prediction_cnt[t]), r, pretty_div(tp, tag2outcome_cnt[t]), f1, "") end all_precision += p all_recall += r all_tp += tp all_f1 += f1 end all_rt = 0 tag2prediction_cnt.each do |tag, rt| all_rt += rt end all_gt = 0 tag2outcome_cnt.each do |tag, gt| all_gt += gt end included_labels_count = res_tag2TP_cnt.size() if included_labels_count == 1 @options[:accuracy] = true @options[:micro_avg] = true @options[:macro_avg] = true end if !@options[:accuracy] accuracy = all_tp.to_f / all_rt.to_f print_accuracy(accuracy, pretty_div(all_tp, all_rt)) end if !@options[:micro_avg] micro_avg_precision = all_tp.to_f / all_rt.to_f micro_avg_recall = all_tp.to_f / all_gt.to_f micro_avg_f1 = 2*micro_avg_precision*micro_avg_recall / (micro_avg_precision+micro_avg_recall) print_stat("Micro average", micro_avg_precision, pretty_div(all_tp, all_rt), micro_avg_recall, pretty_div(all_tp, all_gt), micro_avg_f1, "") end if !@options[:macro_avg] && tag2TP_cnt.size > 0 macro_avg_precision = all_precision / tag2TP_cnt.size macro_avg_recall = all_recall / tag2TP_cnt.size macro_avg_f1 = all_f1 / tag2TP_cnt.size print_stat("Macro average", macro_avg_precision, "", macro_avg_recall, "", macro_avg_f1, "") end herisvm-0.9.0/scripts/heri-stat-addons000077500000000000000000000077271350065126600200010ustar00rootroot00000000000000#!/usr/bin/env ruby # Copyright (c) 2015-2017 Aleksey Cheusov # # Permission is hereby granted, free of charge, to any person obtaining # a copy of this software and associated documentation files (the # "Software"), to deal in the Software without restriction, including # without limitation the rights to use, copy, modify, merge, publish, # distribute, sublicense, and/or sell copies of the Software, and to # permit persons to whom the Software is furnished to do so, subject to # the following conditions: # # The above copyright notice and this permission notice shall be # included in 
all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE # LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. # This is an internal herisvm script. It takes output "heri-stat -R" # on input and outputs maximum deviations require 'optparse' @options = {} OptionParser.new do |opts| opts.banner = < %-26s: %s %s %s" % [t, metrics, f, mean, stddev, maxdev] end values = hash2hash2array lines.each do |tokens| values [tokens[1]][tokens[0]] << tokens[2].to_f end mi = hash2hash ma = hash2hash mean = hash2hash max_deviation = hash2hash std_deviation = hash2hash values.each do |key1, hash| hash.each do |key2, arr| mi [key1] [key2] = arr.min ma [key1] [key2] = arr.max mean [key1] [key2] = arr.mean max_deviation [key1] [key2] = [arr.max - arr.mean, arr.mean - arr.min].max std_deviation [key1] [key2] = arr.standard_deviation end end #puts std_deviation.inspect #exit 0 TYPES = {"" => 1, "Macro average" => 1} mi.each do |f, h| pairs = [] max_deviation[f].each do |t, max_dev| pairs << [f, t] if max_dev && TYPES.include?(t) end max_deviation[f].each do |t, max_dev| pairs << [f, t] if max_dev && ! 
TYPES.include?(t)
  end

  pairs.each do |ft|
    # Scale to percent before formatting.
    max_dev = max_deviation[ft[0]][ft[1]] * 100
    std_dev = std_deviation[ft[0]][ft[1]] * 100
    mean_value = mean[ft[0]][ft[1]] * 100
    if @options[:raw]
      max_dev = "%#.3g" % [max_dev]
      std_dev = "%#.3g" % [std_dev]
      mean_value = "%#.3g" % [mean_value]
      print_value_raw(ft[1], "#{f}: mean", mean_value)
      print_value_raw(ft[1], "#{f}: maxdev", max_dev)
      print_value_raw(ft[1], "#{f}: stddev", std_dev)
    else
      max_dev = "%#6.3g" % [max_dev]
      std_dev = "%#6.3g" % [std_dev]
      mean_value = "%-#.3g" % [mean_value]
      print_value(ft[1], "mean, maxdev, stddev", f, mean_value, max_dev, std_dev)
    end
  end
  puts ''
end
herisvm-0.9.0/scripts/heri-stat-addons.pod000066400000000000000000000010231350065126600205360ustar00rootroot00000000000000
=head1 NAME

heri-stat-addons - calculates mean, maximum deviation and standard deviation

=head1 SYNOPSIS

B<heri-stat-addons> [OPTIONS] [I...]

=head1 DESCRIPTION

B<heri-stat-addons> takes files produced by B<heri-stat> on input and
calculates the mean, the maximum deviation and the standard deviation
of each metric.

=head1 OPTIONS

=over 6

=item B<-h, --help>

Display help information.

=item B<-R, --raw>

Raw tab-separated output.

=back

=head1 HOME

L

=head1 SEE ALSO

L L L
herisvm-0.9.0/scripts/heri-stat.pod000066400000000000000000000044461350065126600173040ustar00rootroot00000000000000
=head1 NAME

heri-stat - calculates precision, recall, F1 and some other things

=head1 SYNOPSIS

B<heri-stat> [-R] [-mrca] [-u label] [-t threshold] I I

B<heri-stat> -1 [-R] [-mrca] [-u label] [-t threshold] [I...]

B<heri-stat> -g mode [-R] [-xy] I I

B<heri-stat> -1 -g mode [-R] [I...]

B<heri-stat> [-h]

=head1 DESCRIPTION

The first and second types of B<heri-stat> invocation take a
classification dataset and predictions on input and calculate
precision, recall and F1. Unless option B<-1> is given, B<heri-stat>
reads correct classes from I (one class per line) and predicted
classes from I (one class per line). Lines in I may contain two
tokens: the first one is a predicted class, the second one is a score,
e.g. a probability.
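As a rough illustration of the classification mode described above, the following Ruby sketch computes per-class, micro-averaged and macro-averaged precision, recall and F1 from parallel lists of correct and predicted classes. It is not the heri-stat implementation: prf_stats is a hypothetical helper, the optional score token is simply dropped, and the macro average here is taken over every class seen in either list, which may differ from what heri-stat does.

```ruby
# Illustrative sketch only; prf_stats is a hypothetical helper, not heri-stat code.
def prf_stats(outcomes, predictions)
  raise ArgumentError, "inputs differ in length" unless outcomes.size == predictions.size

  tp = Hash.new(0)         # class => number of correct predictions
  predicted = Hash.new(0)  # class => number of times the class was predicted
  actual = Hash.new(0)     # class => number of times the class is correct

  outcomes.zip(predictions).each do |gold, guess|
    actual[gold] += 1
    predicted[guess] += 1
    tp[gold] += 1 if gold == guess
  end

  # Guarded helpers: return 0.0 instead of dividing by zero.
  ratio = lambda { |num, den| den > 0 ? num.to_f / den : 0.0 }
  f1 = lambda { |prec, rec| (prec + rec) > 0 ? 2 * prec * rec / (prec + rec) : 0.0 }

  per_class = {}
  (actual.keys | predicted.keys).sort.each do |c|
    prec = ratio.call(tp[c], predicted[c])
    rec = ratio.call(tp[c], actual[c])
    per_class[c] = { precision: prec, recall: rec, f1: f1.call(prec, rec) }
  end

  # Micro average: pool the counts of all classes, then divide once.
  all_tp = tp.values.reduce(0, :+)
  micro_p = ratio.call(all_tp, predicted.values.reduce(0, :+))
  micro_r = ratio.call(all_tp, actual.values.reduce(0, :+))

  # Macro average: unweighted mean of the per-class values.
  n = per_class.size
  macro = {}
  [:precision, :recall, :f1].each do |m|
    macro[m] = n > 0 ? per_class.values.map { |s| s[m] }.reduce(0.0, :+) / n : 0.0
  end

  { per_class: per_class,
    micro: { precision: micro_p, recall: micro_r, f1: f1.call(micro_p, micro_r) },
    macro: macro }
end

# A prediction line may carry a score after the class; keep the class only.
predictions = ["A 0.9", "B 0.6", "B 0.8", "B 0.7", "A 0.5"].map { |l| l.split.first }
stats = prf_stats(%w[A A B B B], predictions)
puts stats[:micro].inspect
puts stats[:macro].inspect
```

When every item has exactly one correct class and one predicted class, the pooled counts are equal, so micro-averaged precision and recall coincide with accuracy; the macro average weights rare and frequent classes equally.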
The third and fourth types of B<heri-stat> invocation take regression
outcomes and predictions on input (one value per line) and calculate
mean absolute error (MAE) and/or mean squared error (MSE).

If B<-1> is given, two or three tokens per line are expected on input:
the correct value (or class for classification), the predicted value
(or class), and an optional score.

=head1 OPTIONS

=over 6

=item B<-h, --help>

Display help information.

=item B<-R, --raw>

Raw tab-separated output.

=item B<-m, --micro-avg>

Disable micro-averaged P/R/F1 output.

=item B<-r, --macro-avg>

Disable macro-averaged P/R/F1 output.

=item B<-c, --per-class>

Disable output of per-class statistics.

=item B<-a, --accuracy>

Disable output of accuracy.

=item B<-1, --single>

2 or 3 tokens per line are expected on input.

=item B<-u, --unclassified> I