.\" agedu version 20211129.8cd63c5 .ie \n(.g .ds Aq \(aq .el .ds Aq ' .TH "agedu" "1" "2008\(hy11\(hy02" "Simon\ Tatham" "Simon\ Tatham" .SH "NAME" .PP \fBagedu\fP - correlate disk usage with last-access times to identify large and disused data .SH "SYNOPSIS" .PP .nf \fBagedu\fP\ [\ \fIoptions\fP\ ]\ \fIaction\fP\ [\fIaction\fP...] .fi .SH "DESCRIPTION" .PP \fBagedu\fP scans a directory tree and produces reports about how much disk space is used in each directory and subdirectory, and also how that usage of disk space corresponds to files with last-access times a long time ago. .PP In other words, \fBagedu\fP is a tool you might use to help you free up disk space. It lets you see which directories are taking up the most space, as \fBdu\fP does; but unlike \fBdu\fP, it also distinguishes between large collections of data which are still in use and ones which have not been accessed in months or years - for instance, large archives downloaded, unpacked, used once, and never cleaned up. Where \fBdu\fP helps you find what\*(Aqs using your disk space, \fBagedu\fP helps you find what\*(Aqs \fIwasting\fP your disk space. .PP \fBagedu\fP has several operating modes. In one mode, it scans your disk and builds an index file containing a data structure which allows it to efficiently retrieve any information it might need. Typically, you would use it in this mode first, and then run it in one of a number of `query' modes to display a report of the disk space usage of a particular directory and its subdirectories. Those reports can be produced as plain text (much like \fBdu\fP) or as HTML. \fBagedu\fP can even run as a miniature web server, presenting each directory\*(Aqs HTML report with hyperlinks to let you navigate around the file system to similar reports for other directories.
.PP So you would typically start using \fBagedu\fP by telling it to do a scan of a directory tree and build an index. This is done with a command such as .PP .nf $\ \fBagedu\ \-s\ /home/fred\fP .fi .PP which will build a large data file called \fBagedu.dat\fP in your current directory. (If that current directory is \fIinside\fP \fB/home/fred\fP, don\*(Aqt worry - \fBagedu\fP is smart enough to discount its own index file.) .PP Having built the index, you would now query it for reports of disk space usage. If you have a graphical web browser, the simplest and nicest way to query the index is by running \fBagedu\fP in web server mode: .PP .nf $\ \fBagedu\ \-w\fP .fi .PP which will print (among other messages) a URL on its standard output along the lines of .PP .nf URL:\ http://127.0.0.1:48638/ .fi .PP (That URL will always begin with `\fB127.\fP', meaning that it\*(Aqs in the \fBlocalhost\fP address space. So only processes running on the same computer can even try to connect to that web server, and also there is access control to prevent other users from seeing it - see below for more detail.) .PP Now paste that URL into your web browser, and you will be shown a graphical representation of the disk usage in \fB/home/fred\fP and its immediate subdirectories, with varying colours used to show the difference between disused and recently-accessed data. Click on any subdirectory to descend into it and see a report for its subdirectories in turn; click on parts of the pathname at the top of any page to return to higher-level directories. When you\*(Aqve finished browsing, you can just press Ctrl-D to send an end-of-file indication to \fBagedu\fP, and it will shut down. .PP After that, you probably want to delete the data file \fBagedu.dat\fP, since it\*(Aqs pretty large. 
In fact, the command \fBagedu -R\fP will do this for you; and you can chain \fBagedu\fP commands on the same command line, so that instead of the above you could have done .PP .nf $\ \fBagedu\ \-s\ /home/fred\ \-w\ \-R\fP .fi .PP for a single self-contained run of \fBagedu\fP which builds its index, serves web pages from it, and cleans it up when finished. .PP In some situations, you might want to scan the directory structure of one computer, but run \fBagedu\fP\*(Aqs user interface on another. In that case, you can do your scan using the \fBagedu -S\fP option in place of \fBagedu -s\fP, which will make \fBagedu\fP not bother building an index file but instead just write out its scan results in plain text on standard output; then you can funnel that output to the other machine using SSH (or whatever other technique you prefer), and there, run \fBagedu -L\fP to load in the textual dump and turn it into an index file. For example, you might run a command like this (plus any \fBssh\fP options you need) on the machine you want to scan: .PP .nf $\ \fBagedu\ \-S\ /home/fred\ |\ ssh\ indexing\-machine\ agedu\ \-L\fP .fi .PP or, equivalently, run something like this on the other machine: .PP .nf $\ \fBssh\ machine\-to\-scan\ agedu\ \-S\ /home/fred\ |\ agedu\ \-L\fP .fi .PP Either way, the \fBagedu -L\fP command will create an \fBagedu.dat\fP index file, which you can then use with \fBagedu -w\fP just as above. .PP (Another way to do this might be to build the index file on the first machine as normal, and then just copy it to the other machine once it's complete. However, for efficiency, the index file is formatted differently depending on the CPU architecture that \fBagedu\fP is compiled for. So if that doesn\*(Aqt match between the two machines - e.g. if one is a 32-bit machine and one 64-bit - then \fBagedu.dat\fP files written on one machine will not work on the other. The technique described above using \fB-S\fP and \fB-L\fP should work between any two machines.) 
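.PP If you have already built an index file and only later need it on an incompatible machine, you can likewise convert it through the dump format, using the \fBagedu -D\fP option described below. For example, with a hypothetical host name: .PP .nf $\ \fBagedu\ \-D\ |\ ssh\ other\-machine\ agedu\ \-L\fP .fi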
.PP If you don't have a graphical web browser, you can do text-based queries instead of using \fBagedu\fP\*(Aqs web interface. Having scanned \fB/home/fred\fP in any of the ways suggested above, you might run .PP .nf $\ \fBagedu\ \-t\ /home/fred\fP .fi .PP which again gives a summary of the disk usage in \fB/home/fred\fP and its immediate subdirectories; but this time \fBagedu\fP will print it on standard output, in much the same format as \fBdu\fP. If you then want to find out how much \fIold\fP data is there, you can add the \fB-a\fP option to show only files last accessed a certain length of time ago. For example, to show only files which haven\*(Aqt been looked at in six months or more: .PP .nf $\ \fBagedu\ \-t\ /home/fred\ \-a\ 6m\fP .fi .PP That's the essence of what \fBagedu\fP does. It has other modes of operation for more complex situations, and the usual array of configurable options. The following sections contain a complete reference for all its functionality. .SH "OPERATING MODES" .PP This section describes the operating modes supported by \fBagedu\fP. Each of these is in the form of a command-line option, sometimes with an argument. Multiple operating-mode options may appear on the command line, in which case \fBagedu\fP will perform the specified actions one after another. For instance, as shown in the previous section, you might want to perform a disk scan and immediately launch a web server giving reports from that scan. .IP "\fB-s\fP \fIdirectory\fP or \fB--scan\fP \fIdirectory\fP" In this mode, \fBagedu\fP scans the file system starting at the specified directory, and indexes the results of the scan into a large data file which other operating modes can query. .RS .PP By default, the scan is restricted to a single file system (since the expected use of \fBagedu\fP is that you would probably use it because a particular disk partition was running low on space). 
You can remove that restriction using the \fB--cross-fs\fP option; other configuration options allow you to include or exclude files or entire subdirectories from the scan. See the next section for full details of the configurable options. .PP The index file is created with restrictive permissions, in case the file system you are scanning contains confidential information in its structure. .PP Index files are dependent on the characteristics of the CPU architecture you created them on. You should not expect to be able to move an index file between different types of computer and have it continue to work. If you need to transfer the results of a disk scan to a different kind of computer, see the \fB-D\fP and \fB-L\fP options below. .RE .IP "\fB-w\fP or \fB--web\fP" In this mode, \fBagedu\fP expects to find an index file already written. It allocates a network port, and starts up a web server on that port which serves reports generated from the index file. By default it invents its own URL and prints it out. .RS .PP The web server runs until \fBagedu\fP receives an end-of-file event on its standard input. (The expected usage is that you run it from the command line, immediately browse web pages until you\*(Aqre satisfied, and then press Ctrl-D.) To disable the EOF behaviour, use the \fB--no-eof\fP option. .PP In case the index file contains any confidential information about your file system, the web server protects the pages it serves from access by other people. On Linux, this is done transparently by means of using \fB/proc/net/tcp\fP to check the owner of each incoming connection; failing that, the web server will require a password to view the reports, and \fBagedu\fP will print the password it invented on standard output along with the URL. 
.PP Configurable options for this mode let you specify your own address and port number to listen on, and also specify your own choice of authentication method (including turning authentication off completely) and a username and password of your choice. .RE .IP "\fB-t\fP \fIdirectory\fP or \fB--text\fP \fIdirectory\fP" In this mode, \fBagedu\fP generates a textual report on standard output, listing the disk usage in the specified directory and all its subdirectories down to a given depth. By default that depth is 1, so that you see a report for \fIdirectory\fP itself and all of its immediate subdirectories. You can configure a different depth (or no depth limit) using \fB-d\fP, described in the next section. .RS .PP Used on its own, \fB-t\fP merely lists the \fItotal\fP disk usage in each subdirectory; \fBagedu\fP\*(Aqs additional ability to distinguish unused from recently-used data is not activated. To activate it, use the \fB-a\fP option to specify a minimum age. .PP The directory structure stored in \fBagedu\fP\*(Aqs index file is treated as a set of literal strings. This means that you cannot refer to directories by synonyms. So if you ran \fBagedu -s .\fP, then all the path names you later pass to the \fB-t\fP option must be either `\fB.\fP' or begin with `\fB./\fP'. Similarly, symbolic links within the directory you scanned will not be followed; you must refer to each directory by its canonical, symlink-free pathname. .RE .IP "\fB-R\fP or \fB--remove\fP" In this mode, \fBagedu\fP deletes its index file. Running just \fBagedu -R\fP on its own is therefore equivalent to typing \fBrm agedu.dat\fP. However, you can also put \fB-R\fP on the end of a command line to indicate that \fBagedu\fP should delete its index file after it finishes performing other operations. 
.IP "\fB-S\fP \fIdirectory\fP or \fB--scan-dump\fP \fIdirectory\fP" In this mode, \fBagedu\fP will scan a directory tree and convert the results straight into a textual dump on standard output, without generating an index file at all. The dump data is intended for \fBagedu -L\fP to read. .IP "\fB-L\fP or \fB--load\fP" In this mode, \fBagedu\fP expects to read a dump produced by the \fB-S\fP option from its standard input. It constructs an index file from that dump, exactly as it would have if it had read the same data from a disk scan in \fB-s\fP mode. .IP "\fB-D\fP or \fB--dump\fP" In this mode, \fBagedu\fP reads an existing index file and produces a dump of its contents on standard output, in the same format used by \fB-S\fP and \fB-L\fP. This option could be used to convert an existing index file into a format acceptable to a different kind of computer, by dumping it using \fB-D\fP and then loading the dump back in on the other machine using \fB-L\fP. .RS .PP (The output of \fBagedu -D\fP on an existing index file will not be exactly \fIidentical\fP to what \fBagedu -S\fP would have originally produced, due to a difference in treatment of last-access times on directories. However, it should be effectively equivalent for most purposes. See the documentation of the \fB--dir-atime\fP option in the next section for further detail.) .RE .IP "\fB-H\fP \fIdirectory\fP or \fB--html\fP \fIdirectory\fP" In this mode, \fBagedu\fP will generate an HTML report of the disk usage in the specified directory and its immediate subdirectories, in the same form that it serves from its web server in \fB-w\fP mode. .RS .PP By default, a single HTML report will be generated and simply written to standard output, with no hyperlinks pointing to other similar pages. If you also specify the \fB-d\fP option (see below), \fBagedu\fP will instead write out a collection of HTML files with hyperlinks between them, and call the top-level file \fBindex.html\fP. 
.RE .IP "\fB--cgi\fP" In this mode, \fBagedu\fP will run as the bulk of a CGI script which provides the same set of web pages as the built-in web server would. It will read the usual CGI environment variables, and write CGI-style data to its standard output. .RS .PP The actual CGI program itself should be a tiny wrapper around \fBagedu\fP which passes it the \fB--cgi\fP option, and also (probably) \fB-f\fP to locate the index file. \fBagedu\fP will do everything else. For example, your script might read .PP .nf #!/bin/sh \fI/some/path/to/\fPagedu\ \-\-cgi\ \-f\ \fI/some/other/path/to/\fPagedu.dat .fi .PP (Note that \fBagedu\fP will produce the \fIentire\fP CGI output, including status code, HTTP headers and the full HTML document. If you try to surround the call to \fBagedu --cgi\fP with code that adds your own HTML header and footer, you won\*(Aqt get the results you want, and \fBagedu\fP\*(Aqs HTTP-level features such as auto-redirecting to canonical versions of URIs will stop working.) .PP No access control is performed in this mode: restricting access to CGI scripts is assumed to be the job of the web server. .RE .IP "\fB--presort\fP and \fB--postsort\fP" In these two modes, \fBagedu\fP will expect to read a textual data dump from its standard input of the form produced by \fB-S\fP (and \fB-D\fP). It will transform the data into a different version of its text dump format, and write the transformed version on standard output. .RS .PP The ordinary dump file format is reasonably readable, but loading it into an index file using \fBagedu -L\fP requires it to be sorted in a specific order, which is complicated to describe and difficult to implement using ordinary Unix sorting tools. So if you want to construct your own data dump from a source of your own that \fBagedu\fP itself doesn\*(Aqt know how to scan, you will need to make sure it\*(Aqs sorted in the right order. 
.PP To help with this, \fBagedu\fP provides a secondary dump format which is `sortable', in the sense that ordinary \fBsort\fP(\fI1\fP) without arguments will arrange it into the right order. However, the sortable format is much more unreadable and also twice the size, so you wouldn\*(Aqt want to write it directly! .PP So the recommended procedure is to generate dump data in the ordinary format; then pipe it through \fBagedu --presort\fP to turn it into the sortable format; then sort it; \fIthen\fP pipe it into \fBagedu -L\fP (which can accept either the normal or the sortable format as input). For example: .PP .nf \fIgenerate_custom_data.sh\fP\ |\ agedu\ \-\-presort\ |\ sort\ |\ agedu\ \-L .fi .PP If you need to transform the sorted dump file back into the ordinary format, \fBagedu --postsort\fP can do that. But since \fBagedu -L\fP can accept either format as input, you may not need to. .RE .IP "\fB-h\fP or \fB--help\fP" Causes \fBagedu\fP to print some help text and terminate immediately. .IP "\fB-V\fP or \fB--version\fP" Causes \fBagedu\fP to print its version number and terminate immediately. .SH "OPTIONS" .PP This section describes the various configuration options that affect \fBagedu\fP\*(Aqs operation in one mode or another. .PP The following option affects nearly all modes (except \fB-S\fP): .IP "\fB-f\fP \fIfilename\fP or \fB--file\fP \fIfilename\fP" Specifies the location of the index file which \fBagedu\fP creates, reads or removes depending on its operating mode. By default, this is simply `\fBagedu.dat\fP', in whatever is the current working directory when you run \fBagedu\fP. .PP The following options affect the disk-scanning modes, \fB-s\fP and \fB-S\fP: .IP "\fB--cross-fs\fP and \fB--no-cross-fs\fP" These configure whether or not the disk scan is permitted to cross between different file systems. The default is not to: \fBagedu\fP will normally skip over subdirectories on which a different file system is mounted. 
This makes it convenient when you want to free up space on a particular file system which is running low. However, in other circumstances you might wish to see general information about the use of space no matter which file system it\*(Aqs on (for instance, if your real concern is your backup media running out of space, and if your backups do not treat different file systems specially); in that situation, use \fB--cross-fs\fP. .RS .PP (Note that this default is the opposite way round from the corresponding option in \fBdu\fP.) .RE .IP "\fB--prune\fP \fIwildcard\fP and \fB--prune-path\fP \fIwildcard\fP" These cause particular files or directories to be omitted entirely from the scan. If \fBagedu\fP\*(Aqs scan encounters a file or directory whose name matches the wildcard provided to the \fB--prune\fP option, it will not include that file in its index, and also if it\*(Aqs a directory it will skip over it and not scan its contents. .RS .PP Note that in most Unix shells, wildcards will probably need to be escaped on the command line, to prevent the shell from expanding the wildcard before \fBagedu\fP sees it. .PP \fB--prune-path\fP is similar to \fB--prune\fP, except that the wildcard is matched against the entire pathname instead of just the filename at the end of it. So whereas \fB--prune *a*b*\fP will match any file whose actual name contains an \fBa\fP somewhere before a \fBb\fP, \fB--prune-path *a*b*\fP will also match a file whose name contains \fBb\fP and which is inside a directory containing an \fBa\fP, or any file inside a directory of that form, and so on. .RE .IP "\fB--exclude\fP \fIwildcard\fP and \fB--exclude-path\fP \fIwildcard\fP" These cause particular files or directories to be omitted from the index, but not from the scan. 
If \fBagedu\fP\*(Aqs scan encounters a file or directory whose name matches the wildcard provided to the \fB--exclude\fP option, it will not include that file in its index - but unlike \fB--prune\fP, if the file in question is a directory it will still scan its contents and index them if they are not ruled out themselves by \fB--exclude\fP options. .RS .PP As above, \fB--exclude-path\fP is similar to \fB--exclude\fP, except that the wildcard is matched against the entire pathname. .RE .IP "\fB--include\fP \fIwildcard\fP and \fB--include-path\fP \fIwildcard\fP" These cause particular files or directories to be re-included in the index and the scan, if they had previously been ruled out by one of the above exclude or prune options. You can interleave include, exclude and prune options as you wish on the command line, and if more than one of them applies to a file then the last one takes priority. .RS .PP For example, if you wanted to see only the disk space taken up by MP3 files, you might run .PP .nf $\ \fBagedu\ \-s\ .\ \-\-exclude\ \*(Aq*\*(Aq\ \-\-include\ \*(Aq*.mp3\*(Aq\fP .fi .PP which will cause everything to be omitted from the scan, but then the MP3 files to be put back in. If you then wanted only a subset of those MP3s, you could then exclude some of them again by adding, say, `\fB--exclude-path \*(Aq./queen/*\*(Aq\fP' (or, more efficiently, `\fB--prune ./queen\fP') on the end of that command. .PP As with the previous two options, \fB--include-path\fP is similar to \fB--include\fP except that the wildcard is matched against the entire pathname. .RE .IP "\fB--progress\fP, \fB--no-progress\fP and \fB--tty-progress\fP" When \fBagedu\fP is scanning a directory tree, it will typically print a one-line progress report every second showing where it has reached in the scan, so you can have some idea of how much longer it will take. 
(Of course, it can\*(Aqt predict \fIexactly\fP how long it will take, since it doesn\*(Aqt know which of the directories it hasn\*(Aqt scanned yet will turn out to be huge.) .RS .PP By default, those progress reports are displayed on \fBagedu\fP\*(Aqs standard error channel, if that channel points to a terminal device. If you need to manually enable or disable them, you can use the above three options to do so: \fB--progress\fP unconditionally enables the progress reports, \fB--no-progress\fP unconditionally disables them, and \fB--tty-progress\fP reverts to the default behaviour which is conditional on standard error being a terminal. .RE .IP "\fB--dir-atime\fP and \fB--no-dir-atime\fP" In normal operation, \fBagedu\fP ignores the atimes (last access times) on the \fIdirectories\fP it scans: it only pays attention to the atimes of the \fIfiles\fP inside those directories. This is because directory atimes tend to be reset by a lot of system administrative tasks, such as \fBcron\fP jobs which scan the file system for one reason or another - or even other invocations of \fBagedu\fP itself, though it tries to avoid modifying any atimes if possible. So the literal atimes on directories are typically not representative of how long ago the data in question was last accessed with real intent to use that data in particular. .RS .PP Instead, \fBagedu\fP makes up a fake atime for every directory it scans, which is equal to the newest atime of any file in or below that directory (or the directory\*(Aqs last \fImodification\fP time, whichever is newest). This is based on the assumption that all \fIimportant\fP accesses to directories are actually accesses to the files inside those directories, so that when any file is accessed all the directories on the path leading to it should be considered to have been accessed as well. .PP In unusual cases it is possible that a directory itself might embody important data which is accessed by reading the directory. 
In that situation, \fBagedu\fP\*(Aqs atime-faking policy will misreport the directory as disused. In the unlikely event that such directories form a significant part of your disk space usage, you might want to turn off the faking. The \fB--dir-atime\fP option does this: it causes the disk scan to read the original atimes of the directories it scans. .PP The faking of atimes on directories also requires a processing pass over the index file after the main disk scan is complete. \fB--dir-atime\fP also turns this pass off. Hence, this option affects the \fB-L\fP option as well as \fB-s\fP and \fB-S\fP. .PP (The previous section mentioned that there might be subtle differences between the output of \fBagedu -s /path -D\fP and \fBagedu -S /path\fP. This is why. Doing a scan with \fB-s\fP and then dumping it with \fB-D\fP will dump the fully faked atimes on the directories, whereas doing a scan-to-dump with \fB-S\fP will dump only \fIpartially\fP faked atimes - specifically, each directory\*(Aqs last modification time - since the subsequent processing pass will not have had a chance to take place. However, loading either of the resulting dump files with \fB-L\fP will perform the atime-faking processing pass, leading to the same data in the index file in each case. In normal usage it should be safe to ignore all of this complexity.) .RE .IP "\fB--mtime\fP" This option causes \fBagedu\fP to index files by their last modification time instead of their last access time. You might want to use this if your last access times were completely useless for some reason: for example, if you had recently searched every file on your system, the system would have lost all the information about what files you hadn\*(Aqt recently accessed before then. 
Using this option is liable to be less effective at finding genuinely wasted space than the normal mode (that is, it will be more likely to flag things as disused when they\*(Aqre not, so you will have more candidates to go through by hand looking for data you don\*(Aqt need), but may be better than nothing if your last-access times are unhelpful. .RS .PP Another use for this mode might be to find \fIrecently created\fP large data. If your disk has been gradually filling up for years, the default mode of \fBagedu\fP will let you find unused data to delete; but if you know your disk had plenty of space recently and now it\*(Aqs suddenly full, and you suspect that some rogue program has left a large core dump or output file, then \fBagedu --mtime\fP might be a convenient way to locate the culprit. .RE .IP "\fB--logicalsize\fP" This option causes \fBagedu\fP to consider the size of each file to be its `logical' size, rather than the amount of space it consumes on disk. (That is, it will use \fBst_size\fP instead of \fBst_blocks\fP in the data returned from \fBstat\fP(\fI2\fP).) This option makes \fBagedu\fP less accurate at reporting how much of your disk is used, but it might be useful in specialist cases, such as working around a file system that is misreporting physical sizes. .RS .PP For most files, the physical size of a file will be larger than the logical size, reflecting the fact that filesystem layouts generally allocate a whole number of blocks of the disk to each file, so some space is wasted at the end of the last block. So counting only the logical file size will typically cause under-reporting of the disk usage (perhaps \fIlarge\fP under-reporting in the case of a very large number of very small files). .PP On the other hand, sometimes a file with a very large logical size can have `holes' where no data is actually stored, in which case using the logical size of the file will \fIover\fP-report its disk usage. 
So the use of logical sizes can give wrong answers in both directions. .RE .PP The following option affects all the modes that generate reports: the web server mode \fB-w\fP, the stand-alone HTML generation mode \fB-H\fP and the text report mode \fB-t\fP. .IP "\fB--files\fP" This option causes \fBagedu\fP\*(Aqs reports to list the individual files in each directory, instead of just giving a combined report for everything that\*(Aqs not in a subdirectory. .PP The following option affects the text report mode \fB-t\fP. .IP "\fB-a\fP \fIage\fP or \fB--age\fP \fIage\fP" This option tells \fBagedu\fP to report only files of at least the specified age. An age is specified as a number, followed by one of `\fBy\fP' (years), `\fBm\fP' (months), `\fBw\fP' (weeks) or `\fBd\fP' (days). (This syntax is also used by the \fB-r\fP option.) For example, \fB-a 6m\fP will produce a text report which includes only files at least six months old. .PP The following options affect the stand-alone HTML generation mode \fB-H\fP and the text report mode \fB-t\fP. .IP "\fB-d\fP \fIdepth\fP or \fB--depth\fP \fIdepth\fP" This option controls the maximum depth to which \fBagedu\fP recurses when generating a text or HTML report. .RS .PP In text mode, the default is 1, meaning that the report will include the directory given on the command line and all of its immediate subdirectories. A depth of two includes another level below that, and so on; a depth of zero means \fIonly\fP the directory on the command line. .PP In HTML mode, specifying this option switches \fBagedu\fP from writing out a single HTML file to writing out multiple files which link to each other. A depth of 1 means \fBagedu\fP will write out an HTML file for the given directory and also one for each of its immediate subdirectories. .PP If you want \fBagedu\fP to recurse as deeply as possible, give the special word `\fBmax\fP' as an argument to \fB-d\fP. 
.RE .IP "\fB-o\fP \fIfilename\fP or \fB--output\fP \fIfilename\fP" This option is used to specify an output file for \fBagedu\fP to write its report to. In text mode or single-file HTML mode, the argument is treated as the name of a file. In multiple-file HTML mode, the argument is treated as the name of a directory: the directory will be created if it does not already exist, and the output HTML files will be created inside it. .PP The following option affects only the stand-alone HTML generation mode \fB-H\fP, and even then, only in recursive mode (with \fB-d\fP): .IP "\fB--numeric\fP" This option tells \fBagedu\fP to name most of its output HTML files numerically. The root of the whole output file collection will still be called \fBindex.html\fP, but all the rest will have names like \fB73.html\fP or \fB12525.html\fP. (The numbers are essentially arbitrary; in fact, they\*(Aqre indices of nodes in the data structure used by \fBagedu\fP\*(Aqs index file.) .RS .PP This system of file naming is less intuitive than the default of naming files after the sub-pathname they index. It's also less stable: the same pathname will not necessarily be represented by the same filename if \fBagedu -H\fP is re-run after another scan of the same directory tree. However, it does have the virtue that it keeps the filenames \fIshort\fP, so that even if your directory tree is very deep, the output HTML files won\*(Aqt exceed any OS limit on filename length. .RE .PP The following options affect the web server mode \fB-w\fP, and in some cases also the stand-alone HTML generation mode \fB-H\fP: .IP "\fB-r\fP \fIage range\fP or \fB--age-range\fP \fIage range\fP" The HTML reports produced by \fBagedu\fP use a range of colours to indicate how long ago data was last accessed, running from red (representing the most disused data) to green (representing the newest). 
By default, the lengths of time represented by the two ends of that spectrum are chosen by examining the data file to see what range of ages appears in it. However, you might want to set your own limits, and you can do this using \fB-r\fP. .RS .PP The argument to \fB-r\fP consists of a single age, or two ages separated by a minus sign. An age is a number, followed by one of `\fBy\fP' (years), `\fBm\fP' (months), `\fBw\fP' (weeks) or `\fBd\fP' (days). (This syntax is also used by the \fB-a\fP option.) The first age in the range represents the oldest data, and will be coloured red in the HTML; the second age represents the newest, coloured green. If the second age is not specified, it will default to zero (so that green means data which has been accessed \fIjust now\fP). .PP For example, \fB-r 2y\fP will mark data in red if it has been unused for two years or more, and green if it has been accessed just now. \fB-r 2y-3m\fP will similarly mark data red if it has been unused for two years or more, but will mark it green if it has been accessed three months ago or later. .RE .IP "\fB--address\fP \fIaddr\fP[\fB:\fP\fIport\fP]" Specifies the network address and port number on which \fBagedu\fP should listen when running its web server. If you want \fBagedu\fP to listen for connections coming in from any source, specify the address as the special value \fBANY\fP. If the port number is omitted, an arbitrary unused port will be chosen for you and displayed. .RS .PP If you specify this option, \fBagedu\fP will not print its URL on standard output (since you are expected to know what address you told it to listen to). .RE .IP "\fB--auth\fP \fIauth-type\fP" Specifies how \fBagedu\fP should control access to the web pages it serves. The options are as follows: .RS .IP "\fBmagic\fP" This option only works on Linux, and only when the incoming connection is from the same machine that \fBagedu\fP is running on. 
On Linux, the special file
\fB/proc/net/tcp\fP
contains a list of network connections currently known to the
operating system kernel, including which user id created them. So
\fBagedu\fP
will look up each incoming connection in that file, and allow access
if it comes from the same user id under which
\fBagedu\fP
itself is running. Therefore, in
\fBagedu\fP\*(Aqs
normal web server mode, you can safely run it on a multi-user machine
and no other user will be able to read data out of your index file.
.IP "\fBbasic\fP"
In this mode,
\fBagedu\fP
will use HTTP Basic authentication: the user will have to provide a
username and password via their browser.
\fBagedu\fP
will normally make up a username and password for the purpose, but
you can specify your own; see below.
.IP "\fBnone\fP"
In this mode, the web server is unauthenticated: anyone connecting to
it has full access to the reports generated by
\fBagedu\fP.
Do not do this unless there is nothing confidential at all in your
index file, or unless you are certain that nobody but you can run
processes on your computer.
.IP "\fBdefault\fP"
This is the default mode if you do not specify one of the above. In
this mode,
\fBagedu\fP
will attempt to use Linux magic authentication, but if it detects at
startup time that
\fB/proc/net/tcp\fP
is absent or non-functional then it will fall back to using HTTP
Basic authentication and invent a user name and password.
.RE
.IP "\fB--auth-file\fP \fIfilename\fP or \fB--auth-fd\fP \fIfd\fP"
When
\fBagedu\fP
is using HTTP Basic authentication, these options allow you to
specify your own user name and password. If you specify
\fB--auth-file\fP,
these will be read from the specified file; if you specify
\fB--auth-fd\fP
they will instead be read from a given file descriptor which you
should have arranged to pass to
\fBagedu\fP.
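A file in the form
\fBagedu\fP
expects (the exact format is described just below) can be created
with
\fBprintf\fP,
which, unlike
\fBecho\fP,
appends no trailing newline. The username `fred' and password
`s3cret' here are example values:

```shell
# Write a username:password pair for agedu's Basic authentication.
# printf '%s' emits exactly its argument, so the password ends
# precisely at end of file with no trailing newline.
printf '%s' 'fred:s3cret' > agedu.auth

wc -c < agedu.auth   # prints 11: no newline was added
```

The file can then be used with a command such as
\fBagedu -w --auth basic --auth-file agedu.auth\fP.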
In either case, the authentication details should consist of the
username, followed by a colon, followed by the password, followed
\fIimmediately\fP
by end of file (no trailing newline, or else it will be considered
part of the password).
.IP "\fB--title\fP \fItitle\fP"
Specify the string that appears at the start of the
\fB<title>\fP
section of the output HTML pages. The default is
`\fBagedu\fP'.
This title is followed by a colon and then the path you\*(Aqre viewing
within the index file. You might use this option if you were serving
\fBagedu\fP
reports for several different servers and wanted to make it clearer
which one a user was looking at.
.IP "\fB--launch\fP \fIshell-command\fP"
Specify a command to be run with the base URL of the web user
interface, once the web server has started up. The command will be
interpreted by
\fB/bin/sh\fP,
and the base URL will be appended to it as an extra argument word.
.RS
.PP
A typical use for this would be
`\fB--launch=browse\fP',
which uses the XDG
`\fBbrowse\fP'
command to automatically open the
\fBagedu\fP
web interface in your default browser. However, other uses are
possible: for example, you could provide a command which communicates
the URL to some other software that will use it for something.
.RE
.IP "\fB--no-eof\fP"
Stop
\fBagedu\fP
in web server mode from looking for end-of-file on standard input and
treating it as a signal to terminate.
.SH "LIMITATIONS"
.PP
The data file is pretty large. The core of
\fBagedu\fP
is the tree-based data structure it uses in its index in order to
efficiently perform the queries it needs; this data structure
requires
\fBO(N log N)\fP
storage. This is larger than you might expect; a scan of my own home
directory, containing half a million files and directories and about
20Gb of data, produced an index file over 60Mb in size.
Furthermore, since the data file must be memory-mapped during most
processing, it can never grow larger than available address space, so
a
\fIreally\fP
big filesystem may need to be indexed on a 64-bit computer. (This is
one reason for the existence of the
\fB-D\fP
and
\fB-L\fP
options: you can do the scanning on the machine with access to the
filesystem, and the indexing on a machine big enough to handle it.)
.PP
The data structure also does not usefully permit access control
within the data file, so it would be difficult - even given the
willingness to do additional coding - to run a system-wide
\fBagedu\fP
scan on a
\fBcron\fP
job and serve the right subset of reports to each user.
.PP
In certain circumstances,
\fBagedu\fP
can report false positives (reporting files as disused which are in
fact in use) as well as the more benign false negatives (reporting
files as in use which are not). This arises when a file is,
semantically speaking, `read' without actually being physically
\fIread\fP.
Typically this occurs when a program checks whether the file\*(Aqs
mtime has changed and only bothers re-reading it if it has; programs
which do this include
\fBrsync\fP(\fI1\fP)
and
\fBmake\fP(\fI1\fP).
Such programs will fail to update the atime of unmodified files
despite depending on their continued existence; a directory full of
such files will be reported as disused by
\fBagedu\fP
even in situations where deleting them will cause trouble.
.PP
Finally, of course,
\fBagedu\fP\*(Aqs
normal usage mode depends critically on the OS providing last-access
times which are at least approximately right. So a file system
mounted with Linux\*(Aqs
`\fBnoatime\fP'
option, or the equivalent on any other OS, will not give useful
results!
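You can check directly whether last-access times are being recorded
on a filesystem. The following sketch uses the GNU coreutils
\fBstat\fP
command (the scratch filename is an example, and this is an
illustration rather than part of
\fBagedu\fP
itself):

```shell
# Create a scratch file, then read back its last-access time - the
# field agedu's reports are based on. On a mount where atimes are
# working, a freshly created file's atime matches the current time;
# on a noatime mount, later reads would leave it frozen.
printf 'scratch\n' > atime-check.tmp
stat -c '%X' atime-check.tmp    # atime, as seconds since the epoch
date +%s                        # current time, for comparison
rm -f atime-check.tmp
```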
(However, the Linux mount option
`\fBrelatime\fP',
which distributions now tend to use by default, should be fine for
all but specialist purposes: it reduces the accuracy of last-access
times so that they might be wrong by up to 24 hours, but if you\*(Aqre
looking for files that have been unused for months or years,
that\*(Aqs not a problem.)
.SH "LICENCE"
.PP
\fBagedu\fP
is free software, distributed under the MIT licence. Type
\fBagedu --licence\fP
to see the full licence text.

agedu-20211129.8cd63c5/winscan.c

#include <windows.h>
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>            /* malloc, realloc, free, qsort */
#include <string.h>

#define lenof(x) (sizeof((x))/sizeof(*(x)))

#define snew(type) \
    ( (type *) malloc (sizeof (type)) )
#define snewn(number, type) \
    ( (type *) malloc ((number) * sizeof (type)) )
#define sresize(array, number, type) \
    ( (void)sizeof((array)-(type *)0), \
      (type *) realloc ((array), (number) * sizeof (type)) )
#define sfree free

char *dupstr(const char *s)
{
    char *r = malloc(1+strlen(s));
    strcpy(r,s);
    return r;
}

typedef struct {
    HANDLE hdl;
    WIN32_FIND_DATA fdata;
    bool got_one, eod;
} dirhandle;

int open_dir(char *path, dirhandle *dh)
{
    strcat(path, "\\*");
    dh->hdl = FindFirstFile(path, &dh->fdata);

    if (dh->hdl == INVALID_HANDLE_VALUE) {
        int err = GetLastError();
        if (err == ERROR_FILE_NOT_FOUND) {
            dh->eod = true;
            dh->got_one = false;
            return 0;
        } else {
            return -err;
        }
    } else {
        dh->eod = false;
        dh->got_one = true;
        return 0;
    }
}

const char *read_dir(dirhandle *dh)
{
    if
(!dh->got_one) {
        if (dh->eod)
            return NULL;

        if (FindNextFile(dh->hdl, &dh->fdata)) {
            dh->got_one = true;
        } else {
            dh->eod = true;
            return NULL;
        }
    }

    dh->got_one = false;
    return dh->fdata.cFileName;
}

void close_dir(dirhandle *dh)
{
    CloseHandle(dh->hdl);
}

static int str_cmp(const void *av, const void *bv)
{
    return strcmp(*(const char **)av, *(const char **)bv);
}

typedef int (*gotdata_fn_t)(void *ctx, const char *pathname,
                            WIN32_FIND_DATA *dat);

static void format_error(char *buf, size_t size, int err)
{
    FormatMessage(FORMAT_MESSAGE_FROM_SYSTEM, NULL, err, 0,
                  buf, size, NULL);
    buf[size-1] = '\0';
    if (buf[strlen(buf)-1] == '\n')
        buf[strlen(buf)-1] = '\0';
}

static void du_recurse(char **path, size_t pathlen, size_t *pathsize,
                       gotdata_fn_t gotdata, void *gotdata_ctx)
{
    const char *name;
    dirhandle d;
    char **names;
    int error;
    size_t i, nnames, namesize;
    WIN32_FIND_DATA dat;
    HANDLE h;

    if (*pathsize <= pathlen + 10) {
        *pathsize = pathlen * 3 / 2 + 256;
        *path = sresize(*path, *pathsize, char);
    }

    h = FindFirstFile(*path, &dat);
    if (h != INVALID_HANDLE_VALUE) {
        /* The normal case: we retrieved information about the file in
         * 'dat', and we also got a handle we can pass to
         * FindNextFile. We don't need the latter, so close it
         * immediately. */
        CloseHandle(h);
    } else if (pathlen > 0 && (*path)[pathlen-1] == '\\') {
        /* Special case: we're scanning a subdirectory such as c:\ .
         * In that case, take it to have zero size, and resort to
         * GetFileTime to find its times.
         */
        dat.nFileSizeHigh = dat.nFileSizeLow = 0;
        dat.ftLastWriteTime.dwHighDateTime = 0x19DB1DE;
        dat.ftLastWriteTime.dwLowDateTime = 0xD53E8000;
        dat.ftLastAccessTime.dwHighDateTime = 0x19DB1DE;
        dat.ftLastAccessTime.dwLowDateTime = 0xD53E8000;
        h = CreateFile(*path, GENERIC_READ, FILE_SHARE_WRITE, NULL,
                       OPEN_EXISTING, 0, NULL);
        if (h != INVALID_HANDLE_VALUE) {
            GetFileTime(h, &dat.ftCreationTime, &dat.ftLastAccessTime,
                        &dat.ftLastWriteTime);
            CloseHandle(h);
        }
    } else {
        /* Total failure: FindFirstFile returned INVALID_HANDLE_VALUE
         * and it _wasn't_ because the file is a special directory. */
        char buf[4096];
        format_error(buf, lenof(buf), GetLastError());
        fprintf(stderr, "Unable to scan '%s': %s\n", *path, buf);
        return;
    }

    if (!gotdata(gotdata_ctx, *path, &dat))
        return;

    if (!(GetFileAttributes(*path) & FILE_ATTRIBUTE_DIRECTORY))
        return;

    names = NULL;
    nnames = namesize = 0;

    if ((error = open_dir(*path, &d)) < 0) {
        char buf[4096];
        format_error(buf, lenof(buf), -error);
        fprintf(stderr, "Unable to open directory '%s': %s\n", *path, buf);
        return;
    }
    while ((name = read_dir(&d)) != NULL) {
        if (name[0] == '.' && (!name[1] || (name[1] == '.' && !name[2]))) {
            /* do nothing - we skip "." and ".." */
        } else {
            if (nnames >= namesize) {
                namesize = nnames * 3 / 2 + 64;
                names = sresize(names, namesize, char *);
            }
            names[nnames++] = dupstr(name);
        }
    }
    close_dir(&d);

    if (nnames == 0)
        return;

    qsort(names, nnames, sizeof(*names), str_cmp);

    for (i = 0; i < nnames; i++) {
        size_t newpathlen = pathlen + 1 + strlen(names[i]);
        if (*pathsize <= newpathlen) {
            *pathsize = newpathlen * 3 / 2 + 256;
            *path = sresize(*path, *pathsize, char);
        }
        /*
         * Avoid duplicating a slash if we got a trailing one to
         * begin with (i.e. if we're starting the scan in '\\' itself).
         */
        if (pathlen > 0 && (*path)[pathlen-1] == '\\') {
            strcpy(*path + pathlen, names[i]);
            newpathlen--;
        } else {
            sprintf(*path + pathlen, "\\%s", names[i]);
        }

        du_recurse(path, newpathlen, pathsize, gotdata, gotdata_ctx);

        sfree(names[i]);
    }
    sfree(names);
}

void du(const char *inpath, gotdata_fn_t gotdata, void *gotdata_ctx)
{
    char *path;
    size_t pathlen, pathsize;

    pathlen = strlen(inpath);
    pathsize = pathlen + 256;
    path = snewn(pathsize, char);
    strcpy(path, inpath);

    du_recurse(&path, pathlen, &pathsize, gotdata, gotdata_ctx);
}

int gd(void *ctx, const char *pathname, WIN32_FIND_DATA *dat)
{
    unsigned long long size, t;
    FILETIME ft;
    const char *p;

    size = dat->nFileSizeHigh;
    size = (size << 32) | dat->nFileSizeLow;
    printf("%llu ", size);

    if (dat->dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
        ft = dat->ftLastWriteTime;
    else
        ft = dat->ftLastAccessTime;
    t = ft.dwHighDateTime;
    t = (t << 32) | ft.dwLowDateTime;
    t /= 10000000;
    /*
     * Correction factor: number of seconds between Windows's file
     * time epoch of 1 Jan 1601 and Unix's time_t epoch of 1 Jan 1970.
     *
     * That's 369 years, of which 92 were divisible by 4, but
     * three of those were century points.
     */
    t -= (369 * 365 + 92 - 3) * (unsigned long long)86400;
    printf("%llu ", t);

    for (p = pathname; *p; p++) {
        if (*p >= ' ' && *p < 127 && *p != '%')
            putchar(*p);
        else
            printf("%%%02x", (unsigned char)*p);
    }
    putchar('\n');

    return 1;
}

int main(int argc, char **argv)
{
    char *dir;
    int dirlen, dirsize;

    if (argc > 1)
        SetCurrentDirectory(argv[1]);

    dirsize = 512;
    dir = snewn(dirsize, char);
    while ((dirlen = GetCurrentDirectory(dirsize, dir)) >= dirsize) {
        dirsize = dirlen + 256;
        dir = sresize(dir, dirsize, char);
    }

    printf("agedu dump file. pathsep=5c\n");

    du(dir, gd, NULL);
}

agedu-20211129.8cd63c5/version.h

/*
 * Header file that defines AGEDU_VERSION to a string that goes in
 * 'agedu --version', in the distribution tarball.
 *
 * This empty copy of it, in source control, exists so that #including
 * it won't fail when we're not building from the tarball.
 */
#define AGEDU_VERSION "20211129.8cd63c5"

agedu-20211129.8cd63c5/trie.h

/*
 * trie.h: functions to build and search a trie-like structure
 * mapping pathnames to file records.
 */

#include <sys/types.h>

/*
 * An entry in the trie file describing an actual file.
 */
struct trie_file {
    unsigned long long size;
    unsigned long long atime;
};

/* ----------------------------------------------------------------------
 * Functions which can be passed a list of pathnames and file data
 * in strict order, and will build up a trie and write it out to a
 * file.
 */

typedef struct triebuild triebuild;

/*
 * Create a new trie-building context given a writable file
 * descriptor, which should point to the start of an empty file.
 */
triebuild *triebuild_new(int fd);

/*
 * Add a trie_file record to the trie. The pathnames should appear
 * in trie collation order (i.e. strict ASCII sorting order except
 * that / is moved so that it sorts before any other non-NUL
 * character), on penalty of assertion failure.
 */
void triebuild_add(triebuild *tb, const char *pathname,
                   const struct trie_file *file);

/*
 * Put the finishing touches to the contents of the output file
 * once all the trie_file records are present. Returns the total
 * number of elements in the trie.
 */
int triebuild_finish(triebuild *tb);

/*
 * Free the context. (Does not close the file, but may leave the
 * file pointer in an arbitrary place.)
 */
void triebuild_free(triebuild *tb);

/* ----------------------------------------------------------------------
 * Anomalous function which modifies a trie after it's memory-mapped.
 */

/*
 * Invent new fake atimes for each directory in the trie, by
 * taking the maximum (latest) of the directory's previously
 * stored atime and the atimes of everything below it.
 */
void trie_fake_dir_atimes(void *t);

/* ----------------------------------------------------------------------
 * Functions to query a trie given a pointer to the start of the
 * memory-mapped file.
 */

/*
 * Check the magic numbers at the start of the file. This should also
 * verify that the file was built on a platform whose structure layout
 * matches that of the agedu reading it. Returns nonzero on successful
 * match, zero on mismatch.
 */
int trie_check_magic(const void *t);

/*
 * Return the path separator character in use in the trie.
 */
char trie_pathsep(const void *t);

/*
 * Return the length of the longest pathname stored in the trie,
 * including its trailing NUL.
 */
size_t trie_maxpathlen(const void *t);

/*
 * Determine the total number of entries in the trie which sort
 * strictly before the given pathname (irrespective of whether the
 * pathname actually exists in the trie).
 */
unsigned long trie_before(const void *t, const char *pathname);

/*
 * Return the pathname for the nth entry in the trie, written into
 * the supplied buffer (which must be large enough, i.e. at least
 * trie_maxpathlen(t) bytes).
 */
void trie_getpath(const void *t, unsigned long n, char *buf);

/*
 * Return the trie_file * for the nth entry in the trie.
 */
const struct trie_file *trie_getfile(const void *t, unsigned long n);

/*
 * Return the total number of entries in the whole trie.
 */
unsigned long trie_count(const void *t);

/*
 * Enumerate all the trie_file records in the trie, in sequence,
 * and return pointers to their trie_file structures. Returns NULL
 * at end of list, naturally.
 *
 * Optionally returns the pathnames as well: if "buf" is non-NULL
 * then it is expected to point to a buffer large enough to store
 * all the pathnames in the trie (i.e. at least trie_maxpathlen(t)
 * bytes). This buffer is also expected to remain unchanged
 * between calls to triewalk_next(), or else the returned
 * pathnames will be corrupted.
 */
typedef struct triewalk triewalk;
triewalk *triewalk_new(const void *t);
const struct trie_file *triewalk_next(triewalk *tw, char *buf);
void triewalk_rebase(triewalk *tw, const void *t);
void triewalk_free(triewalk *tw);

/* ----------------------------------------------------------------------
 * The trie file also contains an index managed by index.h, so we
 * must be able to ask about that too.
 */
void trie_set_index_offset(void *t, off_t ptr);
off_t trie_get_index_offset(const void *t);

/* ----------------------------------------------------------------------
 * Utility functions not directly involved with the trie.
 */

/*
 * Given a pathname in a buffer, adjust the pathname in place so
 * that it points at a string which, when passed to trie_before,
 * will reliably return the index of the thing that comes after
 * that pathname and all its descendants. Usually this is done by
 * suffixing ^A (since foo^A is guaranteeably the first thing that
 * sorts after foo and foo/stuff); however, if the pathname
 * actually ends in a path separator (e.g. if it's just "/"), that
 * must be stripped off first.
 */
void make_successor(char *pathbuf);

/*
 * String comparison function consistent with the trie's sort order.
 * Requires the current path separator as a parameter.
 */
int triecmp(const char *a, const char *b, int *offset, unsigned char pathsep);

agedu-20211129.8cd63c5/trie.c

/*
 * trie.c: implementation of trie.h.
 */

#include "agedu.h"
#include "alloc.h"
#include "trie.h"

#define alignof(typ) ( offsetof(struct { char c; typ t; }, t) )

/*
 * Compare functions for pathnames. Returns the relative order of
 * the names, like strcmp; also passes back the offset of the
 * first differing character if desired.
 */
static int trieccmp(unsigned char a, unsigned char b, unsigned char pathsep)
{
    int ia = (a == '\0' ? '\0' : a == pathsep ? '\1' : a+1);
    int ib = (b == '\0' ?
'\0' : b == pathsep ? '\1' : b+1);
    return ia - ib;
}

static int triencmp(const char *a, size_t alen,
                    const char *b, size_t blen,
                    int *offset, unsigned char pathsep)
{
    int off = 0;
    while (off < alen && off < blen && a[off] == b[off])
        off++;
    if (offset)
        *offset = off;
    if (off == alen || off == blen)
        return (off == blen) - (off == alen);
    return trieccmp(a[off], b[off], pathsep);
}

int triecmp(const char *a, const char *b, int *offset, unsigned char pathsep)
{
    return triencmp(a, strlen(a), b, strlen(b), offset, pathsep);
}

/* ----------------------------------------------------------------------
 * Trie node structures.
 *
 * The trie format stored in the file consists of three distinct
 * node types, each with a distinguishing type field at the start.
 *
 * TRIE_LEAF is a leaf node; it contains an actual trie_file
 * structure, and indicates that when you're searching down the
 * trie with a string, you should now expect to encounter
 * end-of-string.
 *
 * TRIE_SWITCH indicates that the set of strings in the trie
 * include strings with more than one distinct character after the
 * prefix leading up to this point. Hence, it stores multiple
 * subnode pointers and a different single character for each one.
 *
 * TRIE_STRING indicates that at this point everything in the trie
 * has the same next few characters; it stores a single mandatory
 * string fragment and exactly one subnode pointer.
 */

enum {
    TRIE_LEAF = 0x7fffe000,
    TRIE_SWITCH,
    TRIE_STRING
};

struct trie_common {
    int type;
};

struct trie_switchentry {
    off_t subnode;
    int subcount;
};

struct trie_leaf {
    struct trie_common c;
    struct trie_file file;
};

struct trie_switch {
    struct trie_common c;
    /*
     * sw[0] to sw[len-1] give the subnode pointers and element
     * counts. At &sw[len] is stored len single bytes which are
     * the characters corresponding to each subnode.
     */
    int len;
    struct trie_switchentry sw[];
};

struct trie_string {
    struct trie_common c;
    int stringlen;
    off_t subnode;
    char string[];
};

static const char magic_ident_string[16] = "agedu index file";
struct trie_magic {
    /*
     * 'Magic numbers' to go at the start of an agedu index file.
     * These are checked (using trie_check_magic) by every agedu mode
     * which reads a pre-existing index.
     *
     * As well as identifying an agedu file from any other kind of
     * file, this magic-number structure is also intended to detect
     * agedu files which were written on the wrong platform and hence
     * whose structure layouts are incompatible. To make that as
     * reliable as possible, I design the structure of magic numbers
     * as follows: it contains one of each integer type I might use,
     * each containing a different magic number, and each followed by
     * a char to indicate where it ends in the file. One byte is set
     * to the length of the magic-number structure itself, which means
     * that no two structures of different lengths can possibly
     * compare equal even if by some chance they match up to the
     * length of the shorter one. Finally, the whole magic number
     * structure is memset to another random byte value before
     * initialising any of these fields, so that padding in between
     * can be readily identified.
     */
    char ident[16];                    /* human-readable string */

    unsigned char magic_len;

    unsigned long long longlong_magic;
    unsigned char postlonglong_char_magic;

    off_t offset_magic;
    unsigned char postoffset_char_magic;

    size_t size_magic;
    unsigned char postsize_char_magic;

    void *null_pointer;
    unsigned char postptr_char_magic;

    unsigned long long_magic;
    unsigned char postlong_char_magic;

    unsigned short short_magic;
    unsigned char postshort_char_magic;
};

struct trie_header {
    struct trie_magic magic;
    off_t root, indexroot;
    int count;
    size_t maxpathlen;
    int pathsep;
};

/* Union only used for computing alignment */
union trie_node {
    struct trie_leaf leaf;
    struct { /* fake trie_switch with indeterminate array length filled in */
        struct trie_common c;
        int len;
        struct trie_switchentry sw[1];
    } sw;
    struct { /* fake trie_string with indeterminate array length filled in */
        struct trie_common c;
        int stringlen;
        off_t subnode;
        char string[1];
    } str;
};
#define TRIE_ALIGN alignof(union trie_node)

static void setup_magic(struct trie_magic *th)
{
    /*
     * Magic values are chosen so that every byte value used is
     * distinct (so that we can't fail to spot endianness issues), and
     * we cast 64 bits of data into each integer type just in case
     * the platform makes it longer than we expect it to be.
     */
    memset(th, 0xCDU, sizeof(*th));

    th->magic_len = sizeof(*th);

    memcpy(th->ident, magic_ident_string, 16);

    th->longlong_magic = 0x5583F34D5D84F73CULL;
    th->postlonglong_char_magic = 0xDDU;

    th->offset_magic = (off_t)0xB39BF9AD56D48E0BULL;
    th->postoffset_char_magic = 0x95U;

    th->size_magic = (size_t)0x6EC752B0EEAEBAC1ULL;
    th->postsize_char_magic = 0x77U;

    th->null_pointer = NULL;
    th->postptr_char_magic = 0x71U;

    th->long_magic = (unsigned long)0xA81A5E1F44334716ULL;
    th->postlong_char_magic = 0x99U;

    th->short_magic = (unsigned short)0x0C8BD7984B68D9FCULL;
    th->postshort_char_magic = 0x35U;
}

/* ----------------------------------------------------------------------
 * Trie-building functions.
 */

struct tbswitch {
    int len;
    char c[256];
    off_t off[256];
    int count[256];
};

struct triebuild {
    int fd;
    off_t offset;
    char *lastpath;
    int lastlen, lastsize;
    off_t lastoff;
    struct tbswitch *switches;
    int switchsize;
    size_t maxpathlen;
};

static void tb_seek(triebuild *tb, off_t off)
{
    tb->offset = off;
    if (lseek(tb->fd, off, SEEK_SET) < 0) {
        fprintf(stderr, PNAME ": lseek: %s\n", strerror(errno));
        exit(1);
    }
}

static void tb_write(triebuild *tb, const void *buf, size_t len)
{
    tb->offset += len;
    while (len > 0) {
        int ret = write(tb->fd, buf, len);
        if (ret < 0) {
            fprintf(stderr, PNAME ": write: %s\n", strerror(errno));
            exit(1);
        }
        len -= ret;
        buf = (const void *)((const char *)buf + ret);
    }
}

static char trie_align_zeroes[TRIE_ALIGN];

static void tb_align(triebuild *tb)
{
    int off = (TRIE_ALIGN - ((tb)->offset % TRIE_ALIGN)) % TRIE_ALIGN;
    tb_write(tb, trie_align_zeroes, off);
}

triebuild *triebuild_new(int fd)
{
    triebuild *tb = snew(triebuild);
    struct trie_header th;

    tb->fd = fd;
    tb->lastpath = NULL;
    tb->lastlen = tb->lastsize = 0;
    tb->lastoff = 0;
    tb->switches = NULL;
    tb->switchsize = 0;
    tb->maxpathlen = 0;

    setup_magic(&th.magic);
    th.root = th.count = 0;
    th.indexroot = 0;
    th.maxpathlen = 0;
    th.pathsep = (unsigned char)pathsep;
    tb_seek(tb, 0);
    tb_write(tb, &th, sizeof(th));

    return tb;
}

static off_t triebuild_unwind(triebuild *tb, int targetdepth, int *outcount)
{
    off_t offset;
    int count, depth;

    if (tb->lastoff == 0) {
        *outcount = 0;
        return 0;
    }

    offset = tb->lastoff;
    count = 1;
    depth = tb->lastlen + 1;

    assert(depth >= targetdepth);

    while (depth > targetdepth) {
        int odepth = depth;
        while (depth > targetdepth &&
               (depth-1 >= tb->switchsize || !tb->switches ||
                tb->switches[depth-1].len == 0))
            depth--;
        if (odepth > depth) {
            /*
             * Write out a string node.
             */
            size_t nodesize = sizeof(struct trie_string) + odepth - depth;
            struct trie_string *st = (struct trie_string *)smalloc(nodesize);
            st->c.type = TRIE_STRING;
            st->stringlen = odepth - depth;
            st->subnode = offset;
            memcpy(st->string, tb->lastpath + depth, odepth - depth);
            tb_align(tb);
            offset = tb->offset;
            tb_write(tb, st, nodesize);
            sfree(st);
        }

        assert(depth >= targetdepth);
        if (depth <= targetdepth)
            break;

        /*
         * Now we expect to be sitting just below a switch node.
         * Add our final entry to it and write it out.
         */
        depth--;
        {
            struct trie_switch *sw;
            char *chars;
            size_t nodesize;
            int swlen = tb->switches[depth].len;
            int i;

            assert(swlen > 0);

            tb->switches[depth].c[swlen] = tb->lastpath[depth];
            tb->switches[depth].off[swlen] = offset;
            tb->switches[depth].count[swlen] = count;
            swlen++;

            nodesize = sizeof(struct trie_switch) +
                swlen * sizeof(struct trie_switchentry) + swlen;
            sw = (struct trie_switch *)smalloc(nodesize);
            chars = (char *)&sw->sw[swlen];

            sw->c.type = TRIE_SWITCH;
            sw->len = swlen;
            count = 0;
            for (i = 0; i < swlen; i++) {
                sw->sw[i].subnode = tb->switches[depth].off[i];
                sw->sw[i].subcount = tb->switches[depth].count[i];
                chars[i] = tb->switches[depth].c[i];
                count += tb->switches[depth].count[i];
            }

            tb_align(tb);
            offset = tb->offset;
            tb_write(tb, sw, nodesize);
            sfree(sw);

            tb->switches[depth].len = 0;   /* clear this node */
        }
    }

    *outcount = count;
    return offset;
}

void triebuild_add(triebuild *tb, const char *pathname,
                   const struct trie_file *file)
{
    int pathlen = strlen(pathname);
    int depth;

    if (tb->maxpathlen < pathlen+1)
        tb->maxpathlen = pathlen+1;

    if (tb->lastpath) {
        off_t offset;
        int count;

        /*
         * Find the first differing character between this pathname
         * and the previous one.
         */
        int ret = triecmp(tb->lastpath, pathname, &depth, pathsep);
        assert(ret < 0);

        /*
         * Finalise all nodes above this depth.
         */
        offset = triebuild_unwind(tb, depth+1, &count);

        /*
         * Add the final node we just acquired to the switch node
         * at our chosen depth, creating it if it isn't already
         * there.
         */
        if (tb->switchsize <= depth) {
            int oldsize = tb->switchsize;
            tb->switchsize = depth * 3 / 2 + 64;
            tb->switches = sresize(tb->switches, tb->switchsize,
                                   struct tbswitch);
            while (oldsize < tb->switchsize)
                tb->switches[oldsize++].len = 0;
        }

        tb->switches[depth].c[tb->switches[depth].len] = tb->lastpath[depth];
        tb->switches[depth].off[tb->switches[depth].len] = offset;
        tb->switches[depth].count[tb->switches[depth].len] = count;
        tb->switches[depth].len++;
    }

    /*
     * Write out a leaf node for the new file, and remember its
     * file offset.
     */
    {
        struct trie_leaf leaf;
        leaf.c.type = TRIE_LEAF;
        leaf.file = *file;             /* structure copy */
        tb_align(tb);
        tb->lastoff = tb->offset;
        tb_write(tb, &leaf, sizeof(leaf));
    }

    /*
     * Store this pathname for comparison with the next one.
     */
    if (tb->lastsize < pathlen+1) {
        tb->lastsize = pathlen * 3 / 2 + 64;
        tb->lastpath = sresize(tb->lastpath, tb->lastsize, char);
    }
    strcpy(tb->lastpath, pathname);
    tb->lastlen = pathlen;
}

int triebuild_finish(triebuild *tb)
{
    struct trie_header th;

    setup_magic(&th.magic);
    th.root = triebuild_unwind(tb, 0, &th.count);
    th.indexroot = 0;
    th.maxpathlen = tb->maxpathlen;
    th.pathsep = (unsigned char)pathsep;

    tb_seek(tb, 0);
    tb_write(tb, &th, sizeof(th));

    return th.count;
}

void triebuild_free(triebuild *tb)
{
    sfree(tb->switches);
    sfree(tb->lastpath);
    sfree(tb);
}

/* ----------------------------------------------------------------------
 * Memory-mapped trie modification.
 */

#define MNODE(t, off, type) \
    ((struct type *)((char *)(t) + (off)))

static unsigned long long fake_atime_recurse(void *t,
                                             struct trie_common *node,
                                             int last_seen_pathsep)
{
    while (node->type == TRIE_STRING) {
        struct trie_string *st = (struct trie_string *)node;
        last_seen_pathsep = (st->string[st->stringlen-1] == pathsep);
        node = MNODE(t, st->subnode, trie_common);
    }

    if (node->type == TRIE_LEAF) {
        struct trie_leaf *leaf = (struct trie_leaf *)node;
        return leaf->file.atime;
    } else if (assert(node->type == TRIE_SWITCH), 1) {
        struct trie_switch *sw = (struct trie_switch *)node;
        const char *chars = (const char *)&sw->sw[sw->len];
        unsigned long long max = 0, subdir, ret;
        int i;
        int slashindex = -1, bareindex = -1;

        /*
         * First, process all the children of this node whose
         * switch characters are not \0 or pathsep. We do this in
         * reverse order so as to maintain best cache locality
         * (tracking generally backwards through the file), though
         * it doesn't matter semantically.
         *
         * For each of these children, we're just recursing into
         * it to do any fixups required below it, and amalgamating
         * the max atimes we get back.
         */
        for (i = sw->len; i-- > 0 ;) {
            if (chars[i] == '\0') {
                bareindex = i;
            } else if (chars[i] == pathsep) {
                slashindex = i;
            } else {
                ret = fake_atime_recurse(t, MNODE(t, sw->sw[i].subnode,
                                                  trie_common), 0);
                if (max < ret)
                    max = ret;
            }
        }

        /*
         * Now we have at most two child nodes left to deal with:
         * one with a slash (or general pathsep) and one with \0.
         *
         * If there's a slash node and a bare node, then the slash
         * node contains precisely everything inside the directory
         * described by the bare node; so we must retrieve the max
         * atime for the slash node and use it to fix up the bare
         * node.
         *
         * If there's only a bare node but the pathname leading up
         * to this point ends in a slash, then _all_ of the child
         * nodes of this node contain stuff inside the directory
         * described by the bare node; so we use the whole of the
         * maximum value we've computed so far to update the bare
         * node.
         */
        if (slashindex >= 0) {
            ret = fake_atime_recurse(t, MNODE(t, sw->sw[slashindex].subnode,
                                              trie_common), 1);
            if (max < ret)
                max = ret;
            subdir = ret;
        } else if (last_seen_pathsep) {
            subdir = max;
        } else {
            /* Don't update the bare subnode at all. */
            subdir = 0;
        }

        if (bareindex >= 0) {
            struct trie_leaf *leaf;
            leaf = MNODE(t, sw->sw[bareindex].subnode, trie_leaf);
            if (leaf && leaf->c.type == TRIE_LEAF) {
                if (leaf->file.atime < subdir)
                    leaf->file.atime = subdir;
                ret = leaf->file.atime;
            } else {
                /* Shouldn't really happen, but be cautious anyway */
                ret = fake_atime_recurse(t, &leaf->c, 0);
            }
            if (max < ret)
                max = ret;
        }

        return max;
    }
    return 0;                          /* placate lint */
}

void trie_fake_dir_atimes(void *t)
{
    struct trie_header *hdr = MNODE(t, 0, trie_header);
    struct trie_common *node = MNODE(t, hdr->root, trie_common);

    fake_atime_recurse(t, node, 1);
}

/* ----------------------------------------------------------------------
 * Querying functions.
*/ #define NODE(t, off, type) \ ((const struct type *)((const char *)(t) + (off))) int trie_check_magic(const void *t) { const struct trie_header *hdr = NODE(t, 0, trie_header); struct trie_magic magic; setup_magic(&magic); return !memcmp(&magic, &hdr->magic, sizeof(magic)); } size_t trie_maxpathlen(const void *t) { const struct trie_header *hdr = NODE(t, 0, trie_header); return hdr->maxpathlen; } unsigned long trie_before(const void *t, const char *pathname) { const struct trie_header *hdr = NODE(t, 0, trie_header); int ret = 0, lastcount = hdr->count; int len = 1 + strlen(pathname), depth = 0; off_t off = hdr->root; while (1) { const struct trie_common *node = NODE(t, off, trie_common); if (node->type == TRIE_LEAF) { if (depth < len) ret += lastcount; /* _shouldn't_ happen, but in principle */ return ret; } else if (node->type == TRIE_STRING) { const struct trie_string *st = NODE(t, off, trie_string); int offset; int cmp = triencmp(st->string, st->stringlen, pathname + depth, len-depth, &offset, pathsep); if (offset < st->stringlen) { if (cmp < 0) ret += lastcount; return ret; } depth += st->stringlen; off = st->subnode; } else if (node->type == TRIE_SWITCH) { const struct trie_switch *sw = NODE(t, off, trie_switch); const char *chars = (const char *)&sw->sw[sw->len]; int i; for (i = 0; i < sw->len; i++) { int c = chars[i]; int cmp = trieccmp(pathname[depth], c, pathsep); if (cmp > 0) ret += sw->sw[i].subcount; else if (cmp < 0) return ret; else { off = sw->sw[i].subnode; lastcount = sw->sw[i].subcount; depth++; break; } } if (i == sw->len) return ret; } } } void trie_getpath(const void *t, unsigned long n, char *buf) { const struct trie_header *hdr = NODE(t, 0, trie_header); int depth = 0; off_t off = hdr->root; while (1) { const struct trie_common *node = NODE(t, off, trie_common); if (node->type == TRIE_LEAF) { assert(depth > 0 && buf[depth-1] == '\0'); return; } else if (node->type == TRIE_STRING) { const struct trie_string *st = NODE(t, off, trie_string); 
memcpy(buf + depth, st->string, st->stringlen); depth += st->stringlen; off = st->subnode; } else if (node->type == TRIE_SWITCH) { const struct trie_switch *sw = NODE(t, off, trie_switch); const char *chars = (const char *)&sw->sw[sw->len]; int i; for (i = 0; i < sw->len; i++) { if (n < sw->sw[i].subcount) { buf[depth++] = chars[i]; off = sw->sw[i].subnode; break; } else n -= sw->sw[i].subcount; } assert(i < sw->len); } } } const struct trie_file *trie_getfile(const void *t, unsigned long n) { const struct trie_header *hdr = NODE(t, 0, trie_header); off_t off = hdr->root; while (1) { const struct trie_common *node = NODE(t, off, trie_common); if (node->type == TRIE_LEAF) { const struct trie_leaf *leaf = NODE(t, off, trie_leaf); return &leaf->file; } else if (node->type == TRIE_STRING) { const struct trie_string *st = NODE(t, off, trie_string); off = st->subnode; } else if (node->type == TRIE_SWITCH) { const struct trie_switch *sw = NODE(t, off, trie_switch); int i; for (i = 0; i < sw->len; i++) { if (n < sw->sw[i].subcount) { off = sw->sw[i].subnode; break; } else n -= sw->sw[i].subcount; } assert(i < sw->len); } } } unsigned long trie_count(const void *t) { const struct trie_header *hdr = NODE(t, 0, trie_header); return hdr->count; } char trie_pathsep(const void *t) { const struct trie_header *hdr = NODE(t, 0, trie_header); return (char)hdr->pathsep; } struct triewalk_switch { const struct trie_switch *sw; int pos, depth, count; }; struct triewalk { const void *t; struct triewalk_switch *switches; int nswitches, switchsize; int count; }; triewalk *triewalk_new(const void *vt) { triewalk *tw = snew(triewalk); tw->t = (const char *)vt; tw->switches = NULL; tw->switchsize = 0; tw->nswitches = -1; tw->count = 0; return tw; } void triewalk_rebase(triewalk *tw, const void *t) { ptrdiff_t diff = ((const unsigned char *)t - (const unsigned char *)(tw->t)); int i; tw->t = t; for (i = 0; i < tw->nswitches; i++) tw->switches[i].sw = (const struct trie_switch *) ((const 
unsigned char *)(tw->switches[i].sw) + diff); } const struct trie_file *triewalk_next(triewalk *tw, char *buf) { off_t off; int depth; if (tw->nswitches < 0) { const struct trie_header *hdr = NODE(tw->t, 0, trie_header); off = hdr->root; depth = 0; tw->nswitches = 0; } else { while (1) { int swpos; const struct trie_switch *sw; const char *chars; if (tw->nswitches == 0) { assert(tw->count == NODE(tw->t, 0, trie_header)->count); return NULL; /* run out of trie */ } swpos = tw->switches[tw->nswitches-1].pos; sw = tw->switches[tw->nswitches-1].sw; chars = (const char *)&sw->sw[sw->len]; if (swpos < sw->len) { depth = tw->switches[tw->nswitches-1].depth; off = sw->sw[swpos].subnode; if (buf) buf[depth++] = chars[swpos]; assert(tw->count == tw->switches[tw->nswitches-1].count); tw->switches[tw->nswitches-1].count += sw->sw[swpos].subcount; tw->switches[tw->nswitches-1].pos++; break; } tw->nswitches--; } } while (1) { const struct trie_common *node = NODE(tw->t, off, trie_common); if (node->type == TRIE_LEAF) { const struct trie_leaf *lf = NODE(tw->t, off, trie_leaf); if (buf) assert(depth > 0 && buf[depth-1] == '\0'); tw->count++; return &lf->file; } else if (node->type == TRIE_STRING) { const struct trie_string *st = NODE(tw->t, off, trie_string); if (buf) memcpy(buf + depth, st->string, st->stringlen); depth += st->stringlen; off = st->subnode; } else if (node->type == TRIE_SWITCH) { const struct trie_switch *sw = NODE(tw->t, off, trie_switch); const char *chars = (const char *)&sw->sw[sw->len]; if (tw->nswitches >= tw->switchsize) { tw->switchsize = tw->nswitches * 3 / 2 + 32; tw->switches = sresize(tw->switches, tw->switchsize, struct triewalk_switch); } tw->switches[tw->nswitches].sw = sw; tw->switches[tw->nswitches].pos = 1; tw->switches[tw->nswitches].depth = depth; tw->switches[tw->nswitches].count = tw->count + sw->sw[0].subcount; off = sw->sw[0].subnode; if (buf) buf[depth++] = chars[0]; tw->nswitches++; } } } void triewalk_free(triewalk *tw) { 
    sfree(tw->switches);
    sfree(tw);
}

void trie_set_index_offset(void *t, off_t ptr)
{
    ((struct trie_header *)t)->indexroot = ptr;
}

off_t trie_get_index_offset(const void *t)
{
    return ((const struct trie_header *)t)->indexroot;
}

void make_successor(char *pathbuf)
{
    int len = strlen(pathbuf);
    if (len > 0 && pathbuf[len-1] == pathsep)
        len--;
    pathbuf[len] = '\001';
    pathbuf[len+1] = '\0';
}

agedu-20211129.8cd63c5/tree.but

\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both upward and leftward of a
user-specified point).

The program
\W{https://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.
A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and
\cw{y\_<\_y0}?}. The previous version of the question, of course, is
a special case of this in which all weights are set to 1.

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight
of everything up to and including that point. Then, given any
\cw{x0}, a log-time binary search will find the last point in the
list with \cw{x\_<\_x0}, and we can immediately read off the
cumulative weight figure.

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types
\dash AVL tree, red-black tree, the various forms of B-tree \dash
will suffice. We then annotate every node of the tree with an extra
field giving the total weight of the points in and below that node.
These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions.
So we can start with an empty tree and insert a complete set of
points, or we can insert and delete points in arbitrary order, and
the tree annotations will remain valid at all times.

A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has
total-weight annotations as described above, we can arrange that
this search calculates the total weight of all points with
\cw{x\_<\_x0}: every time we examine a tree node and decide which
subtree of it to descend to, we look at the top node of each subtree
to the left of that one, and add their weight annotations to a
running total. When we reach a leaf node and find our chosen subtree
is \cw{NULL}, the running total is the required answer.

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above. So for any query coordinates
\cw{(x0,y0)}, there must be some moment during that process at which
we have added to our tree precisely those points with \cw{y\_<\_y0}.
At that moment, we could search the tree to find the total weight of
everything it contained with \cw{x\_<\_x0}, and that would be the
answer to our two-dimensional query.
Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in
turn to a balanced tree sorted by \cw{x} with total-weight
annotations. The catch is that, this time, we never \e{modify} any
node in the tree: whenever the process of inserting an element into
a tree wants to modify a node, we instead make a \e{copy} of that
node containing the modified data. The parent of that node must now
be modified (even if it would previously not have needed
modification) so that it points at the copied node instead of the
original \dash so we do the same thing there, and copy that one too,
and so on up to the root.

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for each point in the process,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any point in the process.
So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.
Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes \cw{O(log
N)} time, it must modify at most \cw{O(log N)} nodes. Each of these
node-copying insertion processes must copy all of those nodes, but
need not copy any others \dash so it creates at most \cw{O(log N)}
new nodes. Hence, the total storage used by the combined set of
trees is \cw{O(N log N)}, much smaller than the \cw{O(N^2 log N)}
you'd expect if the trees were all separate or even mostly separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.
Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way to
do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and
\cw{O(N log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{https://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files.
So now our query primitive becomes \q{What is the total size of
files whose pathname falls between \cw{P0} and \cw{P1}, and whose
atime is at most \cw{T0}?} Clearly, we can simplify this to the same
type of quadrant query as discussed above, by splitting this into
two subqueries: the total size of files with \cw{P0\_<=\_P\_<\_P1}
and \cw{T\_<\_T0} is clearly the total size of files with
\cw{P\_<\_P1, T\_<\_T0} minus the total size of files with
\cw{P\_<\_P0, T\_<\_T0}. Each of those subqueries is of precisely
the type we have just derived a data structure to answer.

So we want to sort our files by two \q{coordinates}: one is atime,
and the other is pathname (sorted in ASCII collation order). So
which should be \cw{x} and which \cw{y}, in the above notation?
Well, either way round would work in principle, but two criteria
inform the decision.

Firstly, \cw{agedu} typically wants to do many queries on the same
pathname for different atimes, so as to build up a detailed profile
of size against atime for a given subdirectory. So it makes sense to
have the first-level lookup (on \cw{y}, to find a tree root) be done
on pathname, and the secondary lookup (on \cw{x}, within that tree)
be done on atime; then we can cache the tree root found in the first
lookup, and use it many times without wasting time repeating the
pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}.
In real-world cases, this cuts down the total space usage by about a
factor of five, so it's well worthwhile \dash and we wouldn't be
able to do it if we'd used atime as the \cw{y}-coordinate instead of
pathname.

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations \e{each} of which could take time proportional to
the length of the string.

Finally, the two-dimensions limitation on the above data structure
unfortunately imposes limitations on what \cw{agedu} can do. One
would like to run a single \cw{agedu} as a system-wide job on a file
server (perhaps nightly or weekly), and present the results to all
users in such a way that each user's view of the data was filtered
to only what their access permissions permitted them to see. Sadly,
to do this would require a third dimension in the data structure
(representing ownership or access control, in some fashion), and
hence cannot be done efficiently. \cw{agedu} is therefore instead
most sensibly used on demand by an individual user, so that it
generates a custom data set for that user every time.
agedu-20211129.8cd63c5/licence.c

#include "agedu.h"

const char *const licence[] = {
    PNAME " is copyright 2008 Simon Tatham. All rights reserved.\n",
    "\n",
    "Permission is hereby granted, free of charge, to any person\n",
    "obtaining a copy of this software and associated documentation files\n",
    "(the \"Software\"), to deal in the Software without restriction,\n",
    "including without limitation the rights to use, copy, modify, merge,\n",
    "publish, distribute, sublicense, and/or sell copies of the Software,\n",
    "and to permit persons to whom the Software is furnished to do so,\n",
    "subject to the following conditions:\n",
    "\n",
    "The above copyright notice and this permission notice shall be\n",
    "included in all copies or substantial portions of the Software.\n",
    "\n",
    "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND,\n",
    "EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF\n",
    "MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND\n",
    "NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS\n",
    "BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN\n",
    "ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN\n",
    "CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n",
    "SOFTWARE.\n",
    0
};

agedu-20211129.8cd63c5/index.h

/*
 * index.h: Manage indexes for agedu.
 */

#include <sys/types.h>

/*
 * Given the size of an existing data file and the number of
 * entries required in the index, return the file size required to
 * store the fixed initial fragment of the index before the
 * indexbuild functions are invoked.
 */
off_t index_initial_size(off_t currentsize, int nodecount);

/*
 * Build an index, given the address of a memory-mapped data file
 * and the starting offset within it.
 *
 * trie_file structures passed to tf must of course be within the
 * bounds of the memory-mapped file.
 *
 * The "delta" parameter to indexbuild_new is filled in with the
 * maximum size by which the index can ever grow in a single
 * indexbuild_add call. This can be used, together with
 * indexbuild_realsize, to detect when the index is about to get
 * too big for its file and the file needs resizing.
 *
 * indexbuild_add adds a trie_file structure to the index.
 * indexbuild_tag, called after that, causes the current state of
 * the index to be preserved so that it can be queried later. It's
 * idempotent, and will safely do nothing if called before
 * indexbuild_add.
 *
 * indexbuild_realsize returns the total amount of data _actually_
 * written into the file, to allow it to be truncated afterwards,
 * and to allow the caller of the indexbuild functions to know
 * when the file is about to need to grow bigger during index
 * building.
 *
 * indexbuild_rebase is used when the file has been
 * re-memory-mapped at a different address (because it needs to
 * grow).
 */
typedef struct indexbuild indexbuild;
indexbuild *indexbuild_new(void *t, off_t startoff, int nodecount,
                           size_t *delta);
void indexbuild_add(indexbuild *ib, const struct trie_file *tf);
void indexbuild_tag(indexbuild *ib);
void indexbuild_rebase(indexbuild *ib, void *t);
off_t indexbuild_realsize(indexbuild *ib);
void indexbuild_free(indexbuild *ib);

/*
 * Check to see if a name index has an index tree root available
 * (i.e. represents a directory rather than a file).
 */
int index_has_root(const void *t, int n);

/*
 * Query an index to find the total size of records with name
 * index strictly less than n, with atime less than at.
 */
unsigned long long index_query(const void *t, int n,
                               unsigned long long at);

/*
 * Retrieve an order statistic from the index: given a fraction f,
 * return an atime such that (at most) the requested fraction of
 * the total data is older than it.
 */
unsigned long long index_order_stat(const void *t, double f);

agedu-20211129.8cd63c5/index.c

/*
 * index.c: Implementation of index.h.
 */

#include "agedu.h"
#include "trie.h"
#include "index.h"
#include "alloc.h"

#define alignof(typ) ( offsetof(struct { char c; typ t; }, t) )

#define PADDING(x, mod) ( ((mod) - ((x) % (mod))) % (mod) )

struct avlnode {
    off_t children[2], element;
    int maxdepth;                      /* maximum depth of this subtree */
    unsigned long long totalsize;
};

/*
 * Determine the maximum depth of an AVL tree containing a certain
 * number of nodes.
 */
static int index_maxdepth(int nodecount)
{
    int b, c, maxdepth;

    /*
     * Model the tree growing at maximum imbalance. We do this by
     * determining the number of nodes in the most unbalanced
     * (i.e. smallest) tree of any given depth, and stopping when
     * that's larger than nodecount.
     */
    maxdepth = 1;
    b = 0;
    c = 1;
    while (b <= nodecount) {
        int tmp;

        tmp = 1 + b + c;
        b = c;
        c = tmp;
        maxdepth++;
    }

    return maxdepth;
}

off_t index_initial_size(off_t currentsize, int nodecount)
{
    currentsize += PADDING(currentsize, alignof(off_t));
    currentsize += nodecount * sizeof(off_t);
    currentsize += PADDING(currentsize, alignof(struct avlnode));

    return currentsize;
}

/* ----------------------------------------------------------------------
 * Functions to build the index.
*/ struct indexbuild { void *t; int n, nnodes; struct avlnode *nodes; off_t *roots; struct avlnode *currroot; struct avlnode *firstmutable; }; #define ELEMENT(t,offset) \ ((struct trie_file *) ((offset) ? ((char *)(t) + (offset)) : NULL)) #define NODE(t,offset) \ ((struct avlnode *) ((offset) ? ((char *)(t) + (offset)) : NULL)) #define OFFSET(t,node) \ ((node) ? (off_t)((const char *)node - (const char *)t) : (off_t)0) #define MAXDEPTH(node) ((node) ? (node)->maxdepth : 0) indexbuild *indexbuild_new(void *t, off_t startoff, int nodecount, size_t *delta) { indexbuild *ib = snew(indexbuild); ib->t = t; startoff += PADDING(startoff, alignof(off_t)); ib->roots = (off_t *)((char *)t + startoff); trie_set_index_offset(t, startoff); startoff += nodecount * sizeof(off_t); startoff += PADDING(startoff, alignof(struct avlnode)); ib->nodes = (struct avlnode *)((char *)t + startoff); ib->nnodes = ib->n = 0; ib->currroot = NULL; ib->firstmutable = ib->nodes + ib->nnodes; if (delta) *delta = sizeof(struct avlnode) * (1 + index_maxdepth(nodecount)); return ib; } /* * Return a mutable node, which is n or a copy of n if n is * non-NULL. */ static struct avlnode *avl_makemutable(indexbuild *ib, struct avlnode *n) { struct avlnode *newnode; if (n && n >= ib->firstmutable) return n; /* already mutable */ newnode = ib->nodes + ib->nnodes++; if (n) *newnode = *n; /* structure copy */ return newnode; } /* * Fix the annotations in a tree node. */ static void avl_fix(indexbuild *ib, struct avlnode *n) { /* * Make sure the max depth field is right. */ n->maxdepth = 1 + max(MAXDEPTH(NODE(ib->t, n->children[0])), MAXDEPTH(NODE(ib->t, n->children[1]))); n->totalsize = (ELEMENT(ib->t, n->element)->size + (n->children[0] ? NODE(ib->t, n->children[0])->totalsize : 0) + (n->children[1] ? 
NODE(ib->t, n->children[1])->totalsize : 0)); } static struct avlnode *avl_insert(indexbuild *ib, struct avlnode *n, off_t node) { struct trie_file *newfile; struct trie_file *oldfile; int subtree; struct avlnode *nn; /* * Recursion bottoming out: if the subtree we're inserting * into is null, just construct and return a fresh node. */ if (!n) { n = avl_makemutable(ib, NULL); n->children[0] = n->children[1] = 0; n->element = node; avl_fix(ib, n); return n; } /* * Otherwise, we have to insert into an existing tree. */ /* * Determine which subtree to insert this node into. Ties * aren't important, so we just break them any old way. */ newfile = (struct trie_file *)((char *)ib->t + node); oldfile = (struct trie_file *)((char *)ib->t + n->element); if (newfile->atime > oldfile->atime) subtree = 1; else subtree = 0; /* * Construct a copy of the node we're looking at. */ n = avl_makemutable(ib, n); /* * Recursively insert into the next subtree down. */ nn = avl_insert(ib, NODE(ib->t, n->children[subtree]), node); n->children[subtree] = OFFSET(ib->t, nn); /* * Rebalance if necessary, to ensure that our node's children * differ in maximum depth by at most one. Of course, the * subtree we've just modified will be the deeper one if so. */ if (MAXDEPTH(NODE(ib->t, n->children[subtree])) > MAXDEPTH(NODE(ib->t, n->children[1-subtree])) + 1) { struct avlnode *p, *q; /* * There are two possible cases, one of which requires a * single tree rotation and the other requires two. It all * depends on which subtree of the next node down (here p) * is the taller. (It turns out that they can't both be * the same height: any tree which has just increased in * depth must have one subtree strictly taller than the * other.) 
*/ p = NODE(ib->t, n->children[subtree]); assert(p >= ib->firstmutable); if (MAXDEPTH(NODE(ib->t, p->children[subtree])) >= MAXDEPTH(NODE(ib->t, p->children[1-subtree]))) { /* * n p * / \ / \ * [k] p -> n [k+1] * / \ / \ * [k] [k+1] [k] [k] */ n->children[subtree] = p->children[1-subtree]; p->children[1-subtree] = OFFSET(ib->t, n); avl_fix(ib, n); n = p; } else { q = NODE(ib->t, p->children[1-subtree]); assert(q >= ib->firstmutable); p->children[1-subtree] = OFFSET(ib->t, q); /* * n n q * / \ / \ / \ * [k] p == [k] p -> n p * / \ / \ / \ / \ * [k+1] [k] q [k] [k] \ / [k] * / \ [k-1,k] [k-1,k] * [k-1,k] [k-1,k] */ n->children[subtree] = q->children[1-subtree]; p->children[1-subtree] = q->children[subtree]; q->children[1-subtree] = OFFSET(ib->t, n); q->children[subtree] = OFFSET(ib->t, p); avl_fix(ib, n); avl_fix(ib, p); n = q; } } /* * Fix up our maximum depth field. */ avl_fix(ib, n); /* * Done. */ return n; } void indexbuild_add(indexbuild *ib, const struct trie_file *tf) { off_t node = OFFSET(ib->t, tf); ib->currroot = avl_insert(ib, ib->currroot, node); ib->roots[ib->n++] = 0; } void indexbuild_tag(indexbuild *ib) { if (ib->n > 0) ib->roots[ib->n - 1] = OFFSET(ib->t, ib->currroot); ib->firstmutable = ib->nodes + ib->nnodes; } void indexbuild_rebase(indexbuild *ib, void *t) { ptrdiff_t diff = (unsigned char *)t - (unsigned char *)(ib->t); ib->t = t; ib->nodes = (struct avlnode *)((unsigned char *)ib->nodes + diff); ib->roots = (off_t *)((unsigned char *)ib->roots + diff); if (ib->currroot) ib->currroot = (struct avlnode *) ((unsigned char *)ib->currroot + diff); ib->firstmutable = (struct avlnode *)((unsigned char *)ib->firstmutable + diff); } off_t indexbuild_realsize(indexbuild *ib) { return OFFSET(ib->t, (ib->nodes + ib->nnodes)); } void indexbuild_free(indexbuild *ib) { assert(ib->n == trie_count(ib->t)); sfree(ib); } int index_has_root(const void *t, int n) { const off_t *roots; roots = (const off_t *)((const char *)t + trie_get_index_offset(t)); if (n == 0) 
        return 1;
    if (n < 0 || n >= trie_count(t) || !roots[n-1])
        return 0;
    return 1;
}

unsigned long long index_query(const void *t, int n, unsigned long long at)
{
    const off_t *roots;
    const struct avlnode *node;
    unsigned long count;
    unsigned long long ret;

    roots = (const off_t *)((const char *)t + trie_get_index_offset(t));

    if (n < 1)
        return 0;
    count = trie_count(t);
    if (n > count)
        n = count;
    assert(roots[n-1]);
    node = NODE(t, roots[n-1]);

    ret = 0;
    while (node) {
        const struct trie_file *tf = ELEMENT(t, node->element);
        const struct avlnode *left = NODE(t, node->children[0]);
        const struct avlnode *right = NODE(t, node->children[1]);

        if (at <= tf->atime) {
            node = left;
        } else {
            if (left)
                ret += left->totalsize;
            ret += tf->size;
            node = right;
        }
    }

    return ret;
}

unsigned long long index_order_stat(const void *t, double f)
{
    const off_t *roots;
    const struct avlnode *node;
    unsigned long count;
    unsigned long long size;

    roots = (const off_t *)((const char *)t + trie_get_index_offset(t));
    count = trie_count(t);
    assert(roots[count-1]);
    node = NODE(t, roots[count-1]);

    size = node->totalsize * f;
    if (size > node->totalsize) {
        /*
         * This can happen in principle if node->totalsize is so
         * enormous (bigger than 2^53 bytes) that it can't be
         * represented faithfully in a 'double'. Then it might be
         * rounded _up_ by the conversion to double, and if
         * multiplication by f doesn't make it smaller again, end up
         * just bigger than node->totalsize by the time we convert
         * back to an integer.
         *
         * It takes a huge filesystem to make that happen in reality -
         * but a corrupt index file could also trip the same case, so
         * we should at least handle it gracefully.
         */
        size = node->totalsize;
    }

    while (1) {
        const struct trie_file *tf = ELEMENT(t, node->element);
        const struct avlnode *left = NODE(t, node->children[0]);
        const struct avlnode *right = NODE(t, node->children[1]);

        if (left && size < left->totalsize) {
            node = left;
        } else if (!right || size < (left ?
left->totalsize : 0) + tf->size) {
            return tf->atime;
        } else {
            if (left)
                size -= left->totalsize;
            size -= tf->size;
            node = right;
        }
    }
}
agedu-20211129.8cd63c5/httpd.h0000644000175000017500000000067714151034324014457 0ustar simonsimon/*
 * httpd.h: a minimal custom HTTP server designed to serve the
 * pages generated by html.h.
 */

#define HTTPD_AUTH_MAGIC 1
#define HTTPD_AUTH_BASIC 2
#define HTTPD_AUTH_NONE 4

struct httpd_config {
    const char *address, *port;
    bool closeoneof;
    const char *basicauthdata;
    const char *url_launch_command;
};

void run_httpd(const void *t, int authmask,
               const struct httpd_config *dcfg,
               const struct html_config *pcfg);
agedu-20211129.8cd63c5/httpd.c0000644000175000017500000007457714151034324014464 0ustar simonsimon/*
 * httpd.c: implementation of httpd.h.
 */

#include "agedu.h"
#include "alloc.h"
#include "html.h"
#include "httpd.h"

/* --- Logic driving what the web server's responses are.
--- */ enum { /* connctx states */ READING_REQ_LINE, READING_HEADERS, DONE }; struct connctx { const void *t; char *data; int datalen, datasize; char *method, *url, *headers, *auth; int state; }; /* * Called when a new connection arrives on a listening socket. * Returns a connctx for the new connection. */ struct connctx *new_connection(const void *t) { struct connctx *cctx = snew(struct connctx); cctx->t = t; cctx->data = NULL; cctx->datalen = cctx->datasize = 0; cctx->state = READING_REQ_LINE; cctx->method = cctx->url = cctx->headers = cctx->auth = NULL; return cctx; } void free_connection(struct connctx *cctx) { sfree(cctx->data); sfree(cctx); } static char *http_error(char *code, char *errmsg, char *extraheader, char *errtext, ...) { return dupfmt("HTTP/1.1 %s %s\r\n" "Date: %D\r\n" "Server: " PNAME "\r\n" "Connection: close\r\n" "%s" "Content-Type: text/html; charset=US-ASCII\r\n" "\r\n" "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\r\n" "<HTML><HEAD>\r\n" "<TITLE>%s %s\r\n" "\r\n" "

</TITLE>\r\n"
                  "</HEAD><BODY>\r\n"
                  "<H1>%s %s</H1>\r\n"
                  "<P>%s</P>\r\n"
                  "</BODY></HTML>
\r\n" "\r\n", code, errmsg, extraheader ? extraheader : "", code, errmsg, code, errmsg, errtext); } static char *http_success(char *mimetype, bool stuff_cr, char *document) { return dupfmt("HTTP/1.1 200 OK\r\n" "Date: %D\r\n" "Expires: %D\r\n" "Server: " PNAME "\r\n" "Connection: close\r\n" "Content-Type: %s\r\n" "\r\n" "%S", mimetype, stuff_cr, document); } /* * Called when data comes in on a connection. * * If this function returns NULL, the platform code continues * reading from the socket. Otherwise, it returns some dynamically * allocated data which the platform code will then write to the * socket before closing it. */ char *got_data(struct connctx *ctx, char *data, int length, bool magic_access, const char *auth_string, const struct html_config *cfg) { char *line, *p, *q, *r, *z1, *z2, c1, c2; bool auth_correct = false; unsigned long index; char *document, *ret; /* * Add the data we've just received to our buffer. */ if (ctx->datasize < ctx->datalen + length) { ctx->datasize = (ctx->datalen + length) * 3 / 2 + 4096; ctx->data = sresize(ctx->data, ctx->datasize, char); } memcpy(ctx->data + ctx->datalen, data, length); ctx->datalen += length; /* * Gradually process the HTTP request as we receive it. */ if (ctx->state == READING_REQ_LINE) { /* * We're waiting for the first line of the input, which * contains the main HTTP request. See if we've got it * yet. */ line = ctx->data; /* * RFC 2616 section 4.1: `In the interest of robustness, * [...] if the server is reading the protocol stream at * the beginning of a message and receives a CRLF first, * it should ignore the CRLF.' */ while (line - ctx->data < ctx->datalen && (*line == '\r' || *line == '\n')) line++; q = line; while (q - ctx->data < ctx->datalen && *q != '\n') q++; if (q - ctx->data >= ctx->datalen) return NULL; /* not got request line yet */ /* * We've got the first line of the request. Zero-terminate * and parse it into method, URL and optional HTTP * version. 
*/ *q = '\0'; ctx->headers = q+1; if (q > line && q[-1] == '\r') *--q = '\0'; z1 = z2 = q; c1 = c2 = *q; p = line; while (*p && !isspace((unsigned char)*p)) p++; if (*p) { z1 = p++; c1 = *z1; *z1 = '\0'; } while (*p && isspace((unsigned char)*p)) p++; q = p; while (*q && !isspace((unsigned char)*q)) q++; z2 = q++; c2 = *z2; *z2 = '\0'; while (*q && isspace((unsigned char)*q)) q++; /* * Now `line' points at the method name; p points at the * URL, if any; q points at the HTTP version, if any. */ /* * There should _be_ a URL, on any request type at all. */ if (!*p) { char *ret, *text; /* Restore the request to the way we received it. */ *z2 = c2; *z1 = c1; text = dupfmt("" PNAME " received the HTTP request" " \"%h\", which contains no URL.", line); ret = http_error("400", "Bad request", NULL, text); sfree(text); return ret; } ctx->method = line; ctx->url = p; /* * If there was an HTTP version, we might need to see * headers. Otherwise, the request is done. */ if (*q) { ctx->state = READING_HEADERS; } else { ctx->state = DONE; } } if (ctx->state == READING_HEADERS) { /* * While we're receiving the HTTP request headers, all we * do is to keep scanning to see if we find two newlines * next to each other. */ q = ctx->data + ctx->datalen; for (p = ctx->headers; p < q; p++) { if (*p == '\n' && ((p+1 < q && p[1] == '\n') || (p+2 < q && p[1] == '\r' && p[2] == '\n'))) { p[1] = '\0'; ctx->state = DONE; break; } } } if (ctx->state == DONE) { /* * Now we have the entire HTTP request. Decide what to do * with it. */ if (auth_string) { /* * Search the request headers for Authorization. 
*/ q = ctx->data + ctx->datalen; for (p = ctx->headers; p < q; p++) { const char *hdr = "Authorization:"; int i; for (i = 0; hdr[i]; i++) { if (p >= q || tolower((unsigned char)*p) != tolower((unsigned char)hdr[i])) break; p++; } if (!hdr[i]) break; /* found our header */ p = memchr(p, '\n', q - p); if (!p) p = q; } if (p < q) { while (p < q && isspace((unsigned char)*p)) p++; r = p; while (p < q && !isspace((unsigned char)*p)) p++; if (p < q) { *p++ = '\0'; if (!strcasecmp(r, "Basic")) { while (p < q && isspace((unsigned char)*p)) p++; r = p; while (p < q && !isspace((unsigned char)*p)) p++; if (p < q) { *p++ = '\0'; if (!strcmp(r, auth_string)) auth_correct = true; } } } } } if (!magic_access && !auth_correct) { if (auth_string) { ret = http_error("401", "Unauthorized", "WWW-Authenticate: Basic realm=\""PNAME"\"\r\n", "\nYou must authenticate to view these pages."); } else { ret = http_error("403", "Forbidden", NULL, "This is a restricted-access set of pages."); } } else { p = ctx->url; if (!html_parse_path(ctx->t, p, cfg, &index)) { ret = http_error("404", "Not Found", NULL, "This is not a valid pathname."); } else { char *canonpath = html_format_path(ctx->t, cfg, index); if (!strcmp(canonpath, p)) { /* * This is a canonical path. Return the document. */ document = html_query(ctx->t, index, cfg, true); if (document) { ret = http_success("text/html", true, document); sfree(document); } else { ret = http_error("404", "Not Found", NULL, "This is not a valid pathname."); } } else { /* * This is a non-canonical path. Return a redirect * to the right one. * * To do this, we must search the request headers * for Host:, to see what the client thought it * was calling our server. 
*/ char *host = NULL; q = ctx->data + ctx->datalen; for (p = ctx->headers; p < q; p++) { const char *hdr = "Host:"; int i; for (i = 0; hdr[i]; i++) { if (p >= q || tolower((unsigned char)*p) != tolower((unsigned char)hdr[i])) break; p++; } if (!hdr[i]) break; /* found our header */ p = memchr(p, '\n', q - p); if (!p) p = q; } if (p < q) { while (p < q && isspace((unsigned char)*p)) p++; r = p; while (p < q) { if (*p == '\r' && (p+1 >= q || p[1] == '\n')) break; p++; } host = snewn(p-r+1, char); memcpy(host, r, p-r); host[p-r] = '\0'; } if (host) { char *header = dupfmt("Location: http://%s%s\r\n", host, canonpath); ret = http_error("301", "Moved", header, "This is not the canonical form of" " this pathname."); sfree(header); } else { ret = http_error("400", "Bad Request", NULL, "Needed a Host: header to return" " the intended redirection."); } } sfree(canonpath); } } return ret; } else return NULL; } /* --- Platform support for running a web server. --- */ enum { FD_CLIENT, FD_LISTENER, FD_CONNECTION }; struct fd { int fd; int type; bool deleted; char *wdata; int wdatalen, wdatapos; bool magic_access; struct connctx *cctx; }; struct fd *fds = NULL; int nfds = 0, fdsize = 0; struct fd *new_fdstruct(int fd, int type) { struct fd *ret; if (nfds >= fdsize) { fdsize = nfds * 3 / 2 + 32; fds = sresize(fds, fdsize, struct fd); } ret = &fds[nfds++]; ret->fd = fd; ret->type = type; ret->wdata = NULL; ret->wdatalen = ret->wdatapos = 0; ret->cctx = NULL; ret->deleted = false; ret->magic_access = false; return ret; } int check_owning_uid(int fd, int flip) { struct sockaddr_storage sock, peer; socklen_t addrlen; char linebuf[4096], matchbuf[128]; char *filename; int matchlen; FILE *fp; addrlen = sizeof(sock); if (getsockname(fd, (struct sockaddr *)&sock, &addrlen)) { fprintf(stderr, "getsockname: %s\n", strerror(errno)); exit(1); } addrlen = sizeof(peer); if (getpeername(fd, (struct sockaddr *)&peer, &addrlen)) { if (errno == ENOTCONN) { memset(&peer, 0, sizeof(peer)); 
peer.ss_family = sock.ss_family; } else { fprintf(stderr, "getpeername: %s\n", strerror(errno)); exit(1); } } if (flip) { struct sockaddr_storage tmp = sock; sock = peer; peer = tmp; } #ifdef AGEDU_IPV4 if (peer.ss_family == AF_INET) { struct sockaddr_in *sock4 = (struct sockaddr_in *)&sock; struct sockaddr_in *peer4 = (struct sockaddr_in *)&peer; assert(peer4->sin_family == AF_INET); sprintf(matchbuf, "%08X:%04X %08X:%04X", peer4->sin_addr.s_addr, ntohs(peer4->sin_port), sock4->sin_addr.s_addr, ntohs(sock4->sin_port)); filename = "/proc/net/tcp"; } else #endif #ifdef AGEDU_IPV6 if (peer.ss_family == AF_INET6) { struct sockaddr_in6 *sock6 = (struct sockaddr_in6 *)&sock; struct sockaddr_in6 *peer6 = (struct sockaddr_in6 *)&peer; char *p; assert(peer6->sin6_family == AF_INET6); p = matchbuf; for (int i = 0; i < 4; i++) p += sprintf(p, "%08X", ((uint32_t *)peer6->sin6_addr.s6_addr)[i]); p += sprintf(p, ":%04X ", ntohs(peer6->sin6_port)); for (int i = 0; i < 4; i++) p += sprintf(p, "%08X", ((uint32_t *)sock6->sin6_addr.s6_addr)[i]); p += sprintf(p, ":%04X", ntohs(sock6->sin6_port)); filename = "/proc/net/tcp6"; } else #endif { return -1; /* unidentified family */ } matchlen = strlen(matchbuf); fp = fopen(filename, "r"); if (fp) { while (fgets(linebuf, sizeof(linebuf), fp)) { /* * Check for, and skip over, the initial sequence number * that appears before the sockaddr/peeraddr pair. This is * printf'ed as "%4d: ", so it could be prefixed by * spaces, but could also be longer than 4 digits. */ const char *p = linebuf; p += strspn(p, " "); p += strspn(p, "0123456789"); if (*p != ':') goto not_this_line; p++; p += strspn(p, " "); if (!strncmp(p, matchbuf, matchlen)) { /* * This line matches the address string. Skip 4 words * after that (TCP state-machine state, tx/rx queue, * timer details, number of retransmissions) and then * we expect to find the uid. 
*/ int word; p += matchlen; p += strspn(p, " "); for (word = 0; word < 4; word++) { p += strcspn(p, " "); if (*p != ' ') goto not_this_line; p += strspn(p, " "); } fclose(fp); return atoi(p); } not_this_line:; } fclose(fp); } return -1; } void check_magic_access(struct fd *fd) { if (check_owning_uid(fd->fd, 0) == getuid()) fd->magic_access = true; } static void base64_encode_atom(unsigned char *data, int n, char *out) { static const char base64_chars[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; unsigned word; word = data[0] << 16; if (n > 1) word |= data[1] << 8; if (n > 2) word |= data[2]; out[0] = base64_chars[(word >> 18) & 0x3F]; out[1] = base64_chars[(word >> 12) & 0x3F]; if (n > 1) out[2] = base64_chars[(word >> 6) & 0x3F]; else out[2] = '='; if (n > 2) out[3] = base64_chars[word & 0x3F]; else out[3] = '='; } struct listenfds { int v4, v6; }; static int make_listening_sockets(struct listenfds *fds, const char *address, const char *portstr, char **outhostname) { /* * Establish up to 2 listening sockets, for IPv4 and IPv6, on the * same arbitrarily selected port. Return them in fds.v4 and * fds.v6, with each entry being -1 if that socket was not * established at all. Main return value is the port chosen, or <0 * if the whole process failed. */ struct sockaddr_in6 addr6; struct sockaddr_in addr4; bool got_v6, got_v4; socklen_t addrlen; int ret, port = 0; /* * Special case of the address parameter: if it's "0.0.0.0", treat * it like NULL, because that was how you specified listen-on-any- * address in versions before the IPv6 revamp. 
*/ { int u,v,w,x; if (address && 4 == sscanf(address, "%d.%d.%d.%d", &u, &v, &w, &x) && u==0 && v==0 && w==0 && x==0) address = NULL; } if (portstr && !*portstr) portstr = NULL; /* normalise NULL and empty string */ if (!address) { char hostname[HOST_NAME_MAX]; if (gethostname(hostname, sizeof(hostname)) < 0) { perror("hostname"); return -1; } *outhostname = dupstr(hostname); } else { *outhostname = dupstr(address); } fds->v6 = fds->v4 = -1; got_v6 = false; got_v4 = false; #if HAVE_GETADDRINFO /* * Resolve the given address using getaddrinfo, yielding an IPv6 * address or an IPv4 one or both. */ struct addrinfo hints; struct addrinfo *addrs, *ai; hints.ai_family = AF_UNSPEC; hints.ai_socktype = SOCK_STREAM; hints.ai_protocol = 0; hints.ai_flags = AI_PASSIVE; ret = getaddrinfo(address, portstr, &hints, &addrs); if (ret) { fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(ret)); return -1; } for (ai = addrs; ai; ai = ai->ai_next) { #ifdef AGEDU_IPV6 if (!got_v6 && ai->ai_family == AF_INET6) { memcpy(&addr6, ai->ai_addr, ai->ai_addrlen); if (portstr && !port) port = ntohs(addr6.sin6_port); got_v6 = true; } #endif #ifdef AGEDU_IPV4 if (!got_v4 && ai->ai_family == AF_INET) { memcpy(&addr4, ai->ai_addr, ai->ai_addrlen); if (portstr && !port) port = ntohs(addr4.sin_port); got_v4 = true; } #endif } #elif HAVE_GETHOSTBYNAME /* * IPv4-only setup using inet_addr and gethostbyname. 
*/ struct hostent *h; memset(&addr4, 0, sizeof(addr4)); addr4.sin_family = AF_INET; if (!address) { addr4.sin_addr.s_addr = htons(INADDR_ANY); got_v4 = true; } else if (inet_aton(address, &addr4.sin_addr)) { got_v4 = true; /* numeric address */ } else if ((h = gethostbyname(address)) != NULL) { memcpy(&addr4.sin_addr, h->h_addr, sizeof(addr4.sin_addr)); got_v4 = true; } else { fprintf(stderr, "gethostbyname: %s\n", hstrerror(h_errno)); return -1; } if (portstr) { struct servent *s; if (!portstr[strspn(portstr, "0123456789")]) { port = atoi(portstr); } else if ((s = getservbyname(portstr, NULL)) != NULL) { port = ntohs(s->s_port); } else { fprintf(stderr, "getservbyname: port '%s' not understood\n", portstr); return -1; } } #endif #ifdef AGEDU_IPV6 #ifdef AGEDU_IPV4 retry: #endif if (got_v6) { fds->v6 = socket(PF_INET6, SOCK_STREAM, 0); if (fds->v6 < 0) { fprintf(stderr, "socket(PF_INET6): %s\n", strerror(errno)); goto done_v6; } #ifdef IPV6_V6ONLY { int i = 1; if (setsockopt(fds->v6, IPPROTO_IPV6, IPV6_V6ONLY, (char *)&i, sizeof(i)) < 0) { fprintf(stderr, "setsockopt(IPV6_V6ONLY): %s\n", strerror(errno)); close(fds->v6); fds->v6 = -1; goto done_v6; } } #endif /* IPV6_V6ONLY */ addr6.sin6_port = htons(port); addrlen = sizeof(addr6); if (bind(fds->v6, (const struct sockaddr *)&addr6, addrlen) < 0) { fprintf(stderr, "bind: %s\n", strerror(errno)); close(fds->v6); fds->v6 = -1; goto done_v6; } if (listen(fds->v6, 5) < 0) { fprintf(stderr, "listen: %s\n", strerror(errno)); close(fds->v6); fds->v6 = -1; goto done_v6; } if (port == 0) { addrlen = sizeof(addr6); if (getsockname(fds->v6, (struct sockaddr *)&addr6, &addrlen) < 0) { fprintf(stderr, "getsockname: %s\n", strerror(errno)); close(fds->v6); fds->v6 = -1; goto done_v6; } port = ntohs(addr6.sin6_port); } } done_v6: #endif #ifdef AGEDU_IPV4 if (got_v4) { fds->v4 = socket(PF_INET, SOCK_STREAM, 0); if (fds->v4 < 0) { fprintf(stderr, "socket(PF_INET): %s\n", strerror(errno)); goto done_v4; } addr4.sin_port = 
htons(port); addrlen = sizeof(addr4); if (bind(fds->v4, (const struct sockaddr *)&addr4, addrlen) < 0) { #ifdef AGEDU_IPV6 if (fds->v6 >= 0) { /* * If we support both v6 and v4, it's a failure * condition if we didn't manage to bind to both. If * the port number was arbitrary, we go round and try * again. Otherwise, give up. */ close(fds->v6); close(fds->v4); fds->v6 = fds->v4 = -1; port = 0; if (!portstr) goto retry; } #endif fprintf(stderr, "bind: %s\n", strerror(errno)); close(fds->v4); fds->v4 = -1; goto done_v4; } if (listen(fds->v4, 5) < 0) { fprintf(stderr, "listen: %s\n", strerror(errno)); close(fds->v4); fds->v4 = -1; goto done_v4; } if (port == 0) { addrlen = sizeof(addr4); if (getsockname(fds->v4, (struct sockaddr *)&addr4, &addrlen) < 0) { fprintf(stderr, "getsockname: %s\n", strerror(errno)); close(fds->v4); fds->v4 = -1; goto done_v4; } port = ntohs(addr4.sin_port); } } done_v4: #endif if (fds->v6 >= 0 || fds->v4 >= 0) return port; else return -1; } void run_httpd(const void *t, int authmask, const struct httpd_config *dcfg, const struct html_config *incfg) { struct listenfds lfds; int port; int authtype; char *authstring = NULL; char *hostname; const char *openbracket, *closebracket; struct html_config cfg = *incfg; /* * Establish the listening socket(s) and retrieve its port * number. 
*/ port = make_listening_sockets(&lfds, dcfg->address, dcfg->port, &hostname); if (port < 0) exit(1); /* already reported an error */ if ((authmask & HTTPD_AUTH_MAGIC) && (lfds.v4 < 0 || check_owning_uid(lfds.v4, 1) == getuid()) && (lfds.v6 < 0 || check_owning_uid(lfds.v6, 1) == getuid())) { authtype = HTTPD_AUTH_MAGIC; if (authmask != HTTPD_AUTH_MAGIC) printf("Using Linux /proc/net magic authentication\n"); } else if ((authmask & HTTPD_AUTH_BASIC)) { char username[128], password[128], userpassbuf[259]; const char *userpass; const char *rname; unsigned char passbuf[10]; int i, j, k, fd; authtype = HTTPD_AUTH_BASIC; if (authmask != HTTPD_AUTH_BASIC) printf("Using HTTP Basic authentication\n"); if (dcfg->basicauthdata) { userpass = dcfg->basicauthdata; } else { strcpy(username, PNAME); rname = "/dev/urandom"; fd = open(rname, O_RDONLY); if (fd < 0) { int err = errno; rname = "/dev/random"; fd = open(rname, O_RDONLY); if (fd < 0) { int err2 = errno; fprintf(stderr, "/dev/urandom: open: %s\n", strerror(err)); fprintf(stderr, "/dev/random: open: %s\n", strerror(err2)); exit(1); } } for (i = 0; i < 10 ;) { j = read(fd, passbuf + i, 10 - i); if (j <= 0) { fprintf(stderr, "%s: read: %s\n", rname, j < 0 ? strerror(errno) : "unexpected EOF"); exit(1); } i += j; } close(fd); for (i = 0; i < 16; i++) { /* * 32 characters out of the 36 alphanumerics gives * me the latitude to discard i,l,o for being too * numeric-looking, and w because it has two too * many syllables and one too many presidential * associations. */ static const char chars[32] = "0123456789abcdefghjkmnpqrstuvxyz"; int v = 0; k = i / 8 * 5; for (j = 0; j < 5; j++) v |= ((passbuf[k+j] >> (i%8)) & 1) << j; password[i] = chars[v]; } password[i] = '\0'; sprintf(userpassbuf, "%s:%s", username, password); userpass = userpassbuf; printf("Username: %s\nPassword: %s\n", username, password); } k = strlen(userpass); authstring = snewn(k * 4 / 3 + 16, char); for (i = j = 0; i < k ;) { int s = k-i < 3 ? 
k-i : 3; base64_encode_atom((unsigned char *)(userpass+i), s, authstring+j); i += s; j += 4; } authstring[j] = '\0'; } else if ((authmask & HTTPD_AUTH_NONE)) { authtype = HTTPD_AUTH_NONE; if (authmask != HTTPD_AUTH_NONE) printf("Web server is unauthenticated\n"); } else { fprintf(stderr, PNAME ": authentication method not supported\n"); exit(1); } if (strchr(hostname, ':')) { /* If the hostname is an IPv6 address literal, enclose it in * square brackets to prevent misinterpretation of the * colons. */ openbracket = "["; closebracket = "]"; } else { openbracket = closebracket = ""; } char *url; if (port == 80) { url = dupfmt("http://%s%s%s/", openbracket, hostname, closebracket); } else { url = dupfmt("http://%s%s%s:%d/", openbracket, hostname, closebracket, port); } printf("URL: %s\n", url); fflush(stdout); if (dcfg->url_launch_command) { pid_t pid = fork(); if (pid < 0) { fprintf(stderr, "Unable to fork for launch command: %s\n", strerror(errno)); } else if (pid == 0) { char *args[5]; args[0] = dupstr("sh"); args[1] = dupstr("-c"); args[2] = dupfmt("%s \"$0\"", dcfg->url_launch_command); args[3] = dupstr(url); args[4] = NULL; execvp("/bin/sh", args); _exit(127); } else { int status; if (waitpid(pid, &status, 0) < 0) { fprintf(stderr, "Unable to wait for launch command: %s\n", strerror(errno)); } else if (WIFSIGNALED(status)) { int sig = WTERMSIG(status); fprintf(stderr, "Launch command terminated with signal " "%d%s%s%s\n", sig, #if HAVE_STRSIGNAL " (", strsignal(sig), ")" #else "", "", "" #endif ); } else { int exitcode = WEXITSTATUS(status); if (exitcode) { fprintf(stderr, "Launch command terminated with status %d\n", exitcode); } } } } sfree(url); /* * Now construct fd structure(s) to hold the listening sockets. */ if (lfds.v4 >= 0) new_fdstruct(lfds.v4, FD_LISTENER); if (lfds.v6 >= 0) new_fdstruct(lfds.v6, FD_LISTENER); if (dcfg->closeoneof) { /* * Read from standard input, and treat EOF as a notification * to exit. 
*/ new_fdstruct(0, FD_CLIENT); } /* * Now we're ready to run our main loop. Keep looping round on * select. */ while (1) { fd_set rfds, wfds; int i, j; int maxfd; int ret; #define FD_SET_MAX(fd, set, max) \ do { FD_SET((fd),(set)); (max) = ((max)<=(fd)?(fd)+1:(max)); } while(0) /* * Loop round the fd list putting fds into our select * sets. Also in this loop we remove any that were marked * as deleted in the previous loop. */ FD_ZERO(&rfds); FD_ZERO(&wfds); maxfd = 0; for (i = j = 0; j < nfds; j++) { if (fds[j].deleted) { sfree(fds[j].wdata); free_connection(fds[j].cctx); continue; } fds[i] = fds[j]; switch (fds[i].type) { case FD_CLIENT: FD_SET_MAX(fds[i].fd, &rfds, maxfd); break; case FD_LISTENER: FD_SET_MAX(fds[i].fd, &rfds, maxfd); break; case FD_CONNECTION: /* * Always read from a connection socket. Even * after we've started writing, the peer might * still be sending (e.g. because we shamefully * jumped the gun before waiting for the end of * the HTTP request) and so we should be prepared * to read data and throw it away. */ FD_SET_MAX(fds[i].fd, &rfds, maxfd); /* * Also attempt to write, if we have data to write. */ if (fds[i].wdatapos < fds[i].wdatalen) FD_SET_MAX(fds[i].fd, &wfds, maxfd); break; } i++; } nfds = i; ret = select(maxfd, &rfds, &wfds, NULL, NULL); if (ret <= 0) { if (ret < 0 && (errno != EINTR)) { fprintf(stderr, "select: %s", strerror(errno)); exit(1); } continue; } for (i = 0; i < nfds; i++) { switch (fds[i].type) { case FD_CLIENT: if (FD_ISSET(fds[i].fd, &rfds)) { char buf[4096]; int ret = read(fds[i].fd, buf, sizeof(buf)); if (ret <= 0) { if (ret < 0) { fprintf(stderr, "standard input: read: %s\n", strerror(errno)); exit(1); } return; } } break; case FD_LISTENER: if (FD_ISSET(fds[i].fd, &rfds)) { /* * New connection has come in. Accept it. 
*/ struct fd *f; struct sockaddr_in addr; socklen_t addrlen = sizeof(addr); int newfd = accept(fds[i].fd, (struct sockaddr *)&addr, &addrlen); if (newfd < 0) break; /* not sure what happened there */ f = new_fdstruct(newfd, FD_CONNECTION); f->cctx = new_connection(t); if (authtype == HTTPD_AUTH_MAGIC) check_magic_access(f); } break; case FD_CONNECTION: if (FD_ISSET(fds[i].fd, &rfds)) { /* * There's data to be read. */ char readbuf[4096]; int ret; ret = read(fds[i].fd, readbuf, sizeof(readbuf)); if (ret <= 0) { /* * This shouldn't happen in a sensible * HTTP connection, so we abandon the * connection if it does. */ close(fds[i].fd); fds[i].deleted = true; break; } else { if (!fds[i].wdata) { /* * If we haven't got an HTTP response * yet, keep processing data in the * hope of acquiring one. */ fds[i].wdata = got_data (fds[i].cctx, readbuf, ret, (authtype == HTTPD_AUTH_NONE || fds[i].magic_access), authstring, &cfg); if (fds[i].wdata) { fds[i].wdatalen = strlen(fds[i].wdata); fds[i].wdatapos = 0; } } else { /* * Otherwise, just drop our read data * on the floor. */ } } } if (FD_ISSET(fds[i].fd, &wfds) && fds[i].wdatapos < fds[i].wdatalen) { /* * The socket is writable, and we have data to * write. Write it. */ int ret = write(fds[i].fd, fds[i].wdata + fds[i].wdatapos, fds[i].wdatalen - fds[i].wdatapos); if (ret <= 0) { /* * Shouldn't happen; abandon the connection. */ close(fds[i].fd); fds[i].deleted = true; break; } else { fds[i].wdatapos += ret; if (fds[i].wdatapos == fds[i].wdatalen) { shutdown(fds[i].fd, SHUT_WR); } } } break; } } } } agedu-20211129.8cd63c5/html.h0000644000175000017500000001404014151034324014265 0ustar simonsimon/* * html.h: HTML output format for agedu. */ struct html_config { /* * Configure the format of the URI pathname fragment corresponding * to a given tree entry. 
* * 'uriformat' is expected to have the following format: * - it consists of one or more _options_, each indicating a * particular way to format a URI, separated by '%|' * - each option contains _at most one_ formatting directive; * without any, it is assumed to only be able to encode the * root tree entry * - the formatting directive may be followed before and/or * afterwards with literal text; percent signs in that literal * text are specified as %% (which doesn't count as a * formatting directive for the 'at most one' rule) * - formatting directives are as follows: * + '%n' outputs the numeric index (in decimal) of the tree * entry * + '%p' outputs the pathname of the tree entry, not counting * any common prefix of the whole tree or a subdirectory * separator following that (so that the root directory of * the tree will always be rendered as the empty string). * The subdirectory separator is translated into '/'; any * remotely worrying character is escaped as = followed by * two hex digits (including, in particular, = itself). The * only characters not escaped are the ASCII alphabets and * numbers, the subdirectory separator as mentioned above, * and the four punctuation characters -.@_ (with the * exception that at the very start of a pathname, even '.' * is escaped). * - '%/p' outputs the pathname of the tree entry, but this time * the subdirectory separator is also considered to be a * worrying character and is escaped. * - '%-p' and '%-/p' are like '%p' and '%/p' respectively, * except that they use the full pathname stored in the tree * without stripping a common prefix. * * These formats are used both for generating and parsing URI * fragments. When generating, the first valid option is used * (which is always the very first one if we're generating the * root URI, or else it's the first option with any formatting * directive); when parsing, the first option that matches will be * accepted. 
(Thus, you can have '.../subdir' and '.../subdir/' * both accepted, but make the latter canonical; clients of this * mechanism will typically regenerate a URI string after parsing * an index out of it, and return an HTTP redirect if it isn't in * canonical form.) * * All hyperlinks should be correctly generated as relative (i.e. * with the right number of ../ and ./ considering both the * pathname for the page currently being generated, and the one * for the link target). * * If 'uriformat' is NULL, the HTML is generated without hyperlinks. */ const char *uriformat; /* * Configure the filenames output by html_dump(). These can be * configured separately from the URI formats, so that the root * file can be called index.html on disk but have a notional URI * of just / or similar. * * Formatting directives are the same as the uriformat above. */ const char *fileformat; /* * Time stamps to assign to the extreme ends of the colour * scale. If "autoage" is true, they are ignored and the time * stamps are derived from the limits of the age data stored * in the index. */ bool autoage; time_t oldest, newest; /* * Specify whether to show individual files as well as * directories in the report. */ bool showfiles; /* * The string appearing in the part of HTML pages, before * a colon followed by the path being served. Default is "agedu". */ const char *html_title; }; /* * Parse a URI pathname segment against the URI formats specified in * 'cfg', and return a numeric index in '*index'. Return value is true * on success, or false if the pathname makes no sense, or the index * is out of range, or the index does not correspond to a directory in * the trie. */ int html_parse_path(const void *t, const char *path, const struct html_config *cfg, unsigned long *index); /* * Generate a URI pathname segment from an index. 
 */
char *html_format_path(const void *t, const struct html_config *cfg,
                       unsigned long index);

/*
 * Generate an HTML document containing the results of a query
 * against the pathname at a given index. Returns a dynamically
 * allocated piece of memory containing the entire HTML document,
 * as an ordinary C zero-terminated string.
 *
 * 'downlink' is true if hyperlinks should be generated for
 * subdirectories. (This can also be disabled by setting
 * cfg->uriformat to NULL, but that also disables the upward
 * hyperlinks to parent directories. Setting cfg->uriformat to
 * non-NULL but downlink to false will generate uplinks but no
 * downlinks.)
 */
char *html_query(const void *t, unsigned long index,
                 const struct html_config *cfg, bool downlink);

/*
 * Recursively output a dump of lots of HTML files which crosslink
 * to each other. cfg->fileformat will be used to generate the
 * filenames for both the hyperlinks and the output file names; the
 * file names will have "pathprefix" prepended to them before being
 * opened.
 *
 * "index" and "endindex" point to the region of index file that
 * should be generated by the dump, which must be a subdirectory.
 *
 * "maxdepth" limits the depth of recursion. Setting it to zero
 * outputs only one page, 1 outputs the current directory and its
 * immediate children but no further, and so on. Making it negative
 * gives unlimited depth.
 *
 * Return value is 0 on success, or 1 if an error occurs during
 * output.
 */
int html_dump(const void *t, unsigned long index, unsigned long endindex,
              int maxdepth, const struct html_config *cfg,
              const char *pathprefix);
agedu-20211129.8cd63c5/html.c 0000644 0001750 0001750 00000102301 14151034324 014256 0 ustar simon simon
/*
 * html.c: implementation of html.h.
 */

#include "agedu.h"
#include "html.h"
#include "alloc.h"
#include "trie.h"
#include "index.h"

#define MAXCOLOUR 511

struct html {
    char *buf;
    size_t buflen, bufsize;
    const void *t;
    unsigned long long totalsize, oldest, newest;
    char *path2;
    char *oururi;
    size_t hreflen;
    const char *uriformat;
    unsigned long long thresholds[MAXCOLOUR];
    char *titletexts[MAXCOLOUR+1];
    time_t now;
};

static void vhtprintf(struct html *ctx, const char *fmt, va_list ap)
{
    va_list ap2;
    int size, size2;
    char testbuf[2];

    va_copy(ap2, ap);
    /*
     * Some C libraries (Solaris, I'm looking at you) don't like
     * an output buffer size of zero in vsnprintf, but will return
     * sensible values given any non-zero buffer size. Hence, we
     * use testbuf to gauge the length of the string.
*/ size = vsnprintf(testbuf, 1, fmt, ap2); va_end(ap2); if (ctx->buflen + size >= ctx->bufsize) { ctx->bufsize = (ctx->buflen + size) * 3 / 2 + 1024; ctx->buf = sresize(ctx->buf, ctx->bufsize, char); } size2 = vsnprintf(ctx->buf + ctx->buflen, ctx->bufsize - ctx->buflen, fmt, ap); assert(size == size2); ctx->buflen += size; } static void htprintf(struct html *ctx, const char *fmt, ...) { va_list ap; va_start(ap, fmt); vhtprintf(ctx, fmt, ap); va_end(ap); } static unsigned long long round_and_format_age(struct html *ctx, unsigned long long age, char *buf, int direction) { struct tm tm, tm2; char newbuf[80]; unsigned long long ret, newret; int i; int ym; static const int minutes[] = { 5, 10, 15, 30, 45 }; tm = *localtime(&ctx->now); ym = tm.tm_year * 12 + tm.tm_mon; ret = ctx->now; strcpy(buf, "Now"); for (i = 0; i < lenof(minutes); i++) { newret = ctx->now - minutes[i] * 60; sprintf(newbuf, "%d minutes", minutes[i]); if (newret < age) goto finish; strcpy(buf, newbuf); ret = newret; } for (i = 1; i < 24; i++) { newret = ctx->now - i * (60*60); sprintf(newbuf, "%d hour%s", i, i==1 ? "" : "s"); if (newret < age) goto finish; strcpy(buf, newbuf); ret = newret; } for (i = 1; i < 7; i++) { newret = ctx->now - i * (24*60*60); sprintf(newbuf, "%d day%s", i, i==1 ? "" : "s"); if (newret < age) goto finish; strcpy(buf, newbuf); ret = newret; } for (i = 1; i < 4; i++) { newret = ctx->now - i * (7*24*60*60); sprintf(newbuf, "%d week%s", i, i==1 ? "" : "s"); if (newret < age) goto finish; strcpy(buf, newbuf); ret = newret; } for (i = 1; i < 11; i++) { tm2 = tm; /* structure copy */ tm2.tm_year = (ym - i) / 12; tm2.tm_mon = (ym - i) % 12; newret = mktime(&tm2); sprintf(newbuf, "%d month%s", i, i==1 ? "" : "s"); if (newret < age) goto finish; strcpy(buf, newbuf); ret = newret; } for (i = 1;; i++) { tm2 = tm; /* structure copy */ tm2.tm_year = (ym - i*12) / 12; tm2.tm_mon = (ym - i*12) % 12; newret = mktime(&tm2); sprintf(newbuf, "%d year%s", i, i==1 ? 
"" : "s"); if (newret < age) goto finish; strcpy(buf, newbuf); ret = newret; } finish: if (direction > 0) { /* * Round toward newest, i.e. use the existing (buf,ret). */ } else if (direction < 0) { /* * Round toward oldest, i.e. use (newbuf,newret); */ strcpy(buf, newbuf); ret = newret; } else { /* * Round to nearest. */ if (ret - age > age - newret) { strcpy(buf, newbuf); ret = newret; } } return ret; } static void get_indices(const void *t, char *path, unsigned long *xi1, unsigned long *xi2) { size_t pathlen = strlen(path); int c1 = path[pathlen], c2 = (pathlen > 0 ? path[pathlen-1] : 0); *xi1 = trie_before(t, path); make_successor(path); *xi2 = trie_before(t, path); path[pathlen] = c1; if (pathlen > 0) path[pathlen-1] = c2; } static unsigned long long fetch_size(const void *t, unsigned long xi1, unsigned long xi2, unsigned long long atime) { if (xi2 - xi1 == 1) { /* * We are querying an individual file, so we should not * depend on the index entries either side of the node, * since they almost certainly don't both exist. Instead, * just look up the file's size and atime in the main trie. 
 */
        const struct trie_file *f = trie_getfile(t, xi1);
        if (f->atime < atime)
            return f->size;
        else
            return 0;
    } else {
        return index_query(t, xi2, atime) - index_query(t, xi1, atime);
    }
}

static void htescape(struct html *ctx, const char *s, int n, int italics)
{
    while (n > 0 && *s) {
        unsigned char c = (unsigned char)*s++;

        if (c == '&')
            htprintf(ctx, "&amp;");
        else if (c == '<')
            htprintf(ctx, "&lt;");
        else if (c == '>')
            htprintf(ctx, "&gt;");
        else if (c >= ' ' && c < '\177')
            htprintf(ctx, "%c", c);
        else {
            if (italics)
                htprintf(ctx, "<i>");
            htprintf(ctx, "[%02x]", c);
            if (italics)
                htprintf(ctx, "</i>");
        }

        n--;
    }
}

static void begin_colour_bar(struct html *ctx)
{
    htprintf(ctx, "<table cellspacing=0 cellpadding=0"
             " style=\"border:0\">\n<tr>\n");
}

static void add_to_colour_bar(struct html *ctx, int colour, int pixels)
{
    int r, g, b;

    if (colour >= 0 && colour < 256)   /* red -> yellow fade */
        r = 255, g = colour, b = 0;
    else if (colour >= 256 && colour <= 511)   /* yellow -> green fade */
        r = 511 - colour, g = 255, b = 0;
    else                               /* background grey */
        r = g = b = 240;

    if (pixels > 0) {
        htprintf(ctx, "<td style=\"width:%dpx; height:1em; "
                 "background-color:#%02x%02x%02x\"",
                 pixels, r, g, b);
        if (colour >= 0)
            htprintf(ctx, " title=\"%s\"", ctx->titletexts[colour]);
        htprintf(ctx, "></td>\n");
    }
}

static void end_colour_bar(struct html *ctx)
{
    htprintf(ctx, "</tr>\n</table>\n");
}

struct vector {
    bool want_href, essential;
    char *name;
    bool literal;     /* should the name be formatted in fixed-pitch?
*/ unsigned long index; unsigned long long sizes[MAXCOLOUR+1]; }; int vec_compare(const void *av, const void *bv) { const struct vector *a = *(const struct vector **)av; const struct vector *b = *(const struct vector **)bv; if (a->sizes[MAXCOLOUR] > b->sizes[MAXCOLOUR]) return -1; else if (a->sizes[MAXCOLOUR] < b->sizes[MAXCOLOUR]) return +1; else if (a->want_href < b->want_href) return +1; else if (a->want_href > b->want_href) return -1; else if (a->want_href) return strcmp(a->name, b->name); else if (a->index < b->index) return -1; else if (a->index > b->index) return +1; else if (a->essential < b->essential) return +1; else if (a->essential > b->essential) return -1; return 0; } static struct vector *make_vector(struct html *ctx, char *path, bool want_href, bool essential, char *name, bool literal) { unsigned long xi1, xi2; struct vector *vec = snew(struct vector); int i; vec->want_href = want_href; vec->essential = essential; vec->name = name ? dupstr(name) : NULL; vec->literal = literal; get_indices(ctx->t, path, &xi1, &xi2); vec->index = xi1; for (i = 0; i <= MAXCOLOUR; i++) { unsigned long long atime; if (i == MAXCOLOUR) atime = ULLONG_MAX; else atime = ctx->thresholds[i]; vec->sizes[i] = fetch_size(ctx->t, xi1, xi2, atime); } return vec; } static void print_heading(struct html *ctx, const char *title) { htprintf(ctx, "<tr style=\"padding: 0.2em; background-color:#e0e0e0\">\n" "<td colspan=4 align=center>%s</td>\n</tr>\n", title); } static void compute_display_size(unsigned long long size, const char **fmt, double *display_size) { static const char *const fmts[] = { "%g B", "%g kB", "%#.1f MB", "%#.1f GB", "%#.1f TB", "%#.1f PB", "%#.1f EB", "%#.1f ZB", "%#.1f YB" }; int shift = 0; unsigned long long tmpsize; double denominator; tmpsize = size; denominator = 1.0; while (tmpsize >= 1024 && shift < lenof(fmts)-1) { tmpsize >>= 10; denominator *= 1024.0; shift++; } *display_size = size / denominator; *fmt = fmts[shift]; } struct format_option { const char 
*prefix, *suffix; /* may include '%%' */ int prefixlen, suffixlen; /* does not count '%%' */ char fmttype; /* 0 for none, or 'n' or 'p' */ bool translate_pathsep; /* pathsep rendered as '/'? */ bool shorten_path; /* omit common prefix? */ }; /* * Gets the next format option from a format string. Advances '*fmt' * past it, or sets it to NULL if nothing is left. */ struct format_option get_format_option(const char **fmt) { struct format_option ret; /* * Scan for prefix of format. */ ret.prefix = *fmt; ret.prefixlen = 0; while (1) { if (**fmt == '\0') { /* * No formatting directive, and this is the last option. */ ret.suffix = *fmt; ret.suffixlen = 0; ret.fmttype = '\0'; *fmt = NULL; return ret; } else if (**fmt == '%') { if ((*fmt)[1] == '%') { (*fmt) += 2; /* just advance one extra */ ret.prefixlen++; } else if ((*fmt)[1] == '|') { /* * No formatting directive. */ ret.suffix = *fmt; ret.suffixlen = 0; ret.fmttype = '\0'; (*fmt) += 2; /* advance to start of next option */ return ret; } else { break; } } else { (*fmt)++; /* normal character */ ret.prefixlen++; } } /* * Interpret formatting directive with flags. */ (*fmt)++; ret.translate_pathsep = true; ret.shorten_path = true; while (1) { char c = *(*fmt)++; assert(c); if (c == '/') { ret.translate_pathsep = false; } else if (c == '-') { ret.shorten_path = false; } else { assert(c == 'n' || c == 'p'); ret.fmttype = c; break; } } /* * Scan for suffix. */ ret.suffix = *fmt; ret.suffixlen = 0; while (1) { if (**fmt == '\0') { /* * This is the last option. 
*/ *fmt = NULL; return ret; } else if (**fmt != '%') { (*fmt)++; /* normal character */ ret.suffixlen++; } else { if ((*fmt)[1] == '%') { (*fmt) += 2; /* just advance one extra */ ret.suffixlen++; } else { assert((*fmt)[1] == '|'); (*fmt) += 2; /* advance to start of next option */ return ret; } } } } char *format_string_inner(const char *fmt, int nescape, unsigned long index, const void *t) { int maxlen; char *ret = NULL, *p = NULL; char *path = NULL, *q = NULL; char pathsep = trie_pathsep(t); int maxpathlen = trie_maxpathlen(t); int charindex; while (fmt) { struct format_option opt = get_format_option(&fmt); if (index && !opt.fmttype) continue; /* option is only good for the root, which this isn't */ maxlen = opt.prefixlen + opt.suffixlen + 1; switch (opt.fmttype) { case 'n': maxlen += 40; /* generous length for an integer */ break; case 'p': maxlen += 3*maxpathlen; /* might have to escape everything */ break; } ret = snewn(maxlen, char); p = ret; while (opt.prefixlen-- > 0) { if ((*p++ = *opt.prefix++) == '%') opt.prefix++; } switch (opt.fmttype) { case 'n': p += sprintf(p, "%lu", index); break; case 'p': path = snewn(1+trie_maxpathlen(t), char); if (opt.shorten_path) { trie_getpath(t, 0, path); q = path + strlen(path); trie_getpath(t, index, path); if (*q == pathsep) q++; } else { trie_getpath(t, index, path); q = path; } charindex = 0; while (*q) { char c = *q++; if (c == pathsep && opt.translate_pathsep) { *p++ = '/'; charindex = 0; } else if (charindex < nescape || (!isalnum((unsigned char)c) && ((charindex == 0 && c=='.') || !strchr("-.@_", c)))) { p += sprintf(p, "=%02X", (unsigned char)c); charindex++; } else { *p++ = c; charindex++; } } sfree(path); break; } while (opt.suffixlen-- > 0) { if ((*p++ = *opt.suffix++) == '%') opt.suffix++; } *p = '\0'; assert(p - ret < maxlen); return ret; } assert(!"Getting here implies an incomplete set of formats"); } int parse_path(const void *t, const char *path, const char *fmt, unsigned long *index) { int len = 
strlen(path); int midlen; const char *p, *q; char *r; char pathsep = trie_pathsep(t); while (fmt) { struct format_option opt = get_format_option(&fmt); /* * Check prefix and suffix. */ midlen = len - opt.prefixlen - opt.suffixlen; if (midlen < 0) continue; /* prefix and suffix don't even fit */ p = path; while (opt.prefixlen > 0) { char c = *opt.prefix++; if (c == '%') opt.prefix++; if (*p != c) break; p++; opt.prefixlen--; } if (opt.prefixlen > 0) continue; /* prefix didn't match */ q = path + len - opt.suffixlen; while (opt.suffixlen > 0) { char c = *opt.suffix++; if (c == '%') opt.suffix++; if (*q != c) break; q++; opt.suffixlen--; } if (opt.suffixlen > 0) continue; /* suffix didn't match */ /* * Check the data in between. p points at it, and it's midlen * characters long. */ if (opt.fmttype == '\0') { if (midlen == 0) { /* * Successful match against a root format. */ *index = 0; return 1; } } else if (opt.fmttype == 'n') { *index = 0; while (midlen > 0) { if (*p >= '0' && *p <= '9') *index = *index * 10 + (*p - '0'); else break; midlen--; p++; } if (midlen == 0) { /* * Successful match against a numeric format. */ return 1; } } else { assert(opt.fmttype == 'p'); int maxoutlen = trie_maxpathlen(t) + 1; int maxinlen = midlen + 1; char triepath[maxinlen+maxoutlen]; if (opt.shorten_path) { trie_getpath(t, 0, triepath); r = triepath + strlen(triepath); if (r > triepath && r[-1] != pathsep) *r++ = pathsep; } else { r = triepath; } while (midlen > 0) { if (*p == '/' && opt.translate_pathsep) { *r++ = pathsep; p++; midlen--; } else if (*p == '=') { /* * We intentionally do not check whether the * escaped character _should_ have been escaped * according to the rules in html_format_path. * * All clients of this parsing function, after a * successful parse, call html_format_path to find * the canonical URI for the same index and return * an HTTP redirect if the provided URI was not * exactly equal to that canonical form. 
This is * critical when the correction involves adding or * removing a trailing slash (because then * relative hrefs on the generated page can be * computed with respect to the canonical URI * instead of having to remember what the actual * URI was), but also has the useful effect that * if a user attempts to type in (guess) a URI by * hand they don't have to remember the escaping * rules - as long as they type _something_ that * this code can parse into a recognisable * pathname, it will be automatically 301ed into * the canonical form. */ if (midlen < 3 || !isxdigit((unsigned char)p[1]) || !isxdigit((unsigned char)p[2])) break; /* faulty escape encoding */ char x[3]; unsigned cval; x[0] = p[1]; x[1] = p[2]; x[2] = '\0'; sscanf(x, "%x", &cval); *r++ = cval; p += 3; midlen -= 3; } else { *r++ = *p; p++; midlen--; } } if (midlen > 0) continue; /* something went wrong in that loop */ assert(r - triepath < maxinlen+maxoutlen); *r = '\0'; unsigned long gotidx = trie_before(t, triepath); if (gotidx >= trie_count(t)) continue; /* index out of range */ char retpath[1+maxoutlen]; trie_getpath(t, gotidx, retpath); if (strcmp(triepath, retpath)) continue; /* exact path not found in trie */ if (!index_has_root(t, gotidx)) continue; /* path is not a directory */ /* * Successful path-based match. */ *index = gotidx; return 1; } } return 0; /* no match from any format option */ } char *format_string(const char *fmt, unsigned long index, const void *t) { unsigned long indexout; char *ret; int nescape = 0; /* * Format the string using whichever format option first works. */ ret = format_string_inner(fmt, 0, index, t); /* * Now re-_parse_ the string, to see if it gives the same index * back. 
It might not, if a pathname is valid in two formats: for * instance, if you use '-H -d max' to generate a static HTML dump * from scanning a directory which has a subdir called 'index', * you might well find that the top-level file wants to be called * index.html and so does the one for that subdir. * * We fix this by formatting the string again with more and more * characters escaped, so that the non-root 'index.html' becomes * (e.g.) '=69ndex.html', or '=69=6edex.html' if that doesn't * work, etc. */ while (1) { /* * Parse the pathname and see if it gives the right index. */ int parseret = parse_path(t, ret, fmt, &indexout); assert(parseret != 0); if (indexout == index) break; /* path now parses successfully */ /* * If not, try formatting it again. */ char *new = format_string_inner(fmt, ++nescape, index, t); assert(strcmp(new, ret)); /* if nescape gets too big, give up */ sfree(ret); ret = new; } return ret; } char *html_format_path(const void *t, const struct html_config *cfg, unsigned long index) { return format_string(cfg->uriformat, index, t); } int html_parse_path(const void *t, const char *path, const struct html_config *cfg, unsigned long *index) { return parse_path(t, path, cfg->uriformat, index); } char *make_href(const char *source, const char *target) { /* * We insist that both source and target URIs start with a /, or * else we won't be reliably able to construct relative hrefs * between them (e.g. because we've got a suffix on the end of * some CGI pathname that this function doesn't know the final * component of). */ assert(*source == '/'); assert(*target == '/'); /* * Find the last / in source. Everything up to but not including * that is the directory to which the output href will be * relative. 
We enforce by assertion that there must be a / * somewhere in source, or else we can't construct a relative href * at all */ const char *sourceend = strrchr(source, '/'); assert(sourceend != NULL); /* * See how far the target URI agrees with the source one, up to * and including that /. */ const char *s = source, *t = target; while (s <= sourceend && *s == *t) s++, t++; /* * We're only interested in agreement of complete path components, * so back off until we're sitting just after a shared /. */ while (s > source && s[-1] != '/') s--, t--; assert(s > source); /* * Now we need some number of levels of "../" to get from source * to here, and then we just replicate the rest of 'target'. */ int levels = 0; while (s <= sourceend) { if (*s == '/') levels++; s++; } int len = 3*levels + strlen(t); if (len == 0) { /* One last special case: if target has no tail _and_ we * haven't written out any "../". */ return dupstr("./"); } else { char *ret = snewn(len+1, char); char *p = ret; while (levels-- > 0) { *p++ = '.'; *p++ = '.'; *p++ = '/'; } strcpy(p, t); return ret; } } #define PIXEL_SIZE 600 /* FIXME: configurability? */ static void write_report_line(struct html *ctx, struct vector *vec) { unsigned long long size, asize, divisor; double display_size; int pix, newpix; int i; const char *unitsfmt; /* * A line with literally zero space usage should not be * printed at all if it's a link to a subdirectory (since it * probably means the whole thing was excluded by some * --exclude-path wildcard). If it's [files] or the top-level * line, though, we must always print _something_, and in that * case we must fiddle about to prevent divisions by zero in * the code below. */ if (!vec->sizes[MAXCOLOUR] && !vec->essential) return; divisor = ctx->totalsize; if (!divisor) { divisor = 1; } /* * Find the total size of this subdirectory. 
*/ size = vec->sizes[MAXCOLOUR]; compute_display_size(size, &unitsfmt, &display_size); htprintf(ctx, "<tr>\n" "<td style=\"padding: 0.2em; text-align: right\">"); htprintf(ctx, unitsfmt, display_size); htprintf(ctx, "</td>\n"); /* * Generate a colour bar. */ htprintf(ctx, "<td style=\"padding: 0.2em\">\n"); begin_colour_bar(ctx); pix = 0; for (i = 0; i <= MAXCOLOUR; i++) { asize = vec->sizes[i]; newpix = asize * PIXEL_SIZE / divisor; add_to_colour_bar(ctx, i, newpix - pix); pix = newpix; } add_to_colour_bar(ctx, -1, PIXEL_SIZE - pix); end_colour_bar(ctx); htprintf(ctx, "</td>\n"); /* * Output size as a percentage of totalsize. */ htprintf(ctx, "<td style=\"padding: 0.2em; text-align: right\">" "%.2f%%</td>\n", (double)size / divisor * 100.0); /* * Output a subdirectory marker. */ htprintf(ctx, "<td style=\"padding: 0.2em\">"); if (vec->name) { bool doing_href = false; if (ctx->uriformat && vec->want_href) { char *targeturi = format_string(ctx->uriformat, vec->index, ctx->t); char *link = make_href(ctx->oururi, targeturi); htprintf(ctx, "<a href=\"%s\">", link); sfree(link); sfree(targeturi); doing_href = true; } if (vec->literal) htprintf(ctx, "<code>"); htescape(ctx, vec->name, strlen(vec->name), 1); if (vec->literal) htprintf(ctx, "</code>"); if (doing_href) htprintf(ctx, "</a>"); } htprintf(ctx, "</td>\n</tr>\n"); } int strcmptrailingpathsep(const char *a, const char *b) { while (*a == *b && *a) a++, b++; if ((*a == pathsep && !a[1] && !*b) || (*b == pathsep && !b[1] && !*a)) return 0; return (int)(unsigned char)*a - (int)(unsigned char)*b; } char *html_query(const void *t, unsigned long index, const struct html_config *cfg, bool downlink) { struct html actx, *ctx = &actx; char *path, *path2, *p, *q; char agebuf1[80], agebuf2[80]; size_t pathlen, subdirpos; unsigned long index2; int i; struct vector **vecs; int nvecs, vecsize; unsigned long xi1, xi2, xj1, xj2; if (index >= trie_count(t)) return NULL; ctx->buf = NULL; ctx->buflen = ctx->bufsize = 0; ctx->t = t; 
    ctx->uriformat = cfg->uriformat;
    htprintf(ctx, "<html>\n");
    path = snewn(1+trie_maxpathlen(t), char);
    ctx->path2 = path2 = snewn(1+trie_maxpathlen(t), char);
    if (cfg->uriformat)
        ctx->oururi = format_string(cfg->uriformat, index, t);
    else
        ctx->oururi = NULL;

    /*
     * HEAD section.
     */
    htprintf(ctx, "<head>\n");
    trie_getpath(t, index, path);
    htprintf(ctx, "<title>");
    htescape(ctx, cfg->html_title, strlen(cfg->html_title), 0);
    htprintf(ctx, ": ");
    htescape(ctx, path, strlen(path), 0);
    htprintf(ctx, "</title>\n");
    htprintf(ctx, "</head>\n");

    /*
     * Begin BODY section.
     */
    htprintf(ctx, "<body>\n");
    htprintf(ctx, "<h3 align=center>Disk space breakdown by"
             " last-access time</h3>\n");

    /*
     * Show the pathname we're centred on, with hyperlinks to
     * parent directories where available.
     */
    htprintf(ctx, "<p align=center>\n<code>");
    q = path;
    for (p = strchr(path, pathsep); p && p[1]; p = strchr(p, pathsep)) {
        int doing_href = 0;
        char c, *zp;

        /*
         * See if this path prefix exists in the trie. If so,
         * generate a hyperlink.
         */
        zp = p;
        if (p == path)                 /* special case for "/" at start */
            zp++;

        p++;

        c = *zp;
        *zp = '\0';
        index2 = trie_before(t, path);
        trie_getpath(t, index2, path2);
        if (!strcmptrailingpathsep(path, path2) && cfg->uriformat) {
            char *targeturi = format_string(cfg->uriformat, index2, t);
            char *link = make_href(ctx->oururi, targeturi);
            htprintf(ctx, "<a href=\"%s\">", link);
            sfree(link);
            sfree(targeturi);
            doing_href = 1;
        }
        *zp = c;
        htescape(ctx, q, zp - q, 1);
        if (doing_href)
            htprintf(ctx, "</a>");
        htescape(ctx, zp, p - zp, 1);
        q = p;
    }
    htescape(ctx, q, strlen(q), 1);
    htprintf(ctx, "</code>\n");

    /*
     * Decide on the age limit of our colour coding, establish the
     * colour thresholds, and write out a key.
     */
    ctx->now = time(NULL);
    if (cfg->autoage) {
        ctx->oldest = index_order_stat(t, 0.05);
        ctx->newest = index_order_stat(t, 1.0);
        ctx->oldest = round_and_format_age(ctx, ctx->oldest, agebuf1, -1);
        ctx->newest = round_and_format_age(ctx, ctx->newest, agebuf2, +1);
    } else {
        ctx->oldest = cfg->oldest;
        ctx->newest = cfg->newest;
        ctx->oldest = round_and_format_age(ctx, ctx->oldest, agebuf1, 0);
        ctx->newest = round_and_format_age(ctx, ctx->newest, agebuf2, 0);
    }
    for (i = 0; i < MAXCOLOUR; i++) {
        ctx->thresholds[i] =
            ctx->oldest + (ctx->newest - ctx->oldest) * i / (MAXCOLOUR-1);
    }
    for (i = 0; i <= MAXCOLOUR; i++) {
        char buf[80];

        if (i == 0) {
            strcpy(buf, "&gt; ");
            round_and_format_age(ctx, ctx->thresholds[0], buf+5, 0);
        } else if (i == MAXCOLOUR) {
            strcpy(buf, "&lt; ");
            round_and_format_age(ctx, ctx->thresholds[MAXCOLOUR-1], buf+5, 0);
        } else {
            unsigned long long midrange =
                (ctx->thresholds[i-1] + ctx->thresholds[i]) / 2;
            round_and_format_age(ctx, midrange, buf, 0);
        }

        ctx->titletexts[i] = dupstr(buf);
    }
    htprintf(ctx, "<p align=center>Key to colour coding (mouse over for more"
             " detail):\n");
    htprintf(ctx, "<p align=center style=\"padding: 0; margin-top:0.4em; "
             "margin-bottom:1em\">");
    begin_colour_bar(ctx);
    htprintf(ctx, "<td style=\"padding-right:1em\">%s</td>\n", agebuf1);
    for (i = 0; i < MAXCOLOUR; i++)
        add_to_colour_bar(ctx, i, 1);
    htprintf(ctx, "<td style=\"padding-left:1em\">%s</td>\n", agebuf2);
    end_colour_bar(ctx);

    /*
     * Begin the main table.
     */
    htprintf(ctx, "<p align=center>\n<table style=\"margin:0; border:0\">\n");

    /*
     * Find the total size of our entire subdirectory. We'll use
     * that as the scale for all the colour bars in this report.
     */
    get_indices(t, path, &xi1, &xi2);
    ctx->totalsize = fetch_size(t, xi1, xi2, ULLONG_MAX);

    /*
     * Generate a report line for the whole subdirectory.
     */
    vecsize = 64;
    vecs = snewn(vecsize, struct vector *);
    nvecs = 1;
    vecs[0] = make_vector(ctx, path, false, true, NULL, false);
    print_heading(ctx, "Overall");
    write_report_line(ctx, vecs[0]);

    /*
     * Now generate report lines for all its children, and the
     * files contained in it.
     */
    print_heading(ctx, "Subdirectories");

    if (cfg->showfiles) {
        /* Every file directly in this directory is going to end up in
         * its own entry in the loop below, and then it'll be
         * subtracted from vecs[0]. So vecs[0] will end up tracking
         * the size of the _directory inode_ only. */
        vecs[0]->name = dupstr("[directory]");
    } else {
        /* Otherwise, vecs[0] will track everything that wasn't part
         * of a subdirectory, which includes the directory inode but
         * also all the files within it. Use this more general name. */
        vecs[0]->name = dupstr("[files]");
    }

    get_indices(t, path, &xi1, &xi2);
    xi1++;
    pathlen = strlen(path);
    subdirpos = pathlen + 1;
    if (pathlen > 0 && path[pathlen-1] == pathsep)
        subdirpos--;
    while (xi1 < xi2) {
        trie_getpath(t, xi1, path2);
        get_indices(t, ctx->path2, &xj1, &xj2);
        xi1 = xj2;
        if (!cfg->showfiles && xj2 - xj1 <= 1)
            continue;                  /* skip individual files */
        if (nvecs >= vecsize) {
            vecsize = nvecs * 3 / 2 + 64;
            vecs = sresize(vecs, vecsize, struct vector *);
        }
        assert(strlen(path2) > pathlen);
        vecs[nvecs] = make_vector(ctx, path2, downlink && (xj2 - xj1 > 1),
                                  false, path2 + subdirpos, 1);
        for (i = 0; i <= MAXCOLOUR; i++)
            vecs[0]->sizes[i] -= vecs[nvecs]->sizes[i];
        nvecs++;
    }

    qsort(vecs, nvecs, sizeof(vecs[0]), vec_compare);

    for (i = 0; i < nvecs; i++)
        if (vecs[i]->sizes[MAXCOLOUR])
            write_report_line(ctx, vecs[i]);

    /*
     * Close the main table.
     */
    htprintf(ctx, "
</table>\n");

    /*
     * Finish up and tidy up.
     */
    htprintf(ctx, "</body>\n");
    htprintf(ctx, "</html>\n");
    sfree(ctx->oururi);
    sfree(path2);
    sfree(path);
    for (i = 0; i < nvecs; i++) {
        sfree(vecs[i]->name);
        sfree(vecs[i]);
    }
    sfree(vecs);

    return ctx->buf;
}

int html_dump(const void *t, unsigned long index, unsigned long endindex,
              int maxdepth, const struct html_config *cfg,
              const char *pathprefix)
{
    /*
     * Determine the filename for this file.
     */
    assert(cfg->fileformat != NULL);
    char *filename = format_string(cfg->fileformat, index, t);
    char *path = dupfmt("%s%s", pathprefix, filename);
    sfree(filename);

    /*
     * Create the HTML itself. Don't write out downlinks from our
     * deepest level.
     */
    char *html = html_query(t, index, cfg, maxdepth != 0);

    /*
     * Write it out.
     */
    FILE *fp = fopen(path, "w");
    if (!fp) {
        fprintf(stderr, "%s: %s: open: %s\n", PNAME, path, strerror(errno));
        return 1;
    }
    if (fputs(html, fp) < 0) {
        fprintf(stderr, "%s: %s: write: %s\n", PNAME, path, strerror(errno));
        fclose(fp);
        return 1;
    }
    if (fclose(fp) < 0) {
        fprintf(stderr, "%s: %s: fclose: %s\n", PNAME, path, strerror(errno));
        return 1;
    }
    sfree(path);

    /*
     * Recurse.
     */
    if (maxdepth != 0) {
        unsigned long subindex, subendindex;
        int newdepth = (maxdepth > 0 ? maxdepth - 1 : maxdepth);
        char rpath[1+trie_maxpathlen(t)];

        index++;
        while (index < endindex) {
            trie_getpath(t, index, rpath);
            get_indices(t, rpath, &subindex, &subendindex);
            index = subendindex;
            if (subendindex - subindex > 1) {
                if (html_dump(t, subindex, subendindex, newdepth,
                              cfg, pathprefix))
                    return 1;
            }
        }
    }

    return 0;
}
agedu-20211129.8cd63c5/fgetline.h0000644000175000017500000000053014151034324015115 0ustar simonsimon/*
 * fgetline.h: Utility function to read a complete line of text
 * from a file, allocating memory for it as large as necessary.
 */

/*
 * Read an entire line of text from a file. Return a dynamically
 * allocated buffer containing the complete line including
 * trailing newline.
 *
 * On error, returns NULL.
 */
char *fgetline(FILE *fp);
agedu-20211129.8cd63c5/fgetline.c0000644000175000017500000000104114151034324015110 0ustar simonsimon/*
 * fgetline.c: implementation of fgetline.h.
 */

#include "agedu.h"
#include "alloc.h"
#include "fgetline.h"

char *fgetline(FILE *fp)
{
    char *ret = snewn(512, char);
    int size = 512, len = 0;
    while (fgets(ret + len, size - len, fp)) {
        len += strlen(ret + len);
        if (ret[len-1] == '\n')
            break;                     /* got a newline, we're done */
        size = len + 512;
        ret = sresize(ret, size, char);
    }
    if (len == 0) {                    /* first fgets returned NULL */
        sfree(ret);
        return NULL;
    }
    ret[len] = '\0';
    return ret;
}
agedu-20211129.8cd63c5/dumpfile.h0000644000175000017500000000140714151034324015131 0ustar simonsimon#include <stdio.h>
#include <stdbool.h>

typedef struct dumpfile_load_state dumpfile_load_state;

typedef struct dumpfile_record {
    struct trie_file tf;
    const char *pathname;
} dumpfile_record;

dumpfile_load_state *dumpfile_load_init(FILE *fp, bool check_order);
/* dumpfile_load_record returns -1 for bad format, 0 for EOF, 1 for success */
int dumpfile_load_record(dumpfile_load_state *dls, dumpfile_record *dr);
FILE *dumpfile_load_finish(dumpfile_load_state *dls);
char dumpfile_load_get_pathsep(dumpfile_load_state *dls);

typedef struct dumpfile_write_state {
    FILE *fp;
    bool sortable;
    char pathsep;
} dumpfile_write_state;

bool dump_write_header(dumpfile_write_state *ws);
bool dump_write_record(dumpfile_write_state *ws, const dumpfile_record *dr);
agedu-20211129.8cd63c5/dumpfile.c0000644000175000017500000001475114151034324015132 0ustar simonsimon#include "agedu.h"
#include "alloc.h"
#include "trie.h"
#include "dumpfile.h"
#include "fgetline.h"

#define DUMPHDR "agedu dump file. pathsep="
#define DUMPHDR_SORTABLE "0:agedu sortable dump file.
pathsep="

struct dumpfile_load_state {
    FILE *fp;
    char pathsep;
    int lineno;
    char *prev_pathname;
    bool sortable;
    bool check_order;
};

dumpfile_load_state *dumpfile_load_init(FILE *fp, bool check_order)
{
    char *buf = fgetline(fp);
    if (!buf) {
        fprintf(stderr, "%s: EOF at start of dump file\n", PNAME);
        return NULL;
    }

    unsigned pathsep;
    bool sortable;
    buf[strcspn(buf, "\r\n")] = '\0';
    if (1 == sscanf(buf, DUMPHDR "%x", &pathsep)) {
        sortable = false;
    } else if (1 == sscanf(buf, DUMPHDR_SORTABLE "%x", &pathsep)) {
        sortable = true;
    } else {
        sfree(buf);
        fprintf(stderr, "%s: header in dump file not recognised\n", PNAME);
        return NULL;
    }
    sfree(buf);

    dumpfile_load_state *dls = snew(dumpfile_load_state);
    dls->fp = fp;
    dls->pathsep = (char)pathsep;
    dls->lineno = 1;
    dls->prev_pathname = NULL;
    dls->sortable = sortable;
    dls->check_order = check_order;
    return dls;
}

FILE *dumpfile_load_finish(dumpfile_load_state *dls)
{
    FILE *fp = dls->fp;
    sfree(dls->prev_pathname);
    sfree(dls);
    return fp;
}

char dumpfile_load_get_pathsep(dumpfile_load_state *dls)
{
    return dls->pathsep;
}

int dumpfile_load_record(dumpfile_load_state *dls, dumpfile_record *dr)
{
    dr->pathname = NULL;

    char *buf = fgetline(dls->fp);     /* read from the dump file handle,
                                        * not stdin */
    if (!buf)
        return 0;
    dls->lineno++;
    buf[strcspn(buf, "\r\n")] = '\0';

    char *p = buf;

    if (dls->sortable) {
        if (buf[0] != '1' || buf[1] != ':') {
            fprintf(stderr, "%s: dump file line %d: could not find expected "
                    "pathname prefix\n", PNAME, dls->lineno);
            return -1;
        }
        /* Skip over the sortably-encoded pathname looking for the
         * numeric fields after it */
        while (*p && *p != ' ')
            p++;
        if (*p) {
            p++;
        } else {
            fprintf(stderr, "%s: dump file line %d: could not find space "
                    "after encoded pathname\n", PNAME, dls->lineno);
            return -1;
        }
    }

    char *q = p;
    while (*p && *p != ' ')
        p++;
    if (!*p) {
        fprintf(stderr, "%s: dump file line %d: could not find size field\n",
                PNAME, dls->lineno);
        return -1;
    }
    *p++ = '\0';
    dr->tf.size = strtoull(q, NULL, 10);

    q = p;
    while (*p && *p != ' ')
        p++;
    if (*p)
        *p++ = '\0';
    dr->tf.atime
= strtoull(q, NULL, 10);

    q = buf;
    if (!dls->sortable) {
        if (!*p) {
            fprintf(stderr, "%s: dump file line %d: could not find "
                    "pathname field\n", PNAME, dls->lineno);
            return -1;
        }

        while (*p) {
            int c = *p;
            if (*p == '%') {
                int i;
                p++;
                c = 0;
                for (i = 0; i < 2; i++) {
                    c *= 16;
                    if (*p >= '0' && *p <= '9')
                        c += *p - '0';
                    else if (*p >= 'A' && *p <= 'F')
                        c += *p - ('A' - 10);
                    else if (*p >= 'a' && *p <= 'f')
                        c += *p - ('a' - 10);
                    else {
                        fprintf(stderr, "%s: dump file line %d: unable"
                                " to parse hex escape\n", PNAME, dls->lineno);
                        return -1;
                    }
                    p++;
                }
            } else {
                p++;
            }
            *q++ = c;
        }
    } else {
        /* Rewind to the start of the line for the pathname */
        p = buf + 2;
        while (*p && *p != ' ') {
            if (*p == 'A') {
                *q++ = dls->pathsep;
                p++;
            } else if (p[0] >= 'B' && p[0] < 'B'+16 &&
                       p[1] >= 'b' && p[1] < 'b'+16) {
                *q++ = (p[0] - 'B') * 16 + (p[1] - 'b');
                p += 2;
            } else {
                fprintf(stderr, "%s: dump file line %d: unable"
                        " to parse encoded pathname\n", PNAME, dls->lineno);
                return -1;
            }
        }
    }
    *q = '\0';

    if (dls->check_order && dls->prev_pathname &&
        triecmp(dls->prev_pathname, buf, NULL, dls->pathsep) >= 0) {
        fprintf(stderr, "%s: dump file line %d: pathname sorts before "
                "pathname on previous line\n", PNAME, dls->lineno);
        return -1;
    }
    sfree(dls->prev_pathname);
    dls->prev_pathname = buf;

    dr->pathname = buf;
    return 1;
}

bool dump_write_header(dumpfile_write_state *ws)
{
    if (!ws->sortable) {
        if (fprintf(ws->fp, DUMPHDR "%02x\n", (unsigned char)ws->pathsep) < 0)
            return false;
    } else {
        if (fprintf(ws->fp, DUMPHDR_SORTABLE "%02x\n",
                    (unsigned char)ws->pathsep) < 0)
            return false;
    }
    return true;
}

bool dump_write_record(dumpfile_write_state *ws, const dumpfile_record *dr)
{
    const char *p;

    if (ws->sortable) {
        /* Pathname comes first, and is hex-ishly encoded. Write it to
         * ws->fp like everything else, not to stdout. */
        if (fputs("1:", ws->fp) < 0)   /* prefix to sort after header line */
            return false;
        for (const char *p = dr->pathname; *p; p++) {
            if (*p == ws->pathsep) {
                if (putc('A', ws->fp) == EOF)
                    return false;
            } else {
                unsigned char val = *p;
                char c1 = 'B' + (0xF & (val >> 4));
                char c2 = 'b'
+ (0xF & (val >> 0)); if (putc(c1, ws->fp) == EOF || putc(c2, ws->fp) == EOF) return false; } } if (putc(' ', ws->fp) == EOF) return false; } if (fprintf(ws->fp, "%llu %llu", dr->tf.size, dr->tf.atime) < 0) return false; if (!ws->sortable) { if (putc(' ', ws->fp) == EOF) return false; /* Pathname comes second, and is %-escaped. */ for (p = dr->pathname; *p; p++) { if (*p >= ' ' && *p < 127 && *p != '%') { if (putc(*p, ws->fp) == EOF) return false; } else { if (fprintf(ws->fp, "%%%02x", (unsigned char)*p) < 0) return false; } } } if (putc('\n', ws->fp) == EOF) return false; return true; } agedu-20211129.8cd63c5/du.h0000644000175000017500000000153514151034324013736 0ustar simonsimon/* * du.h: the function which actually performs the disk scan. */ #include #include /* * Function called to report a file or directory, its size and its * last-access time. * * Returns non-zero if recursion should proceed into this file's * contents (if it's a directory); zero if it should not. If the * file isn't a directory, the return value is ignored. */ typedef int (*gotdata_fn_t)(void *ctx, const char *pathname, const STRUCT_STAT *st); /* * Function called to report an error during scanning. The ctx is * the same one passed to gotdata_fn_t. */ typedef void (*err_fn_t)(void *vctx, const char *fmt, ...); /* * Recursively scan a directory tree and report every * space-consuming item in it to gotdata(). */ void du(const char *path, gotdata_fn_t gotdata, err_fn_t err, void *gotdata_ctx); agedu-20211129.8cd63c5/du.c0000644000175000017500000001545214151034324013734 0ustar simonsimon/* * du.c: implementation of du.h. 
*/ #include "agedu.h" #include "du.h" #include "alloc.h" #if !defined __linux__ || !defined O_NOATIME || HAVE_FDOPENDIR #if HAVE_DIRENT_H # include <dirent.h> #endif #if HAVE_NDIR_H # include <ndir.h> #endif #if HAVE_SYS_DIR_H # include <sys/dir.h> #endif #if HAVE_SYS_NDIR_H # include <sys/ndir.h> #endif /* * Wrappers around POSIX opendir, readdir and closedir, which * permit me to replace them with different wrappers in special * circumstances. */ typedef DIR *dirhandle; int open_dir(const char *path, dirhandle *dh) { #if defined O_NOATIME && HAVE_FDOPENDIR /* * On Linux, we have the O_NOATIME flag. This means we can * read the contents of directories without affecting their * atimes, which enables us to at least try to include them in * the age display rather than exempting them. * * Unfortunately, opendir() doesn't let us open a directory * with O_NOATIME. So instead, we have to open the directory * with vanilla open(), and then use fdopendir() to translate * the fd into a POSIX dir handle. */ int fd; fd = open(path, O_RDONLY | O_NONBLOCK | O_NOCTTY | O_LARGEFILE | O_NOATIME | O_DIRECTORY); if (fd < 0) { /* * Opening a file with O_NOATIME is not unconditionally * permitted by the Linux kernel. As far as I can tell, * it's permitted only for files on which the user would * have been able to call utime(2): in other words, files * for which the user could have deliberately set the * atime back to its original value after finishing with * it. Hence, O_NOATIME has no security implications; it's * simply a cleaner, faster and more race-condition-free * alternative to stat(), a normal open(), and a utimes() * when finished. * * The upshot of all of which, for these purposes, is that * we must be prepared to try again without O_NOATIME if * we receive EPERM.
*/ if (errno == EPERM) fd = open(path, O_RDONLY | O_NONBLOCK | O_NOCTTY | O_LARGEFILE | O_DIRECTORY); if (fd < 0) return -1; } *dh = fdopendir(fd); #else *dh = opendir(path); #endif if (!*dh) return -1; return 0; } const char *read_dir(dirhandle *dh) { struct dirent *de = readdir(*dh); return de ? de->d_name : NULL; } void close_dir(dirhandle *dh) { closedir(*dh); } #else /* defined __linux__ && defined O_NOATIME && !HAVE_FDOPENDIR */ /* * Earlier versions of glibc do not have fdopendir(). Therefore, * if we are on Linux and still wish to make use of O_NOATIME, we * have no option but to talk directly to the kernel system call * interface which underlies the POSIX opendir/readdir machinery. */ #define __KERNEL__ #include #include #include _syscall3(int, getdents, uint, fd, struct dirent *, dirp, uint, count) typedef struct { int fd; struct dirent data[32]; struct dirent *curr; int pos, endpos; } dirhandle; int open_dir(const char *path, dirhandle *dh) { /* * As above, we try with O_NOATIME and then fall back to * trying without it. 
*/ dh->fd = open(path, O_RDONLY | O_NONBLOCK | O_NOCTTY | O_LARGEFILE | O_NOATIME | O_DIRECTORY); if (dh->fd < 0) { if (errno == EPERM) dh->fd = open(path, O_RDONLY | O_NONBLOCK | O_NOCTTY | O_LARGEFILE | O_DIRECTORY); if (dh->fd < 0) return -1; } dh->pos = dh->endpos = 0; return 0; } const char *read_dir(dirhandle *dh) { const char *ret; if (dh->pos >= dh->endpos) { dh->curr = dh->data; dh->pos = 0; dh->endpos = getdents(dh->fd, dh->data, sizeof(dh->data)); if (dh->endpos <= 0) return NULL; } ret = dh->curr->d_name; dh->pos += dh->curr->d_reclen; dh->curr = (struct dirent *)((char *)dh->data + dh->pos); return ret; } void close_dir(dirhandle *dh) { close(dh->fd); } #endif /* !defined __linux__ || !defined O_NOATIME || HAVE_FDOPENDIR */ static int str_cmp(const void *av, const void *bv) { return strcmp(*(const char **)av, *(const char **)bv); } static void du_recurse(char **path, size_t pathlen, size_t *pathsize, gotdata_fn_t gotdata, err_fn_t err, void *gotdata_ctx, int toplevel) { const char *name; dirhandle d; STRUCT_STAT st; char **names; size_t i, nnames, namesize; int statret; /* * Special case: at the very top of the scan, we follow a * symlink. */ if (toplevel) statret = STAT_FUNC(*path, &st); else statret = LSTAT_FUNC(*path, &st); if (statret < 0) { err(gotdata_ctx, "%s: lstat: %s\n", *path, strerror(errno)); return; } if (!gotdata(gotdata_ctx, *path, &st)) return; if (!S_ISDIR(st.st_mode)) return; names = NULL; nnames = namesize = 0; if (open_dir(*path, &d) < 0) { err(gotdata_ctx, "%s: opendir: %s\n", *path, strerror(errno)); return; } while ((name = read_dir(&d)) != NULL) { if (name[0] == '.' && (!name[1] || (name[1] == '.' && !name[2]))) { /* do nothing - we skip "." and ".." 
*/ } else { if (nnames >= namesize) { namesize = nnames * 3 / 2 + 64; names = sresize(names, namesize, char *); } names[nnames++] = dupstr(name); } } close_dir(&d); if (nnames == 0) return; qsort(names, nnames, sizeof(*names), str_cmp); for (i = 0; i < nnames; i++) { /* * readdir(3) has occasionally been known to report two copies * of the identical file name, in cases involving strange file * system implementations or (possibly) race conditions. To * avoid failing an assertion in the trie code, de-duplicate. */ if (i+1 < nnames && !strcmp(names[i], names[i+1])) continue; size_t newpathlen = pathlen + 1 + strlen(names[i]); if (*pathsize <= newpathlen) { *pathsize = newpathlen * 3 / 2 + 256; *path = sresize(*path, *pathsize, char); } /* * Avoid duplicating a slash if we got a trailing one to * begin with (i.e. if we're starting the scan in '/' itself). */ if (pathlen > 0 && (*path)[pathlen-1] == '/') { strcpy(*path + pathlen, names[i]); newpathlen--; } else { sprintf(*path + pathlen, "/%s", names[i]); } du_recurse(path, newpathlen, pathsize, gotdata, err, gotdata_ctx, 0); sfree(names[i]); } sfree(names); } void du(const char *inpath, gotdata_fn_t gotdata, err_fn_t err, void *gotdata_ctx) { char *path; size_t pathlen, pathsize; pathlen = strlen(inpath); /* * Trim any trailing slashes from the input path, otherwise we'll * store them in the index with confusing effects. 
while (pathlen > 1 && inpath[pathlen-1] == '/') pathlen--; pathsize = pathlen + 256; path = snewn(pathsize, char); memcpy(path, inpath, pathlen); path[pathlen] = '\0'; du_recurse(&path, pathlen, &pathsize, gotdata, err, gotdata_ctx, 1); } agedu-20211129.8cd63c5/cmake.h.in0000644000175000017500000000227714151034324015017 0ustar simonsimon#cmakedefine01 HAVE_ASSERT_H #cmakedefine01 HAVE_ARPA_INET_H #cmakedefine01 HAVE_CTYPE_H #cmakedefine01 HAVE_DIRENT_H #cmakedefine01 HAVE_ERRNO_H #cmakedefine01 HAVE_FCNTL_H #cmakedefine01 HAVE_FEATURES_H #cmakedefine01 HAVE_FNMATCH_H #cmakedefine01 HAVE_LIMITS_H #cmakedefine01 HAVE_NDIR_H #cmakedefine01 HAVE_NETDB_H #cmakedefine01 HAVE_NETINET_IN_H #cmakedefine01 HAVE_PWD_H #cmakedefine01 HAVE_STDARG_H #cmakedefine01 HAVE_STDDEF_H #cmakedefine01 HAVE_STDINT_H #cmakedefine01 HAVE_STDIO_H #cmakedefine01 HAVE_STDBOOL_H #cmakedefine01 HAVE_STDLIB_H #cmakedefine01 HAVE_STRING_H #cmakedefine01 HAVE_SYS_DIR_H #cmakedefine01 HAVE_SYS_IOCTL_H #cmakedefine01 HAVE_SYS_MMAN_H #cmakedefine01 HAVE_SYS_NDIR_H #cmakedefine01 HAVE_SYS_SELECT_H #cmakedefine01 HAVE_SYS_SOCKET_H #cmakedefine01 HAVE_SYS_STAT_H #cmakedefine01 HAVE_SYS_TYPES_H #cmakedefine01 HAVE_SYS_WAIT_H #cmakedefine01 HAVE_SYSLOG_H #cmakedefine01 HAVE_TERMIOS_H #cmakedefine01 HAVE_TIME_H #cmakedefine01 HAVE_UNISTD_H #cmakedefine01 HAVE_FDOPENDIR #cmakedefine01 HAVE_GETADDRINFO #cmakedefine01 HAVE_GETHOSTBYNAME #cmakedefine01 HAVE_LSTAT64 #cmakedefine01 HAVE_STAT64 #cmakedefine01 HAVE_STRSIGNAL #cmakedefine01 AGEDU_IPV6 #cmakedefine01 AGEDU_IPV4 agedu-20211129.8cd63c5/alloc.h0000644000175000017500000000431714151034324014421 0ustar simonsimon/* * alloc.h: safe wrappers around malloc, realloc, free, strdup */ #ifndef AGEDU_ALLOC_H #define AGEDU_ALLOC_H #include <stddef.h> /* * smalloc should guarantee to return a useful pointer.
*/ void *smalloc(size_t size); /* * srealloc should guaranteeably be able to realloc NULL */ void *srealloc(void *p, size_t size); /* * sfree should guaranteeably deal gracefully with freeing NULL */ void sfree(void *p); /* * dupstr is like strdup, but with the never-return-NULL property * of smalloc (and also reliably defined in all environments :-) */ char *dupstr(const char *s); /* * dupfmt is a bit like printf, but does its own allocation and * returns a dynamic string. It also supports a different (and * much less featureful) set of format directives: * * - %D takes no argument, and gives the current date and time in * a format suitable for an HTTP Date header * - %d takes an int argument and formats it like normal %d (but * doesn't support any of the configurability of standard * printf) * - %s takes a const char * and formats it like normal %s * (again, no fine-tuning available) * - %h takes a const char * but escapes it so that it's safe for * HTML * - %S takes a bool followed by a const char *. If the bool is * false, it behaves just like %s. If the bool is true, it * transforms the string by stuffing a \r before every \n. */ char *dupfmt(const char *fmt, ...); /* * snew allocates one instance of a given type, and casts the * result so as to type-check that you're assigning it to the * right kind of pointer. Protects against allocation bugs * involving allocating the wrong size of thing. */ #define snew(type) \ ( (type *) smalloc (sizeof (type)) ) /* * snewn allocates n instances of a given type, for arrays. */ #define snewn(number, type) \ ( (type *) smalloc ((number) * sizeof (type)) ) /* * sresize wraps realloc so that you specify the new number of * elements and the type of the element, with the same type- * checking advantages. Also type-checks the input pointer. 
*/ #define sresize(array, number, type) \ ( (void)sizeof((array)-(type *)0), \ (type *) srealloc ((array), (number) * sizeof (type)) ) #endif /* AGEDU_ALLOC_H */ agedu-20211129.8cd63c5/alloc.c0000644000175000017500000000536214151034324014415 0ustar simonsimon/* * alloc.c: implementation of alloc.h */ #include "agedu.h" #include "alloc.h" extern void fatal(const char *, ...); void *smalloc(size_t size) { void *p; p = malloc(size); if (!p) { fatal("out of memory"); } return p; } void sfree(void *p) { if (p) { free(p); } } void *srealloc(void *p, size_t size) { void *q; if (p) { q = realloc(p, size); } else { q = malloc(size); } if (!q) fatal("out of memory"); return q; } char *dupstr(const char *s) { char *r = smalloc(1+strlen(s)); strcpy(r,s); return r; } char *dupfmt(const char *fmt, ...) { int pass; int totallen; char *ret = NULL, *rp = NULL; char datebuf[80]; va_list ap; time_t t; struct tm tm; bool got_time = false; datebuf[0] = '\0'; totallen = 0; for (pass = 0; pass < 2; pass++) { const char *p = fmt; va_start(ap, fmt); while (*p) { const char *data = NULL; int datalen = 0, stuffcr = 0, htmlesc = 0; if (*p == '%') { p++; if (*p == 'D') { if (!datebuf[0]) { if (!got_time) { t = time(NULL); tm = *gmtime(&t); got_time = true; } strftime(datebuf, lenof(datebuf), "%a, %d %b %Y %H:%M:%S GMT", &tm); } data = datebuf; datalen = strlen(data); } else if (*p == 'd') { int i = va_arg(ap, int); sprintf(datebuf, "%d", i); data = datebuf; datalen = strlen(data); } else if (*p == 's') { data = va_arg(ap, const char *); datalen = strlen(data); } else if (*p == 'h') { htmlesc = 1; data = va_arg(ap, const char *); datalen = strlen(data); } else if (assert(*p == 'S'), 1) { stuffcr = va_arg(ap, int); data = va_arg(ap, const char *); datalen = strlen(data); } p++; } else { data = p; while (*p && *p != '%') p++; datalen = p - data; } if (pass == 0) { while (datalen > 0) { totallen++; if (stuffcr && *data == '\n') totallen++; if (htmlesc && (*data == '<' || *data == '>' || *data == 
'&')) totallen += 4; /* max(len("&gt;"),len("&amp;")) */ data++, datalen--; } } else { while (datalen > 0) { if (htmlesc && (*data < 32 || *data >= 127)) *rp++ = '?'; /* *shrug* */ else if (htmlesc && *data == '<') rp += sprintf(rp, "&lt;"); else if (htmlesc && *data == '>') rp += sprintf(rp, "&gt;"); else if (htmlesc && *data == '&') rp += sprintf(rp, "&amp;"); else if (stuffcr && *data == '\n') *rp++ = '\r', *rp++ = '\n'; else *rp++ = *data; data++, datalen--; } } } va_end(ap); if (pass == 0) { rp = ret = snewn(totallen+1, char); } else { assert(rp - ret == totallen); *rp = '\0'; } } return ret; } agedu-20211129.8cd63c5/agedu.h0000644000175000017500000000413114151034324014406 0ustar simonsimon/* * Central header file for agedu, defining various useful things. */ #include "cmake.h" #if HAVE_FEATURES_H #define _GNU_SOURCE #include <features.h> #endif #if HAVE_STDIO_H # include <stdio.h> #endif #if HAVE_STDBOOL_H # include <stdbool.h> #endif #if HAVE_ERRNO_H # include <errno.h> #endif #if HAVE_TIME_H # include <time.h> #endif #if HAVE_ASSERT_H # include <assert.h> #endif #if HAVE_STRING_H # include <string.h> #endif #if HAVE_STDLIB_H # include <stdlib.h> #endif #if HAVE_STDARG_H # include <stdarg.h> #endif #if HAVE_STDINT_H # include <stdint.h> #endif #if HAVE_STDDEF_H # include <stddef.h> #endif #if HAVE_LIMITS_H # include <limits.h> #endif #if HAVE_CTYPE_H # include <ctype.h> #endif #if HAVE_SYS_TYPES_H # include <sys/types.h> #endif #if HAVE_SYS_STAT_H # include <sys/stat.h> #endif #if HAVE_UNISTD_H # include <unistd.h> #endif #if HAVE_FCNTL_H # include <fcntl.h> #endif #if HAVE_SYS_MMAN_H # include <sys/mman.h> #endif #if HAVE_TERMIOS_H # include <termios.h> #endif #if HAVE_SYS_IOCTL_H # include <sys/ioctl.h> #endif #if HAVE_FNMATCH_H # include <fnmatch.h> #endif #if HAVE_PWD_H # include <pwd.h> #endif #if HAVE_SYS_WAIT_H # include <sys/wait.h> #endif #if HAVE_SYS_SOCKET_H # include <sys/socket.h> #endif #if HAVE_ARPA_INET_H # include <arpa/inet.h> #endif #if HAVE_NETINET_IN_H # include <netinet/in.h> #endif #if HAVE_SYSLOG_H # include <syslog.h> #endif #if HAVE_SYS_SELECT_H # include <sys/select.h> #endif #if HAVE_NETDB_H # include <netdb.h> #endif #ifndef HOST_NAME_MAX /* Reportedly at least one Solaris fails to comply with its POSIX * requirement to define this (see POSIX spec for gethostname) */ #define
HOST_NAME_MAX 255 /* upper bound specified in SUS */ #endif #define PNAME "agedu" #define lenof(x) (sizeof((x))/sizeof(*(x))) extern char pathsep; #if HAVE_LSTAT64 && HAVE_STAT64 #define STRUCT_STAT struct stat64 #define LSTAT_FUNC lstat64 #define STAT_FUNC stat64 #else #define STRUCT_STAT struct stat #define LSTAT_FUNC lstat #define STAT_FUNC stat #endif #define max(x,y) ( (x) > (y) ? (x) : (y) ) #define min(x,y) ( (x) < (y) ? (x) : (y) ) agedu-20211129.8cd63c5/agedu.c0000644000175000017500000014375014151034324014414 0ustar simonsimon/* * Main program for agedu. */ #include "agedu.h" #include "du.h" #include "trie.h" #include "index.h" #include "alloc.h" #include "html.h" #include "httpd.h" #include "fgetline.h" #include "dumpfile.h" #include "version.h" /* * Path separator. This global variable affects the behaviour of * various parts of the code when they need to deal with path * separators. The path separator appropriate to a particular data * set is encoded in the index file storing that data set; data * sets generated on Unix will of course have the default '/', but * foreign data sets are conceivable and must be handled correctly. */ char pathsep = '/'; void fatal(const char *fmt, ...) 
{ va_list ap; fprintf(stderr, "%s: ", PNAME); va_start(ap, fmt); vfprintf(stderr, fmt, ap); va_end(ap); fprintf(stderr, "\n"); exit(1); } typedef enum IncludeStatus { PRUNE, EXCLUDE, INCLUDE } IncludeStatus; struct inclusion_exclusion { IncludeStatus status; const char *wildcard; int path; }; struct ctx { triebuild *tb; dev_t datafile_dev, filesystem_dev; ino_t datafile_ino; time_t last_output_update; bool progress; int progwidth; bool straight_to_dump; struct inclusion_exclusion *inex; int ninex; bool crossfs; bool usemtime; bool uselogicalsize; bool fakeatimes; }; static dumpfile_write_state writestate; static void dump_line(const char *pathname, const struct trie_file *tf) { dumpfile_record dr; dr.pathname = pathname; dr.tf = *tf; if (!dump_write_record(&writestate, &dr)) fatal("standard output: %s", strerror(errno)); } static int gotdata(void *vctx, const char *pathname, const STRUCT_STAT *st) { struct ctx *ctx = (struct ctx *)vctx; struct trie_file file; time_t t; int i; IncludeStatus status; const char *filename; /* * Filter out our own data file. */ if (st->st_dev == ctx->datafile_dev && st->st_ino == ctx->datafile_ino) return 0; /* * Don't cross the streams^W^Wany file system boundary. */ if (!ctx->crossfs && st->st_dev != ctx->filesystem_dev) return 0; if (ctx->uselogicalsize) file.size = st->st_size; else file.size = (unsigned long long)512 * st->st_blocks; if (ctx->usemtime || (ctx->fakeatimes && S_ISDIR(st->st_mode))) file.atime = st->st_mtime; else file.atime = max(st->st_mtime, st->st_atime); /* * Filter based on wildcards. */ status = INCLUDE; filename = strrchr(pathname, pathsep); if (!filename) filename = pathname; else filename++; for (i = 0; i < ctx->ninex; i++) { if (fnmatch(ctx->inex[i].wildcard, ctx->inex[i].path ? 
pathname : filename, 0) == 0) status = ctx->inex[i].status; } if (status == PRUNE) return 0; /* ignore this entry and any subdirs */ if (status == EXCLUDE) { /* * Here we are supposed to be filtering an entry out, but * still recursing into it if it's a directory. However, * we can't actually leave out any directory whose * subdirectories we then look at. So we cheat, in that * case, by setting the size to zero. */ if (!S_ISDIR(st->st_mode)) return 0; /* just ignore */ else file.size = 0; } if (ctx->straight_to_dump) dump_line(pathname, &file); else triebuild_add(ctx->tb, pathname, &file); if (ctx->progress) { t = time(NULL); if (t != ctx->last_output_update) { fprintf(stderr, "%-*.*s\r", ctx->progwidth, ctx->progwidth, pathname); fflush(stderr); ctx->last_output_update = t; } } return 1; } static void scan_error(void *vctx, const char *fmt, ...) { struct ctx *ctx = (struct ctx *)vctx; va_list ap; if (ctx->progress) { fprintf(stderr, "%-*s\r", ctx->progwidth, ""); fflush(stderr); } fprintf(stderr, "%s: ", PNAME); va_start(ap, fmt); vfprintf(stderr, fmt, ap); va_end(ap); ctx->last_output_update--; /* force a progress report next time */ } static void text_query(const void *mappedfile, const char *querydir, time_t t, bool showfiles, int depth, FILE *fp) { size_t maxpathlen; char *pathbuf; unsigned long xi1, xi2; unsigned long long size; maxpathlen = trie_maxpathlen(mappedfile); pathbuf = snewn(maxpathlen + 1, char); /* * We want to query everything between the supplied filename * (inclusive) and that filename with a ^A on the end * (exclusive). So find the x indices for each. */ strcpy(pathbuf, querydir); make_successor(pathbuf); xi1 = trie_before(mappedfile, querydir); xi2 = trie_before(mappedfile, pathbuf); if (!showfiles && xi2 - xi1 == 1) return; /* file, or empty dir => no display */ /* * Now do the lookups in the age index. 
*/ if (xi2 - xi1 == 1) { /* * We are querying an individual file, so we should not * depend on the index entries either side of the node, * since they almost certainly don't both exist. Instead, * just look up the file's size and atime in the main trie. */ const struct trie_file *f = trie_getfile(mappedfile, xi1); if (f->atime < t) size = f->size; else size = 0; } else { unsigned long long s1, s2; s1 = index_query(mappedfile, xi1, t); s2 = index_query(mappedfile, xi2, t); size = s2 - s1; } if (size == 0) return; /* no space taken up => no display */ if (depth != 0) { /* * Now scan for first-level subdirectories and report * those too. */ int newdepth = (depth > 0 ? depth - 1 : depth); xi1++; while (xi1 < xi2) { trie_getpath(mappedfile, xi1, pathbuf); text_query(mappedfile, pathbuf, t, showfiles, newdepth, fp); make_successor(pathbuf); xi1 = trie_before(mappedfile, pathbuf); } } /* Display in units of 1Kb */ fprintf(fp, "%-11llu %s\n", (size) / 1024, querydir); } /* * Largely frivolous way to define all my command-line options. I * present here a parametric macro which declares a series of * _logical_ option identifiers, and for each one declares zero or * more short option characters and zero or more long option * words. Then I repeatedly invoke that macro with its arguments * defined to be various other macros, which allows me to * variously: * * - define an enum allocating a distinct integer value to each * logical option id * - define a string consisting of precisely all the short option * characters * - define a string array consisting of all the long option * strings * - define (with help from auxiliary enums) integer arrays * parallel to both of the above giving the logical option id * for each physical short and long option * - define an array indexed by logical option id indicating * whether the option in question takes a value * - define a function which prints out brief online help for all * the options. 
* * It's not at all clear to me that this trickery is actually * particularly _efficient_ - it still, after all, requires going * linearly through the option list at run time and doing a * strcmp, whereas in an ideal world I'd have liked the lists of * long and short options to be pre-sorted so that a binary search * or some other more efficient lookup was possible. (Not that * asymptotic algorithmic complexity is remotely vital in option * parsing, but if I were doing this in, say, Lisp or something * with an equivalently powerful preprocessor then once I'd had * the idea of preparing the option-parsing data structures at * compile time I would probably have made the effort to prepare * them _properly_. I could have Perl generate me a source file * from some sort of description, I suppose, but that would seem * like overkill. And in any case, it's more of a challenge to * achieve as much as possible by cunning use of cpp and enum than * to just write some sensible and logical code in a Turing- * complete language. I said it was largely frivolous :-) * * This approach does have the virtue that it brings together the * option ids, option spellings and help text into a single * combined list and defines them all in exactly one place. If I * want to add a new option, or a new spelling for an option, I * only have to modify the main OPTHELP macro below and then add * code to process the new logical id. * * (Though, really, even that isn't ideal, since it still involves * modifying the source file in more than one place. In a * _properly_ ideal world, I'd be able to interleave the option * definitions with the code fragments that process them. And then * not bother defining logical identifiers for them at all - those * would be automatically generated, since I wouldn't have any * need to specify them manually in another part of the code.) 
* * One other helpful consequence of the enum-based structure here * is that it causes a compiler error if I accidentally try to * define the same option (short or long) twice. */ #define OPTHELP(NOVAL, VAL, SHORT, LONG, HELPPFX, HELPARG, HELPLINE, HELPOPT) \ HELPPFX("usage") HELPLINE(PNAME " [options] action [action...]") \ HELPPFX("actions") \ VAL(SCAN) SHORT(s) LONG(scan) \ HELPARG("directory") HELPOPT("scan and index a directory") \ NOVAL(HTTPD) SHORT(w) LONG(web) LONG(server) LONG(httpd) \ HELPOPT("serve HTML reports from a temporary web server") \ VAL(TEXT) SHORT(t) LONG(text) \ HELPARG("subdir") HELPOPT("print a plain text report on a subdirectory") \ NOVAL(REMOVE) SHORT(R) LONG(remove) LONG(delete) LONG(unlink) \ HELPOPT("remove the index file") \ NOVAL(DUMP) SHORT(D) LONG(dump) HELPOPT("dump the index file on stdout") \ NOVAL(LOAD) SHORT(L) LONG(load) \ HELPOPT("load and index a dump file") \ NOVAL(PRESORT) LONG(presort) \ HELPOPT("prepare a dump file for sorting") \ NOVAL(POSTSORT) LONG(postsort) \ HELPOPT("unprepare a dump file after sorting") \ VAL(SCANDUMP) SHORT(S) LONG(scan_dump) \ HELPARG("directory") HELPOPT("scan only, generating a dump") \ VAL(HTML) SHORT(H) LONG(html) \ HELPARG("subdir") HELPOPT("print an HTML report on a subdirectory") \ NOVAL(CGI) LONG(cgi) \ HELPOPT("do the right thing when run from a CGI script") \ HELPPFX("options") \ VAL(DATAFILE) SHORT(f) LONG(file) \ HELPARG("filename") HELPOPT("[most modes] specify index file") \ NOVAL(CROSSFS) LONG(cross_fs) \ HELPOPT("[--scan] cross filesystem boundaries") \ NOVAL(NOCROSSFS) LONG(no_cross_fs) \ HELPOPT("[--scan] stick to one filesystem") \ VAL(PRUNE) LONG(prune) \ HELPARG("wildcard") HELPOPT("[--scan] prune files matching pattern") \ VAL(PRUNEPATH) LONG(prune_path) \ HELPARG("wildcard") HELPOPT("[--scan] prune pathnames matching pattern") \ VAL(EXCLUDE) LONG(exclude) \ HELPARG("wildcard") HELPOPT("[--scan] exclude files matching pattern") \ VAL(EXCLUDEPATH) LONG(exclude_path) \ 
HELPARG("wildcard") HELPOPT("[--scan] exclude pathnames matching pattern") \ VAL(INCLUDE) LONG(include) \ HELPARG("wildcard") HELPOPT("[--scan] include files matching pattern") \ VAL(INCLUDEPATH) LONG(include_path) \ HELPARG("wildcard") HELPOPT("[--scan] include pathnames matching pattern") \ NOVAL(PROGRESS) LONG(progress) LONG(scan_progress) \ HELPOPT("[--scan] report progress on stderr") \ NOVAL(NOPROGRESS) LONG(no_progress) LONG(no_scan_progress) \ HELPOPT("[--scan] do not report progress") \ NOVAL(TTYPROGRESS) LONG(tty_progress) LONG(tty_scan_progress) \ LONG(progress_tty) LONG(scan_progress_tty) \ HELPOPT("[--scan] report progress if stderr is a tty") \ NOVAL(DIRATIME) LONG(dir_atime) LONG(dir_atimes) \ HELPOPT("[--scan,--load] keep real atimes on directories") \ NOVAL(NODIRATIME) LONG(no_dir_atime) LONG(no_dir_atimes) \ HELPOPT("[--scan,--load] fake atimes on directories") \ VAL(LAUNCH) LONG(launch) \ HELPARG("shell-cmd") HELPOPT("[--web] pass the base URL to the given command") \ NOVAL(NOEOF) LONG(no_eof) LONG(noeof) \ HELPOPT("[--web] do not close web server on EOF") \ NOVAL(MTIME) LONG(mtime) \ HELPOPT("[--scan] use mtime instead of atime") \ NOVAL(LOGICALSIZE) LONG(logicalsize) \ HELPOPT("[--logicalsize] use logical instead of physical file size") \ NOVAL(SHOWFILES) LONG(files) \ HELPOPT("[--web,--html,--text] list individual files") \ VAL(AGERANGE) SHORT(r) LONG(age_range) LONG(range) LONG(ages) \ HELPARG("age[-age]") HELPOPT("[--web,--html] set limits of colour coding") \ VAL(OUTFILE) SHORT(o) LONG(output) \ HELPARG("filename") HELPOPT("[--html] specify output file or directory name") \ NOVAL(NUMERIC) LONG(numeric) \ HELPOPT("[--html] name output files numerically") \ VAL(SERVERADDR) LONG(address) LONG(addr) LONG(server_address) \ LONG(server_addr) \ HELPARG("addr[:port]") HELPOPT("[--web] specify HTTP server address") \ VAL(AUTH) LONG(auth) LONG(http_auth) LONG(httpd_auth) \ LONG(server_auth) LONG(web_auth) \ HELPARG("type") HELPOPT("[--web] specify 
HTTP authentication method") \ VAL(AUTHFILE) LONG(auth_file) \ HELPARG("filename") HELPOPT("[--web] read HTTP Basic user/pass from file") \ VAL(AUTHFD) LONG(auth_fd) \ HELPARG("fd") HELPOPT("[--web] read HTTP Basic user/pass from fd") \ VAL(HTMLTITLE) LONG(title) \ HELPARG("title") HELPOPT("[--web,--html] title prefix for web pages") \ VAL(DEPTH) SHORT(d) LONG(depth) LONG(max_depth) LONG(maximum_depth) \ HELPARG("levels") HELPOPT("[--text,--html] recurse to this many levels") \ VAL(MINAGE) SHORT(a) LONG(age) LONG(min_age) LONG(minimum_age) \ HELPARG("age") HELPOPT("[--text] include only files older than this") \ HELPPFX("also") \ NOVAL(HELP) SHORT(h) LONG(help) HELPOPT("display this help text") \ NOVAL(VERSION) SHORT(V) LONG(version) HELPOPT("report version number") \ NOVAL(LICENCE) LONG(licence) LONG(license) \ HELPOPT("display (MIT) licence text") \ #define IGNORE(x) #define DEFENUM(x) OPT_ ## x, #define ZERO(x) 0, #define ONE(x) 1, #define STRING(x) #x , #define STRINGNOCOMMA(x) #x #define SHORTNEWOPT(x) SHORTtmp_ ## x = OPT_ ## x, #define SHORTTHISOPT(x) SHORTtmp2_ ## x, SHORTVAL_ ## x = SHORTtmp2_ ## x - 1, #define SHORTOPTVAL(x) SHORTVAL_ ## x, #define SHORTTMP(x) SHORTtmp3_ ## x, #define LONGNEWOPT(x) LONGtmp_ ## x = OPT_ ## x, #define LONGTHISOPT(x) LONGtmp2_ ## x, LONGVAL_ ## x = LONGtmp2_ ## x - 1, #define LONGOPTVAL(x) LONGVAL_ ## x, #define LONGTMP(x) SHORTtmp3_ ## x, #define OPTIONS(NOVAL, VAL, SHORT, LONG) \ OPTHELP(NOVAL, VAL, SHORT, LONG, IGNORE, IGNORE, IGNORE, IGNORE) enum { OPTIONS(DEFENUM,DEFENUM,IGNORE,IGNORE) NOPTIONS }; enum { OPTIONS(IGNORE,IGNORE,SHORTTMP,IGNORE) NSHORTOPTS }; enum { OPTIONS(IGNORE,IGNORE,IGNORE,LONGTMP) NLONGOPTS }; static const int opthasval[NOPTIONS] = {OPTIONS(ZERO,ONE,IGNORE,IGNORE)}; static const char shortopts[] = {OPTIONS(IGNORE,IGNORE,STRINGNOCOMMA,IGNORE)}; static const char *const longopts[] = {OPTIONS(IGNORE,IGNORE,IGNORE,STRING)}; enum { OPTIONS(SHORTNEWOPT,SHORTNEWOPT,SHORTTHISOPT,IGNORE) UNUSEDENUMVAL1 }; 
enum { OPTIONS(LONGNEWOPT,LONGNEWOPT,IGNORE,LONGTHISOPT) UNUSEDENUMVAL2 }; static const int shortvals[] = {OPTIONS(IGNORE,IGNORE,SHORTOPTVAL,IGNORE)}; static const int longvals[] = {OPTIONS(IGNORE,IGNORE,IGNORE,LONGOPTVAL)}; static void usage(FILE *fp) { char longbuf[80]; const char *prefix, *shortopt, *longopt, *optarg; int i, optex; #define HELPRESET prefix = shortopt = longopt = optarg = NULL, optex = -1 #define HELPNOVAL(s) optex = 0; #define HELPVAL(s) optex = 1; #define HELPSHORT(s) if (!shortopt) shortopt = "-" #s; #define HELPLONG(s) if (!longopt) { \ strcpy(longbuf, "--" #s); longopt = longbuf; \ for (i = 0; longbuf[i]; i++) if (longbuf[i] == '_') longbuf[i] = '-'; } #define HELPPFX(s) prefix = s; #define HELPARG(s) optarg = s; #define HELPLINE(s) assert(optex == -1); \ fprintf(fp, "%7s%c %s\n", prefix?prefix:"", prefix?':':' ', s); \ HELPRESET; #define HELPOPT(s) assert((optex == 1 && optarg) || (optex == 0 && !optarg)); \ assert(shortopt || longopt); \ i = fprintf(fp, "%7s%c %s%s%s%s%s", prefix?prefix:"", prefix?':':' ', \ shortopt?shortopt:"", shortopt&&longopt?", ":"", longopt?longopt:"", \ optarg?" 
":"", optarg?optarg:""); \ fprintf(fp, "%*s %s\n", i<32?32-i:0,"",s); HELPRESET; HELPRESET; OPTHELP(HELPNOVAL, HELPVAL, HELPSHORT, HELPLONG, HELPPFX, HELPARG, HELPLINE, HELPOPT); #undef HELPRESET #undef HELPNOVAL #undef HELPVAL #undef HELPSHORT #undef HELPLONG #undef HELPPFX #undef HELPARG #undef HELPLINE #undef HELPOPT } static time_t parse_age(time_t now, const char *agestr) { time_t t; struct tm tm; int nunits; char unit[2]; t = now; if (2 != sscanf(agestr, "%d%1[DdWwMmYy]", &nunits, unit)) { fprintf(stderr, "%s: age specification should be a number followed by" " one of d,w,m,y\n", PNAME); exit(1); } if (unit[0] == 'd') { t -= 86400 * nunits; } else if (unit[0] == 'w') { t -= 86400 * 7 * nunits; } else { int ym; tm = *localtime(&t); ym = tm.tm_year * 12 + tm.tm_mon; if (unit[0] == 'm') ym -= nunits; else ym -= 12 * nunits; tm.tm_year = ym / 12; tm.tm_mon = ym % 12; t = mktime(&tm); } return t; } int main(int argc, char **argv) { int fd, count; struct ctx actx, *ctx = &actx; struct stat st; off_t totalsize, realsize; void *mappedfile; triewalk *tw; indexbuild *ib; const struct trie_file *tf, *prevtf; char *filename = PNAME ".dat"; bool doing_opts = true; enum { TEXT, HTML, SCAN, DUMP, SCANDUMP, LOAD, PRESORT, POSTSORT, HTTPD, REMOVE }; struct action { int mode; char *arg; } *actions = NULL; int nactions = 0, actionsize = 0, action; time_t now = time(NULL); time_t textcutoff = now, htmlnewest = now, htmloldest = now; bool htmlautoagerange = true; const char *httpserveraddr = "localhost"; const char *httpserverport = NULL; const char *httpauthdata = NULL; const char *outfile = NULL; const char *url_launch_command = NULL; bool numeric = false; const char *html_title = PNAME; int auth = HTTPD_AUTH_MAGIC | HTTPD_AUTH_BASIC; enum { NEVER, IFTTY, ALWAYS } progress = IFTTY; struct inclusion_exclusion *inex = NULL; int ninex = 0, inexsize = 0; bool crossfs = false; int depth = -1; bool gotdepth = false; bool fakediratimes = true; bool mtime = false; bool logicalsize = 
false; bool closeoneof = true; bool showfiles = false; #ifdef DEBUG_MAD_OPTION_PARSING_MACROS { static const char *const optnames[NOPTIONS] = { OPTIONS(STRING,STRING,IGNORE,IGNORE) }; int i; for (i = 0; i < NSHORTOPTS; i++) printf("-%c == %s [%s]\n", shortopts[i], optnames[shortvals[i]], opthasval[shortvals[i]] ? "value" : "no value"); for (i = 0; i < NLONGOPTS; i++) printf("--%s == %s [%s]\n", longopts[i], optnames[longvals[i]], opthasval[longvals[i]] ? "value" : "no value"); } #endif while (--argc > 0) { char *p = *++argv; if (doing_opts && *p == '-') { bool wordstart = true; if (!strcmp(p, "--")) { doing_opts = false; continue; } p++; while (*p) { int optid = -1; int i; char *optval; if (wordstart && *p == '-') { /* * GNU-style long option. */ p++; optval = strchr(p, '='); if (optval) *optval++ = '\0'; for (i = 0; i < NLONGOPTS; i++) { const char *opt = longopts[i], *s = p; bool match = true; /* * The underscores in the option names * defined above may be given by the user * as underscores or dashes, or omitted * entirely. */ while (*opt) { if (*opt == '_') { if (*s == '-' || *s == '_') s++; } else { if (*opt != *s) { match = false; break; } s++; } opt++; } if (match && !*s) { optid = longvals[i]; break; } } if (optid < 0) { fprintf(stderr, "%s: unrecognised option '--%s'\n", PNAME, p); return 1; } if (!opthasval[optid]) { if (optval) { fprintf(stderr, "%s: unexpected argument to option" " '--%s'\n", PNAME, p); return 1; } } else { if (!optval) { if (--argc > 0) { optval = *++argv; } else { fprintf(stderr, "%s: option '--%s' expects" " an argument\n", PNAME, p); return 1; } } } p += strlen(p); /* finished with this argument word */ } else { /* * Short option. 
*/ char c = *p++; for (i = 0; i < NSHORTOPTS; i++) if (c == shortopts[i]) { optid = shortvals[i]; break; } if (optid < 0) { fprintf(stderr, "%s: unrecognised option '-%c'\n", PNAME, c); return 1; } if (opthasval[optid]) { if (*p) { optval = p; p += strlen(p); } else if (--argc > 0) { optval = *++argv; } else { fprintf(stderr, "%s: option '-%c' expects" " an argument\n", PNAME, c); return 1; } } else { optval = NULL; } } wordstart = false; /* * Now actually process the option. */ switch (optid) { case OPT_HELP: usage(stdout); return 0; case OPT_VERSION: #ifdef AGEDU_VERSION printf("%s, revision %s\n", PNAME, AGEDU_VERSION); #else printf("%s: version number not available\n", PNAME); #endif return 0; case OPT_LICENCE: { extern const char *const licence[]; int i; for (i = 0; licence[i]; i++) fputs(licence[i], stdout); } return 0; case OPT_SCAN: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = SCAN; actions[nactions].arg = optval; nactions++; break; case OPT_SCANDUMP: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = SCANDUMP; actions[nactions].arg = optval; nactions++; break; case OPT_DUMP: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = DUMP; actions[nactions].arg = NULL; nactions++; break; case OPT_LOAD: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = LOAD; actions[nactions].arg = NULL; nactions++; break; case OPT_PRESORT: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = PRESORT; actions[nactions].arg = NULL; nactions++; break; case OPT_POSTSORT: if (nactions >= actionsize) { 
actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = POSTSORT; actions[nactions].arg = NULL; nactions++; break; case OPT_TEXT: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = TEXT; actions[nactions].arg = optval; nactions++; break; case OPT_HTML: case OPT_CGI: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = HTML; actions[nactions].arg = (optid == OPT_HTML ? optval : NULL); nactions++; break; case OPT_HTTPD: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = HTTPD; actions[nactions].arg = NULL; nactions++; break; case OPT_REMOVE: if (nactions >= actionsize) { actionsize = nactions * 3 / 2 + 16; actions = sresize(actions, actionsize, struct action); } actions[nactions].mode = REMOVE; actions[nactions].arg = NULL; nactions++; break; case OPT_PROGRESS: progress = ALWAYS; break; case OPT_NOPROGRESS: progress = NEVER; break; case OPT_TTYPROGRESS: progress = IFTTY; break; case OPT_CROSSFS: crossfs = true; break; case OPT_NOCROSSFS: crossfs = false; break; case OPT_DIRATIME: fakediratimes = false; break; case OPT_NODIRATIME: fakediratimes = true; break; case OPT_SHOWFILES: showfiles = true; break; case OPT_MTIME: mtime = true; break; case OPT_LOGICALSIZE: logicalsize = true; break; case OPT_LAUNCH: url_launch_command = optval; break; case OPT_NOEOF: closeoneof = false; break; case OPT_DATAFILE: filename = optval; break; case OPT_DEPTH: if (!strcasecmp(optval, "unlimited") || !strcasecmp(optval, "infinity") || !strcasecmp(optval, "infinite") || !strcasecmp(optval, "inf") || !strcasecmp(optval, "maximum") || !strcasecmp(optval, "max")) depth = -1; else depth = atoi(optval); gotdepth = true; break; case OPT_OUTFILE: 
outfile = optval; break; case OPT_NUMERIC: numeric = true; break; case OPT_HTMLTITLE: html_title = optval; break; case OPT_MINAGE: textcutoff = parse_age(now, optval); break; case OPT_AGERANGE: if (!strcmp(optval, "auto")) { htmlautoagerange = true; } else { char *q = optval + strcspn(optval, "-:"); if (*q) *q++ = '\0'; htmloldest = parse_age(now, optval); htmlnewest = *q ? parse_age(now, q) : now; htmlautoagerange = false; } break; case OPT_SERVERADDR: { char *port; if (optval[0] == '[' && (port = strchr(optval, ']')) != NULL) { /* trim the [] from around the hostname */ *port++ = '\0'; optval++; } else { port = optval; } port += strcspn(port, ":"); if (port && *port) *port++ = '\0'; if (strchr(port, ':')) { /* * Multiple (non-square-bracket-protected) * colons have been seen in the --address * argument. That probably means someone * has used an unprotected IPv6 address * literal. */ fprintf(stderr, "%s: option '--address' expects a " "hostname, or hostname:port combination\n" "%s: if hostname is an IPv6 numeric " "address, enclose it in []\n", PNAME, PNAME); return 1; } if (!strcmp(optval, "ANY")) httpserveraddr = NULL; else httpserveraddr = optval; httpserverport = port; } break; case OPT_AUTH: if (!strcmp(optval, "magic")) auth = HTTPD_AUTH_MAGIC; else if (!strcmp(optval, "basic")) auth = HTTPD_AUTH_BASIC; else if (!strcmp(optval, "none")) auth = HTTPD_AUTH_NONE; else if (!strcmp(optval, "default")) auth = HTTPD_AUTH_MAGIC | HTTPD_AUTH_BASIC; else if (!strcmp(optval, "help") || !strcmp(optval, "list")) { printf(PNAME ": supported HTTP authentication types" " are:\n" " magic use Linux /proc/net/tcp to" " determine owner of peer socket\n" " basic HTTP Basic username and" " password authentication\n" " default use 'magic' if possible, " " otherwise fall back to 'basic'\n" " none unauthenticated HTTP (if" " the data file is non-confidential)\n"); return 0; } else { fprintf(stderr, "%s: unrecognised authentication" " type '%s'\n%*s options are 'magic'," " 'basic', 
'none', 'default'\n", PNAME, optval, (int)strlen(PNAME), ""); return 1; } break; case OPT_AUTHFILE: case OPT_AUTHFD: { int fd; char namebuf[40]; const char *name; char *authbuf; int authlen, authsize; int ret; if (optid == OPT_AUTHFILE) { fd = open(optval, O_RDONLY); if (fd < 0) { fprintf(stderr, "%s: %s: open: %s\n", PNAME, optval, strerror(errno)); return 1; } name = optval; } else { fd = atoi(optval); name = namebuf; sprintf(namebuf, "fd %d", fd); } authlen = 0; authsize = 256; authbuf = snewn(authsize, char); while ((ret = read(fd, authbuf+authlen, authsize-authlen)) > 0) { authlen += ret; if ((authsize - authlen) < (authsize / 16)) { authsize = authlen * 3 / 2 + 4096; authbuf = sresize(authbuf, authsize, char); } } if (ret < 0) { fprintf(stderr, "%s: %s: read: %s\n", PNAME, name, strerror(errno)); return 1; } if (optid == OPT_AUTHFILE) close(fd); httpauthdata = authbuf; } break; case OPT_INCLUDE: case OPT_INCLUDEPATH: case OPT_EXCLUDE: case OPT_EXCLUDEPATH: case OPT_PRUNE: case OPT_PRUNEPATH: if (ninex >= inexsize) { inexsize = ninex * 3 / 2 + 16; inex = sresize(inex, inexsize, struct inclusion_exclusion); } inex[ninex].path = (optid == OPT_INCLUDEPATH || optid == OPT_EXCLUDEPATH || optid == OPT_PRUNEPATH); inex[ninex].status = (optid == OPT_INCLUDE ? INCLUDE : optid == OPT_INCLUDEPATH ? INCLUDE : optid == OPT_EXCLUDE ? EXCLUDE : optid == OPT_EXCLUDEPATH ? EXCLUDE : optid == OPT_PRUNE ? PRUNE : /* optid == OPT_PRUNEPATH ? 
*/ PRUNE); inex[ninex].wildcard = optval; ninex++; break; } } } else { fprintf(stderr, "%s: unexpected argument '%s'\n", PNAME, p); return 1; } } if (nactions == 0) { usage(stderr); return 1; } for (action = 0; action < nactions; action++) { int mode = actions[action].mode; if (mode == SCAN || mode == SCANDUMP || mode == LOAD) { const char *scandir = actions[action].arg; dumpfile_load_state *dls; if (mode == LOAD) { dls = dumpfile_load_init(stdin, true); if (!dls) return 1; pathsep = dumpfile_load_get_pathsep(dls); } if (mode == SCAN || mode == LOAD) { /* * Prepare to write out the index file. */ fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, S_IRUSR | S_IWUSR); if (fd < 0) { fprintf(stderr, "%s: %s: open: %s\n", PNAME, filename, strerror(errno)); return 1; } if (fstat(fd, &st) < 0) { perror(PNAME ": fstat"); return 1; } ctx->datafile_dev = st.st_dev; ctx->datafile_ino = st.st_ino; ctx->straight_to_dump = false; } else { ctx->datafile_dev = -1; ctx->datafile_ino = -1; ctx->straight_to_dump = true; } if (mode == SCAN || mode == SCANDUMP) { if (stat(scandir, &st) < 0) { fprintf(stderr, "%s: %s: stat: %s\n", PNAME, scandir, strerror(errno)); return 1; } ctx->filesystem_dev = crossfs ? 0 : st.st_dev; } ctx->inex = inex; ctx->ninex = ninex; ctx->crossfs = crossfs; ctx->fakeatimes = fakediratimes; ctx->usemtime = mtime; ctx->uselogicalsize = logicalsize; ctx->last_output_update = time(NULL); if (progress == IFTTY) ctx->progress = isatty(2); else ctx->progress = (progress == ALWAYS); { struct winsize ws; if (ctx->progress && ioctl(2, TIOCGWINSZ, &ws) == 0 && ws.ws_col > 0) ctx->progwidth = ws.ws_col - 1; else ctx->progwidth = 79; } if (mode == SCANDUMP) { writestate.fp = stdout; writestate.sortable = false; writestate.pathsep = pathsep; if (!dump_write_header(&writestate)) fatal("standard output: %s", strerror(errno)); } /* * Scan the directory tree, and write out the trie component * of the data file. 
*/ if (mode != SCANDUMP) { ctx->tb = triebuild_new(fd); } if (mode == LOAD) { dumpfile_record dr; int retd; while ((retd = dumpfile_load_record(dls, &dr)) != 0) { if (retd < 0) { remove(filename); return 1; } triebuild_add(ctx->tb, dr.pathname, &dr.tf); } dumpfile_load_finish(dls); } else { du(scandir, gotdata, scan_error, ctx); } if (mode != SCANDUMP) { size_t maxpathlen; size_t delta; char *buf, *prevbuf; count = triebuild_finish(ctx->tb); triebuild_free(ctx->tb); if (ctx->progress) { fprintf(stderr, "%-*s\r", ctx->progwidth, ""); fflush(stderr); } /* * Work out how much space the cumulative index trees * will take; enlarge the file, and memory-map it. */ if (fstat(fd, &st) < 0) { perror(PNAME ": fstat"); return 1; } printf("Built pathname index, %d entries," " %llu bytes of index\n", count, (unsigned long long)st.st_size); totalsize = index_initial_size(st.st_size, count); totalsize += totalsize / 10; if (lseek(fd, totalsize-1, SEEK_SET) < 0) { perror(PNAME ": lseek"); return 1; } if (write(fd, "\0", 1) < 1) { perror(PNAME ": write"); return 1; } mappedfile = mmap(NULL, totalsize, PROT_READ|PROT_WRITE,MAP_SHARED, fd, 0); if (mappedfile == MAP_FAILED) { perror(PNAME ": mmap"); return 1; } if (fakediratimes) { printf("Faking directory atimes\n"); trie_fake_dir_atimes(mappedfile); } printf("Building index\n"); ib = indexbuild_new(mappedfile, st.st_size, count, &delta); maxpathlen = trie_maxpathlen(mappedfile); buf = snewn(maxpathlen, char); prevbuf = snewn(maxpathlen, char); tw = triewalk_new(mappedfile); prevbuf[0] = '\0'; tf = triewalk_next(tw, buf); assert(tf); while (1) { int i; if (totalsize - indexbuild_realsize(ib) < delta) { const void *oldfile = mappedfile; ptrdiff_t diff; /* * Unmap the file, grow it, and remap it. 
*/ munmap(mappedfile, totalsize); totalsize += delta; totalsize += totalsize / 10; if (lseek(fd, totalsize-1, SEEK_SET) < 0) { perror(PNAME ": lseek"); return 1; } if (write(fd, "\0", 1) < 1) { perror(PNAME ": write"); return 1; } mappedfile = mmap(NULL, totalsize, PROT_READ|PROT_WRITE,MAP_SHARED, fd, 0); if (mappedfile == MAP_FAILED) { perror(PNAME ": mmap"); return 1; } indexbuild_rebase(ib, mappedfile); triewalk_rebase(tw, mappedfile); diff = (const unsigned char *)mappedfile - (const unsigned char *)oldfile; if (tf) tf = (const struct trie_file *) (((const unsigned char *)tf) + diff); } /* * Get the next file from the index. So we are * currently holding, and have not yet * indexed, prevtf (with pathname prevbuf) and * tf (with pathname buf). */ prevtf = tf; memcpy(prevbuf, buf, maxpathlen); tf = triewalk_next(tw, buf); if (!tf) buf[0] = '\0'; /* * Find the first differing character position * between our two pathnames. */ for (i = 0; prevbuf[i] && prevbuf[i] == buf[i]; i++); /* * If prevbuf was a directory name and buf is * something inside that directory, then * trie_before() will be called on prevbuf * itself. Hence we must drop a tag before it, * so that the resulting index is usable. */ if ((!prevbuf[i] && (buf[i] == pathsep || (i > 0 && buf[i-1] == pathsep)))) indexbuild_tag(ib); /* * Add prevtf to the index. */ indexbuild_add(ib, prevtf); if (!tf) { /* * Drop an unconditional final tag, and * get out of this loop. */ indexbuild_tag(ib); break; } /* * If prevbuf was a filename inside some * directory which buf is outside, then * trie_before() will be called on some * pathname either equal to buf or epsilon * less than it. Either way, we're going to * need to drop a tag after prevtf. 
                 */
                if (strchr(prevbuf+i, pathsep) || !tf)
                    indexbuild_tag(ib);
            }

            triewalk_free(tw);
            realsize = indexbuild_realsize(ib);
            indexbuild_free(ib);

            munmap(mappedfile, totalsize);
            if (ftruncate(fd, realsize) < 0)
                fatal("%s: truncate: %s\n", filename, strerror(errno));
            close(fd);
            printf("Final index file size = %llu bytes\n",
                   (unsigned long long)realsize);
        }
    } else if (mode == TEXT) {
        char *querydir = actions[action].arg;
        size_t pathlen;

        fd = open(filename, O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "%s: %s: open: %s\n", PNAME, filename,
                    strerror(errno));
            return 1;
        }
        if (fstat(fd, &st) < 0) {
            perror(PNAME ": fstat");
            return 1;
        }
        totalsize = st.st_size;
        mappedfile = mmap(NULL, totalsize, PROT_READ, MAP_SHARED, fd, 0);
        if (mappedfile == MAP_FAILED) {
            perror(PNAME ": mmap");
            return 1;
        }
        if (!trie_check_magic(mappedfile)) {
            fprintf(stderr, "%s: %s: magic numbers did not match\n"
                    "%s: check that the index was built by this version of agedu on this platform\n",
                    PNAME, filename, PNAME);
            return 1;
        }
        pathsep = trie_pathsep(mappedfile);

        /*
         * Trim trailing slash, just in case.
         */
        pathlen = strlen(querydir);
        if (pathlen > 0 && querydir[pathlen-1] == pathsep)
            querydir[--pathlen] = '\0';

        if (!gotdepth)
            depth = 1;                 /* default for text mode */

        if (outfile != NULL) {
            FILE *fp = fopen(outfile, "w");
            if (!fp) {
                fprintf(stderr, "%s: %s: open: %s\n", PNAME, outfile,
                        strerror(errno));
                return 1;
            }
            text_query(mappedfile, querydir, textcutoff, showfiles,
                       depth, fp);
            fclose(fp);
        } else {
            text_query(mappedfile, querydir, textcutoff, showfiles,
                       depth, stdout);
        }

        munmap(mappedfile, totalsize);
    } else if (mode == HTML) {
        char *querydir = actions[action].arg;
        size_t pathlen, maxpathlen;
        char *pathbuf;
        struct html_config cfg;
        unsigned long xi;
        char *html;

        fd = open(filename, O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "%s: %s: open: %s\n", PNAME, filename,
                    strerror(errno));
            if (!querydir) {
                printf("Status: 500\nContent-type: text/html\n\n"
                       "<html><head>"
                       "<title>500 Internal Server Error</title>"
                       "</head><body>"
                       "<h1>500 Internal Server Error</h1>"
                       "<p>agedu suffered an internal error.</p>"
                       "</body></html>\n");
                return 0;
            }
            return 1;
        }
        if (fstat(fd, &st) < 0) {
            fprintf(stderr, "%s: %s: fstat: %s\n", PNAME, filename,
                    strerror(errno));
            if (!querydir) {
                printf("Status: 500\nContent-type: text/html\n\n"
                       "<html><head>"
                       "<title>500 Internal Server Error</title>"
                       "</head><body>"
                       "<h1>500 Internal Server Error</h1>"
                       "<p>agedu suffered an internal error.</p>"
                       "</body></html>\n");
                return 0;
            }
            return 1;
        }
        totalsize = st.st_size;
        mappedfile = mmap(NULL, totalsize, PROT_READ, MAP_SHARED, fd, 0);
        if (mappedfile == MAP_FAILED) {
            fprintf(stderr, "%s: %s: mmap: %s\n", PNAME, filename,
                    strerror(errno));
            if (!querydir) {
                printf("Status: 500\nContent-type: text/html\n\n"
                       "<html><head>"
                       "<title>500 Internal Server Error</title>"
                       "</head><body>"
                       "<h1>500 Internal Server Error</h1>"
                       "<p>agedu suffered an internal error.</p>"
                       "</body></html>\n");
                return 0;
            }
            return 1;
        }
        if (!trie_check_magic(mappedfile)) {
            fprintf(stderr, "%s: %s: magic numbers did not match\n"
                    "%s: check that the index was built by this version of agedu on this platform\n",
                    PNAME, filename, PNAME);
            if (!querydir) {
                printf("Status: 500\nContent-type: text/html\n\n"
                       "<html><head>"
                       "<title>500 Internal Server Error</title>"
                       "</head><body>"
                       "<h1>500 Internal Server Error</h1>"
                       "<p>agedu suffered an internal error.</p>"
                       "</body></html>\n");
                return 0;
            }
            return 1;
        }
        pathsep = trie_pathsep(mappedfile);
        maxpathlen = trie_maxpathlen(mappedfile);
        pathbuf = snewn(maxpathlen, char);

        if (!querydir || !gotdepth) {
            /*
             * Single output file.
             */
            if (!querydir) {
                cfg.uriformat = "/%|/%p/%|%|/%p";
            } else {
                cfg.uriformat = NULL;
            }
            cfg.autoage = htmlautoagerange;
            cfg.oldest = htmloldest;
            cfg.newest = htmlnewest;
            cfg.showfiles = showfiles;
        } else {
            if (!numeric) {
                cfg.uriformat = "/index.html%|/%/p.html";
                cfg.fileformat = "/index.html%|/%/p.html";
            } else {
                cfg.uriformat = "/index.html%|/%n.html";
                cfg.fileformat = "/index.html%|/%n.html";
            }
            cfg.autoage = htmlautoagerange;
            cfg.oldest = htmloldest;
            cfg.newest = htmlnewest;
            cfg.showfiles = showfiles;
        }
        cfg.html_title = html_title;

        if (!querydir) {
            /*
             * If we're run in --cgi mode, read PATH_INFO to get
             * a numeric pathname index.
             */
            char *path_info = getenv("PATH_INFO");
            if (!path_info)
                path_info = "";

            /*
             * Parse the path.
             */
            if (!html_parse_path(mappedfile, path_info, &cfg, &xi)) {
                printf("Status: 404\nContent-type: text/html\n\n"
                       "<html><head>"
                       "<title>404 Not Found</title>"
                       "</head><body>"
                       "<h1>404 Not Found</h1>"
                       "<p>Invalid agedu pathname.</p>"
                       "</body></html>\n");
                return 0;
            }

            /*
             * If the path was parseable but not canonically
             * expressed, return a redirect to the canonical
             * version.
             */
            char *canonpath = html_format_path(mappedfile, &cfg, xi);
            if (strcmp(canonpath, path_info)) {
                char *servername = getenv("SERVER_NAME");
                char *scriptname = getenv("SCRIPT_NAME");
                if (!servername || !scriptname) {
                    if (servername)
                        fprintf(stderr, "%s: SCRIPT_NAME unset\n", PNAME);
                    else if (scriptname)
                        fprintf(stderr, "%s: SERVER_NAME unset\n", PNAME);
                    else
                        fprintf(stderr, "%s: SERVER_NAME and "
                                "SCRIPT_NAME both unset\n", PNAME);
                    printf("Status: 500\nContent-type: text/html\n\n"
                           "<html><head>"
                           "<title>500 Internal Server Error</title>"
                           "</head><body>"
                           "<h1>500 Internal Server Error</h1>"
                           "<p>agedu suffered an internal error.</p>"
                           "</body></html>\n");
                    return 0;
                }
                printf("Status: 301\n"
                       "Location: http://%s/%s%s\n"
                       "Content-type: text/html\n\n"
                       "<html><head>"
                       "<title>301 Moved</title>"
                       "</head><body>"
                       "<h1>301 Moved</h1>"
                       "<p>Moved.</p>"
                       "</body></html>\n",
                       servername, scriptname, canonpath);
                return 0;
            }
        } else {
            /*
             * In ordinary --html mode, process a query
             * directory passed in on the command line.
             */

            /*
             * Trim trailing slash, just in case.
             *
             * (Note that we do this if pathlen > 1, not if
             * pathlen > 0. That is, the one case of a trailing
             * slash that we leave intact is the case where it's
             * the whole string because the query directory is
             * just "/".)
             */
            pathlen = strlen(querydir);
            if (pathlen > 1 && querydir[pathlen-1] == pathsep)
                querydir[--pathlen] = '\0';

            xi = trie_before(mappedfile, querydir);
            if (xi >= trie_count(mappedfile) ||
                (trie_getpath(mappedfile, xi, pathbuf),
                 strcmp(pathbuf, querydir))) {
                fprintf(stderr, "%s: pathname '%s' does not exist in index\n"
                        "%*s(check it is spelled exactly as it is in the "
                        "index, including\n%*sany leading './')\n",
                        PNAME, querydir,
                        (int)(1+sizeof(PNAME)), "",
                        (int)(1+sizeof(PNAME)), "");
                return 1;
            } else if (!index_has_root(mappedfile, xi)) {
                fprintf(stderr, "%s: pathname '%s' is"
                        " a file, not a directory\n", PNAME, querydir);
                return 1;
            }
        }

        if (!querydir || !gotdepth) {
            /*
             * Single output file.
             */
            html = html_query(mappedfile, xi, &cfg, true);
            if (querydir && outfile != NULL) {
                FILE *fp = fopen(outfile, "w");
                if (!fp) {
                    fprintf(stderr, "%s: %s: open: %s\n", PNAME,
                            outfile, strerror(errno));
                    return 1;
                } else if (fputs(html, fp) < 0) {
                    fprintf(stderr, "%s: %s: write: %s\n", PNAME,
                            outfile, strerror(errno));
                    fclose(fp);
                    return 1;
                } else if (fclose(fp) < 0) {
                    fprintf(stderr, "%s: %s: fclose: %s\n", PNAME,
                            outfile, strerror(errno));
                    return 1;
                }
            } else {
                if (!querydir) {
                    printf("Content-type: text/html\n\n");
                }
                fputs(html, stdout);
            }
        } else {
            /*
             * Multiple output files.
             */
            int dirlen = outfile ?
2+strlen(outfile) : 3; char prefix[dirlen]; if (outfile) { if (mkdir(outfile, 0777) < 0 && errno != EEXIST) { fprintf(stderr, "%s: %s: mkdir: %s\n", PNAME, outfile, strerror(errno)); return 1; } snprintf(prefix, dirlen, "%s/", outfile); } else snprintf(prefix, dirlen, "./"); unsigned long xi2; /* * pathbuf is only set up in the plain-HTML case and * not in the CGI case; but that's OK, because the * CGI case can't come to this branch of the if * anyway. */ make_successor(pathbuf); xi2 = trie_before(mappedfile, pathbuf); if (html_dump(mappedfile, xi, xi2, depth, &cfg, prefix)) return 1; } munmap(mappedfile, totalsize); sfree(pathbuf); } else if (mode == DUMP) { size_t maxpathlen; char *buf; fd = open(filename, O_RDONLY); if (fd < 0) { fprintf(stderr, "%s: %s: open: %s\n", PNAME, filename, strerror(errno)); return 1; } if (fstat(fd, &st) < 0) { perror(PNAME ": fstat"); return 1; } totalsize = st.st_size; mappedfile = mmap(NULL, totalsize, PROT_READ, MAP_SHARED, fd, 0); if (mappedfile == MAP_FAILED) { perror(PNAME ": mmap"); return 1; } if (!trie_check_magic(mappedfile)) { fprintf(stderr, "%s: %s: magic numbers did not match\n" "%s: check that the index was built by this version of agedu on this platform\n", PNAME, filename, PNAME); return 1; } pathsep = trie_pathsep(mappedfile); maxpathlen = trie_maxpathlen(mappedfile); buf = snewn(maxpathlen, char); writestate.fp = stdout; writestate.sortable = false; writestate.pathsep = pathsep; if (!dump_write_header(&writestate)) fatal("standard output: %s", strerror(errno)); tw = triewalk_new(mappedfile); while ((tf = triewalk_next(tw, buf)) != NULL) dump_line(buf, tf); triewalk_free(tw); munmap(mappedfile, totalsize); } else if (mode == HTTPD) { struct html_config pcfg; struct httpd_config dcfg; fd = open(filename, O_RDONLY); if (fd < 0) { fprintf(stderr, "%s: %s: open: %s\n", PNAME, filename, strerror(errno)); return 1; } if (fstat(fd, &st) < 0) { perror(PNAME ": fstat"); return 1; } totalsize = st.st_size; mappedfile = 
mmap(NULL, totalsize, PROT_READ, MAP_SHARED, fd, 0); if (mappedfile == MAP_FAILED) { perror(PNAME ": mmap"); return 1; } if (!trie_check_magic(mappedfile)) { fprintf(stderr, "%s: %s: magic numbers did not match\n" "%s: check that the index was built by this version of agedu on this platform\n", PNAME, filename, PNAME); return 1; } pathsep = trie_pathsep(mappedfile); dcfg.address = httpserveraddr; dcfg.port = httpserverport; dcfg.closeoneof = closeoneof; dcfg.basicauthdata = httpauthdata; dcfg.url_launch_command = url_launch_command; pcfg.uriformat = "/%|/%p/%|%|/%p"; pcfg.autoage = htmlautoagerange; pcfg.oldest = htmloldest; pcfg.newest = htmlnewest; pcfg.showfiles = showfiles; pcfg.html_title = html_title; run_httpd(mappedfile, auth, &dcfg, &pcfg); munmap(mappedfile, totalsize); } else if (mode == REMOVE) { if (remove(filename) < 0) { fprintf(stderr, "%s: %s: remove: %s\n", PNAME, filename, strerror(errno)); return 1; } } else if (mode == PRESORT || mode == POSTSORT) { dumpfile_load_state *dls; dumpfile_write_state dws; dls = dumpfile_load_init(stdin, false); if (!dls) return 1; dws.fp = stdout; dws.pathsep = dumpfile_load_get_pathsep(dls); dws.sortable = (mode == PRESORT); if (!dump_write_header(&dws)) fatal("standard output: %s", strerror(errno)); dumpfile_record dr; int retd; while ((retd = dumpfile_load_record(dls, &dr)) != 0) { if (retd < 0) return 1; if (!dump_write_record(&dws, &dr)) fatal("standard output: %s", strerror(errno)); } dumpfile_load_finish(dls); } } return 0; } agedu-20211129.8cd63c5/agedu.but0000644000175000017500000011100414151034324014747 0ustar simonsimon\cfg{man-identity}{agedu}{1}{2008-11-02}{Simon Tatham}{Simon Tatham} \define{dash} \u2013{-} \title Man page for \cw{agedu} \U NAME \cw{agedu} \dash correlate disk usage with last-access times to identify large and disused data \U SYNOPSIS \c agedu [ options ] action [action...] 
\e bbbbb iiiiiii iiiiii iiiiii

\U DESCRIPTION

\cw{agedu} scans a directory tree and produces reports about how much disk space is used in each directory and subdirectory, and also how that usage of disk space corresponds to files with last-access times a long time ago.

In other words, \cw{agedu} is a tool you might use to help you free up disk space. It lets you see which directories are taking up the most space, as \cw{du} does; but unlike \cw{du}, it also distinguishes between large collections of data which are still in use and ones which have not been accessed in months or years \dash for instance, large archives downloaded, unpacked, used once, and never cleaned up. Where \cw{du} helps you find what's using your disk space, \cw{agedu} helps you find what's \e{wasting} your disk space.

\cw{agedu} has several operating modes. In one mode, it scans your disk and builds an index file containing a data structure which allows it to efficiently retrieve any information it might need. Typically, you would use it in this mode first, and then run it in one of a number of \q{query} modes to display a report of the disk space usage of a particular directory and its subdirectories. Those reports can be produced as plain text (much like \cw{du}) or as HTML. \cw{agedu} can even run as a miniature web server, presenting each directory's HTML report with hyperlinks to let you navigate around the file system to similar reports for other directories.

So you would typically start using \cw{agedu} by telling it to do a scan of a directory tree and build an index. This is done with a command such as

\c $ agedu -s /home/fred
\e   bbbbbbbbbbbbbbbbbbb

which will build a large data file called \c{agedu.dat} in your current directory. (If that current directory is \e{inside} \cw{/home/fred}, don't worry \dash \cw{agedu} is smart enough to discount its own index file.)

Having built the index, you would now query it for reports of disk space usage.
If you have a graphical web browser, the simplest and nicest way to query the index is by running \cw{agedu} in web server mode:

\c $ agedu -w
\e   bbbbbbbb

which will print (among other messages) a URL on its standard output along the lines of

\c URL: http://127.0.0.1:48638/

(That URL will always begin with \cq{127.}, meaning that it's in the \cw{localhost} address space. So only processes running on the same computer can even try to connect to that web server, and also there is access control to prevent other users from seeing it \dash see below for more detail.)

Now paste that URL into your web browser, and you will be shown a graphical representation of the disk usage in \cw{/home/fred} and its immediate subdirectories, with varying colours used to show the difference between disused and recently-accessed data. Click on any subdirectory to descend into it and see a report for its subdirectories in turn; click on parts of the pathname at the top of any page to return to higher-level directories. When you've finished browsing, you can just press Ctrl-D to send an end-of-file indication to \cw{agedu}, and it will shut down.

After that, you probably want to delete the data file \cw{agedu.dat}, since it's pretty large. In fact, the command \cw{agedu -R} will do this for you; and you can chain \cw{agedu} commands on the same command line, so that instead of the above you could have done

\c $ agedu -s /home/fred -w -R
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

for a single self-contained run of \cw{agedu} which builds its index, serves web pages from it, and cleans it up when finished.

In some situations, you might want to scan the directory structure of one computer, but run \cw{agedu}'s user interface on another.
In that case, you can do your scan using the \cw{agedu -S} option in place of \cw{agedu -s}, which will make \cw{agedu} not bother building an index file but instead just write out its scan results in plain text on standard output; then you can funnel that output to the other machine using SSH (or whatever other technique you prefer), and there, run \cw{agedu -L} to load in the textual dump and turn it into an index file. For example, you might run a command like this (plus any \cw{ssh} options you need) on the machine you want to scan:

\c $ agedu -S /home/fred | ssh indexing-machine agedu -L
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

or, equivalently, run something like this on the other machine:

\c $ ssh machine-to-scan agedu -S /home/fred | agedu -L
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

Either way, the \cw{agedu -L} command will create an \cw{agedu.dat} index file, which you can then use with \cw{agedu -w} just as above.

(Another way to do this might be to build the index file on the first machine as normal, and then just copy it to the other machine once it's complete. However, for efficiency, the index file is formatted differently depending on the CPU architecture that \cw{agedu} is compiled for. So if that doesn't match between the two machines \dash e.g. if one is a 32-bit machine and one 64-bit \dash then \cw{agedu.dat} files written on one machine will not work on the other. The technique described above using \cw{-S} and \cw{-L} should work between any two machines.)

If you don't have a graphical web browser, you can do text-based queries instead of using \cw{agedu}'s web interface. Having scanned \cw{/home/fred} in any of the ways suggested above, you might run

\c $ agedu -t /home/fred
\e   bbbbbbbbbbbbbbbbbbb

which again gives a summary of the disk usage in \cw{/home/fred} and its immediate subdirectories; but this time \cw{agedu} will print it on standard output, in much the same format as \cw{du}.
If you then want to find out how much \e{old} data is there, you can add the \cw{-a} option to show only files last accessed a certain length of time ago. For example, to show only files which haven't been looked at in six months or more:

\c $ agedu -t /home/fred -a 6m
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

That's the essence of what \cw{agedu} does. It has other modes of operation for more complex situations, and the usual array of configurable options. The following sections contain a complete reference for all its functionality.

\U OPERATING MODES

This section describes the operating modes supported by \cw{agedu}. Each of these is in the form of a command-line option, sometimes with an argument. Multiple operating-mode options may appear on the command line, in which case \cw{agedu} will perform the specified actions one after another. For instance, as shown in the previous section, you might want to perform a disk scan and immediately launch a web server giving reports from that scan.

\dt \cw{-s} \e{directory} or \cw{--scan} \e{directory}

\dd In this mode, \cw{agedu} scans the file system starting at the specified directory, and indexes the results of the scan into a large data file which other operating modes can query.

\lcont{

By default, the scan is restricted to a single file system (since the expected use of \cw{agedu} is that you would probably use it because a particular disk partition was running low on space). You can remove that restriction using the \cw{--cross-fs} option; other configuration options allow you to include or exclude files or entire subdirectories from the scan. See the next section for full details of the configurable options.

The index file is created with restrictive permissions, in case the file system you are scanning contains confidential information in its structure.

Index files are dependent on the characteristics of the CPU architecture you created them on.
You should not expect to be able to move an index file between different types of computer and have it continue to work. If you need to transfer the results of a disk scan to a different kind of computer, see the \cw{-D} and \cw{-L} options below. } \dt \cw{-w} or \cw{--web} \dd In this mode, \cw{agedu} expects to find an index file already written. It allocates a network port, and starts up a web server on that port which serves reports generated from the index file. By default it invents its own URL and prints it out. \lcont{ The web server runs until \cw{agedu} receives an end-of-file event on its standard input. (The expected usage is that you run it from the command line, immediately browse web pages until you're satisfied, and then press Ctrl-D.) To disable the EOF behaviour, use the \cw{--no-eof} option. In case the index file contains any confidential information about your file system, the web server protects the pages it serves from access by other people. On Linux, this is done transparently by means of using \cw{/proc/net/tcp} to check the owner of each incoming connection; failing that, the web server will require a password to view the reports, and \cw{agedu} will print the password it invented on standard output along with the URL. Configurable options for this mode let you specify your own address and port number to listen on, and also specify your own choice of authentication method (including turning authentication off completely) and a username and password of your choice. } \dt \cw{-t} \e{directory} or \cw{--text} \e{directory} \dd In this mode, \cw{agedu} generates a textual report on standard output, listing the disk usage in the specified directory and all its subdirectories down to a given depth. By default that depth is 1, so that you see a report for \e{directory} itself and all of its immediate subdirectories. You can configure a different depth (or no depth limit) using \cw{-d}, described in the next section. 
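As a rough cross-check on such age-filtered reports, GNU \cw{find}(1) and \cw{touch}(1) can imitate the cut-off. This sketch is illustrative only: \cw{agedu} answers these queries from its prebuilt index rather than re-walking the disk, and the file names here are invented.

```shell
# Imitate the spirit of `agedu -t tree -a 6m` by planting a file
# whose atime is ~400 days old, then listing files not accessed
# for more than 180 days (GNU touch -d assumed).
cd "$(mktemp -d)"
mkdir tree
touch tree/fresh.txt
touch -a -d '400 days ago' tree/stale.txt
find tree -type f -atime +180
```

Only the artificially aged file is listed; the freshly created one is not.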
\lcont{ Used on its own, \cw{-t} merely lists the \e{total} disk usage in each subdirectory; \cw{agedu}'s additional ability to distinguish unused from recently-used data is not activated. To activate it, use the \cw{-a} option to specify a minimum age. The directory structure stored in \cw{agedu}'s index file is treated as a set of literal strings. This means that you cannot refer to directories by synonyms. So if you ran \cw{agedu -s .}, then all the path names you later pass to the \cw{-t} option must be either \cq{.} or begin with \cq{./}. Similarly, symbolic links within the directory you scanned will not be followed; you must refer to each directory by its canonical, symlink-free pathname. } \dt \cw{-R} or \cw{--remove} \dd In this mode, \cw{agedu} deletes its index file. Running just \cw{agedu -R} on its own is therefore equivalent to typing \cw{rm agedu.dat}. However, you can also put \cw{-R} on the end of a command line to indicate that \cw{agedu} should delete its index file after it finishes performing other operations. \dt \cw{-S} \e{directory} or \cw{--scan-dump} \e{directory} \dd In this mode, \cw{agedu} will scan a directory tree and convert the results straight into a textual dump on standard output, without generating an index file at all. The dump data is intended for \cw{agedu -L} to read. \dt \cw{-L} or \cw{--load} \dd In this mode, \cw{agedu} expects to read a dump produced by the \cw{-S} option from its standard input. It constructs an index file from that dump, exactly as it would have if it had read the same data from a disk scan in \cw{-s} mode. \dt \cw{-D} or \cw{--dump} \dd In this mode, \cw{agedu} reads an existing index file and produces a dump of its contents on standard output, in the same format used by \cw{-S} and \cw{-L}. 
This option could be used to convert an existing index file into a format acceptable to a different kind of computer, by dumping it using \cw{-D} and then loading the dump back in on the other machine using \cw{-L}. \lcont{ (The output of \cw{agedu -D} on an existing index file will not be exactly \e{identical} to what \cw{agedu -S} would have originally produced, due to a difference in treatment of last-access times on directories. However, it should be effectively equivalent for most purposes. See the documentation of the \cw{--dir-atime} option in the next section for further detail.) } \dt \cw{-H} \e{directory} or \cw{--html} \e{directory} \dd In this mode, \cw{agedu} will generate an HTML report of the disk usage in the specified directory and its immediate subdirectories, in the same form that it serves from its web server in \cw{-w} mode. \lcont{ By default, a single HTML report will be generated and simply written to standard output, with no hyperlinks pointing to other similar pages. If you also specify the \cw{-d} option (see below), \cw{agedu} will instead write out a collection of HTML files with hyperlinks between them, and call the top-level file \cw{index.html}. } \dt \cw{--cgi} \dd In this mode, \cw{agedu} will run as the bulk of a CGI script which provides the same set of web pages as the built-in web server would. It will read the usual CGI environment variables, and write CGI-style data to its standard output. \lcont{ The actual CGI program itself should be a tiny wrapper around \cw{agedu} which passes it the \cw{--cgi} option, and also (probably) \cw{-f} to locate the index file. \cw{agedu} will do everything else. For example, your script might read \c #!/bin/sh \c /some/path/to/agedu --cgi -f /some/other/path/to/agedu.dat \e iiiiiiiiiiiiii iiiiiiiiiiiiiiiiiiii (Note that \cw{agedu} will produce the \e{entire} CGI output, including status code, HTTP headers and the full HTML document. 
If you try to surround the call to \cw{agedu --cgi} with code that adds your own HTML header and footer, you won't get the results you want, and \cw{agedu}'s HTTP-level features such as auto-redirecting to canonical versions of URIs will stop working.) No access control is performed in this mode: restricting access to CGI scripts is assumed to be the job of the web server. } \dt \cw{--presort} and \cw{--postsort} \dd In these two modes, \cw{agedu} will expect to read a textual data dump from its standard input of the form produced by \cw{-S} (and \c{-D}). It will transform the data into a different version of its text dump format, and write the transformed version on standard output. \lcont{ The ordinary dump file format is reasonably readable, but loading it into an index file using \cw{agedu -L} requires it to be sorted in a specific order, which is complicated to describe and difficult to implement using ordinary Unix sorting tools. So if you want to construct your own data dump from a source of your own that \cw{agedu} itself doesn't know how to scan, you will need to make sure it's sorted in the right order. To help with this, \cw{agedu} provides a secondary dump format which is \q{sortable}, in the sense that ordinary \cw{sort}(\e{1}) without arguments will arrange it into the right order. However, the sortable format is much more unreadable and also twice the size, so you wouldn't want to write it directly! So the recommended procedure is to generate dump data in the ordinary format; then pipe it through \cw{agedu --presort} to turn it into the sortable format; then sort it; \e{then} pipe it into \cw{agedu -L} (which can accept either the normal or the sortable format as input). For example: \c generate_custom_data.sh | agedu --presort | sort | agedu -L \e iiiiiiiiiiiiiiiiiiiiiii If you need to transform the sorted dump file back into the ordinary format, \cw{agedu --postsort} can do that. 
But since \cw{agedu -L} can accept either format as input, you may not need to. } \dt \cw{-h} or \cw{--help} \dd Causes \cw{agedu} to print some help text and terminate immediately. \dt \cw{-V} or \cw{--version} \dd Causes \cw{agedu} to print its version number and terminate immediately. \U OPTIONS This section describes the various configuration options that affect \cw{agedu}'s operation in one mode or another. The following option affects nearly all modes (except \cw{-S}): \dt \cw{-f} \e{filename} or \cw{--file} \e{filename} \dd Specifies the location of the index file which \cw{agedu} creates, reads or removes depending on its operating mode. By default, this is simply \cq{agedu.dat}, in whatever is the current working directory when you run \cw{agedu}. The following options affect the disk-scanning modes, \cw{-s} and \cw{-S}: \dt \cw{--cross-fs} and \cw{--no-cross-fs} \dd These configure whether or not the disk scan is permitted to cross between different file systems. The default is not to: \cw{agedu} will normally skip over subdirectories on which a different file system is mounted. This is convenient when you want to free up space on a particular file system which is running low. However, in other circumstances you might wish to see general information about the use of space no matter which file system it's on (for instance, if your real concern is your backup media running out of space, and if your backups do not treat different file systems specially); in that situation, use \cw{--cross-fs}. \lcont{ (Note that this default is the opposite way round from the corresponding option in \cw{du}.) } \dt \cw{--prune} \e{wildcard} and \cw{--prune-path} \e{wildcard} \dd These cause particular files or directories to be omitted entirely from the scan.
If \cw{agedu}'s scan encounters a file or directory whose name matches the wildcard provided to the \cw{--prune} option, it will not include that file in its index, and also if it's a directory it will skip over it and not scan its contents. \lcont{ Note that in most Unix shells, wildcards will probably need to be escaped on the command line, to prevent the shell from expanding the wildcard before \cw{agedu} sees it. \cw{--prune-path} is similar to \cw{--prune}, except that the wildcard is matched against the entire pathname instead of just the filename at the end of it. So whereas \cw{--prune *a*b*} will match any file whose actual name contains an \cw{a} somewhere before a \cw{b}, \cw{--prune-path *a*b*} will also match a file whose name contains \cw{b} and which is inside a directory containing an \cw{a}, or any file inside a directory of that form, and so on. } \dt \cw{--exclude} \e{wildcard} and \cw{--exclude-path} \e{wildcard} \dd These cause particular files or directories to be omitted from the index, but not from the scan. If \cw{agedu}'s scan encounters a file or directory whose name matches the wildcard provided to the \cw{--exclude} option, it will not include that file in its index \dash but unlike \cw{--prune}, if the file in question is a directory it will still scan its contents and index them if they are not ruled out themselves by \cw{--exclude} options. \lcont{ As above, \cw{--exclude-path} is similar to \cw{--exclude}, except that the wildcard is matched against the entire pathname. } \dt \cw{--include} \e{wildcard} and \cw{--include-path} \e{wildcard} \dd These cause particular files or directories to be re-included in the index and the scan, if they had previously been ruled out by one of the above exclude or prune options. You can interleave include, exclude and prune options as you wish on the command line, and if more than one of them applies to a file then the last one takes priority. 
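The last-match-wins rule can be modelled in a few lines of shell. This is a toy illustration, not \cw{agedu}'s implementation, and unlike plain \cw{--include}/\cw{--exclude} it matches each pattern against the whole string, as the \cw{-path} variants do:

```shell
# Toy model of last-match-wins: walk the rules left to right and
# let the final matching rule decide each file's fate.
decide() {
  f=$1; shift
  verdict=include              # files matching no rule are indexed
  for rule in "$@"; do
    verb=${rule%%:*}
    pat=${rule#*:}
    case $f in ($pat) verdict=$verb ;; esac
  done
  printf '%s: %s\n' "$f" "$verdict"
}
decide ./music/one.mp3 'exclude:*' 'include:*.mp3'
decide ./notes.txt     'exclude:*' 'include:*.mp3'
```

Here the MP3 file ends up included (the later \cw{include} rule matched it last) while everything else stays excluded.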
\lcont{ For example, if you wanted to see only the disk space taken up by MP3 files, you might run \c $ agedu -s . --exclude '*' --include '*.mp3' \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb which will cause everything to be omitted from the scan, but then the MP3 files to be put back in. If you then wanted only a subset of those MP3s, you could then exclude some of them again by adding, say, \cq{--exclude-path './queen/*'} (or, more efficiently, \cq{--prune ./queen}) on the end of that command. As with the previous two options, \cw{--include-path} is similar to \cw{--include} except that the wildcard is matched against the entire pathname. } \dt \cw{--progress}, \cw{--no-progress} and \cw{--tty-progress} \dd When \cw{agedu} is scanning a directory tree, it will typically print a one-line progress report every second showing where it has reached in the scan, so you can have some idea of how much longer it will take. (Of course, it can't predict \e{exactly} how long it will take, since it doesn't know which of the directories it hasn't scanned yet will turn out to be huge.) \lcont{ By default, those progress reports are displayed on \cw{agedu}'s standard error channel, if that channel points to a terminal device. If you need to manually enable or disable them, you can use the above three options to do so: \cw{--progress} unconditionally enables the progress reports, \cw{--no-progress} unconditionally disables them, and \cw{--tty-progress} reverts to the default behaviour which is conditional on standard error being a terminal. } \dt \cw{--dir-atime} and \cw{--no-dir-atime} \dd In normal operation, \cw{agedu} ignores the atimes (last access times) on the \e{directories} it scans: it only pays attention to the atimes of the \e{files} inside those directories. 
This is because directory atimes tend to be reset by a lot of system administrative tasks, such as \cw{cron} jobs which scan the file system for one reason or another \dash or even other invocations of \cw{agedu} itself, though it tries to avoid modifying any atimes if possible. So the literal atimes on directories are typically not representative of how long ago the data in question was last accessed with real intent to use that data in particular. \lcont{ Instead, \cw{agedu} makes up a fake atime for every directory it scans, which is equal to the newest atime of any file in or below that directory (or the directory's last \e{modification} time, whichever is newest). This is based on the assumption that all \e{important} accesses to directories are actually accesses to the files inside those directories, so that when any file is accessed all the directories on the path leading to it should be considered to have been accessed as well. In unusual cases it is possible that a directory itself might embody important data which is accessed by reading the directory. In that situation, \cw{agedu}'s atime-faking policy will misreport the directory as disused. In the unlikely event that such directories form a significant part of your disk space usage, you might want to turn off the faking. The \cw{--dir-atime} option does this: it causes the disk scan to read the original atimes of the directories it scans. The faking of atimes on directories also requires a processing pass over the index file after the main disk scan is complete. \cw{--dir-atime} also turns this pass off. Hence, this option affects the \cw{-L} option as well as \cw{-s} and \cw{-S}. (The previous section mentioned that there might be subtle differences between the output of \cw{agedu -s /path -D} and \cw{agedu -S /path}. This is why. 
Doing a scan with \cw{-s} and then dumping it with \cw{-D} will dump the fully faked atimes on the directories, whereas doing a scan-to-dump with \cw{-S} will dump only \e{partially} faked atimes \dash specifically, each directory's last modification time \dash since the subsequent processing pass will not have had a chance to take place. However, loading either of the resulting dump files with \cw{-L} will perform the atime-faking processing pass, leading to the same data in the index file in each case. In normal usage it should be safe to ignore all of this complexity.) } \dt \cw{--mtime} \dd This option causes \cw{agedu} to index files by their last modification time instead of their last access time. You might want to use this if your last access times were completely useless for some reason: for example, if you had recently searched every file on your system, the system would have lost all the information about what files you hadn't recently accessed before then. Using this option is liable to be less effective at finding genuinely wasted space than the normal mode (that is, it will be more likely to flag things as disused when they're not, so you will have more candidates to go through by hand looking for data you don't need), but may be better than nothing if your last-access times are unhelpful. \lcont{ Another use for this mode might be to find \e{recently created} large data. If your disk has been gradually filling up for years, the default mode of \cw{agedu} will let you find unused data to delete; but if you know your disk had plenty of space recently and now it's suddenly full, and you suspect that some rogue program has left a large core dump or output file, then \cw{agedu --mtime} might be a convenient way to locate the culprit. } \dt \cw{--logicalsize} \dd This option causes \cw{agedu} to consider the size of each file to be its \q{logical} size, rather than the amount of space it consumes on disk. 
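The gap between the two notions of size is easy to demonstrate with a sparse file. This sketch assumes GNU coreutils (\cw{truncate}, \cw{stat}) and a filesystem that supports holes; \cw{agedu} reads the same two fields via \cw{stat}(\e{2}):

```shell
# A sparse file has a large logical size (st_size) but, on typical
# Linux filesystems, no allocated data blocks (st_blocks near 0).
cd "$(mktemp -d)"
truncate -s 1M sparse.dat          # GNU coreutils
stat -c 'logical=%s 512-byte-blocks=%b' sparse.dat
```

The logical size reported is exactly 1048576 bytes, while the block count stays at or near zero until real data is written.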
(That is, it will use \c{st_size} instead of \c{st_blocks} in the data returned from \cw{stat}(\e{2}).) This option makes \cw{agedu} less accurate at reporting how much of your disk is used, but it might be useful in specialist cases, such as working around a file system that is misreporting physical sizes. \lcont{ For most files, the physical size of a file will be larger than the logical size, reflecting the fact that filesystem layouts generally allocate a whole number of blocks of the disk to each file, so some space is wasted at the end of the last block. So counting only the logical file size will typically cause under-reporting of the disk usage (perhaps \e{large} under-reporting in the case of a very large number of very small files). On the other hand, sometimes a file with a very large logical size can have \q{holes} where no data is actually stored, in which case using the logical size of the file will \e{over}-report its disk usage. So the use of logical sizes can give wrong answers in both directions. } The following option affects all the modes that generate reports: the web server mode \cw{-w}, the stand-alone HTML generation mode \cw{-H} and the text report mode \cw{-t}. \dt \cw{--files} \dd This option causes \cw{agedu}'s reports to list the individual files in each directory, instead of just giving a combined report for everything that's not in a subdirectory. The following option affects the text report mode \cw{-t}. \dt \cw{-a} \e{age} or \cw{--age} \e{age} \dd This option tells \cw{agedu} to report only files of at least the specified age. An age is specified as a number, followed by one of \cq{y} (years), \cq{m} (months), \cq{w} (weeks) or \cq{d} (days). (This syntax is also used by the \cw{-r} option.) For example, \cw{-a 6m} will produce a text report which includes only files at least six months old. The following options affect the stand-alone HTML generation mode \cw{-H} and the text report mode \cw{-t}. 
\dt \cw{-d} \e{depth} or \cw{--depth} \e{depth} \dd This option controls the maximum depth to which \cw{agedu} recurses when generating a text or HTML report. \lcont{ In text mode, the default is 1, meaning that the report will include the directory given on the command line and all of its immediate subdirectories. A depth of two includes another level below that, and so on; a depth of zero means \e{only} the directory on the command line. In HTML mode, specifying this option switches \cw{agedu} from writing out a single HTML file to writing out multiple files which link to each other. A depth of 1 means \cw{agedu} will write out an HTML file for the given directory and also one for each of its immediate subdirectories. If you want \cw{agedu} to recurse as deeply as possible, give the special word \cq{max} as an argument to \cw{-d}. } \dt \cw{-o} \e{filename} or \cw{--output} \e{filename} \dd This option is used to specify an output file for \cw{agedu} to write its report to. In text mode or single-file HTML mode, the argument is treated as the name of a file. In multiple-file HTML mode, the argument is treated as the name of a directory: the directory will be created if it does not already exist, and the output HTML files will be created inside it. The following option affects only the stand-alone HTML generation mode \cw{-H}, and even then, only in recursive mode (with \cw{-d}): \dt \cw{--numeric} \dd This option tells \cw{agedu} to name most of its output HTML files numerically. The root of the whole output file collection will still be called \cw{index.html}, but all the rest will have names like \cw{73.html} or \cw{12525.html}. (The numbers are essentially arbitrary; in fact, they're indices of nodes in the data structure used by \cw{agedu}'s index file.) \lcont{ This system of file naming is less intuitive than the default of naming files after the sub-pathname they index.
It's also less stable: the same pathname will not necessarily be represented by the same filename if \cw{agedu -H} is re-run after another scan of the same directory tree. However, it does have the virtue that it keeps the filenames \e{short}, so that even if your directory tree is very deep, the output HTML files won't exceed any OS limit on filename length. } The following options affect the web server mode \cw{-w}, and in some cases also the stand-alone HTML generation mode \cw{-H}: \dt \cw{-r} \e{age range} or \cw{--age-range} \e{age range} \dd The HTML reports produced by \cw{agedu} use a range of colours to indicate how long ago data was last accessed, running from red (representing the most disused data) to green (representing the newest). By default, the lengths of time represented by the two ends of that spectrum are chosen by examining the data file to see what range of ages appears in it. However, you might want to set your own limits, and you can do this using \cw{-r}. \lcont{ The argument to \cw{-r} consists of a single age, or two ages separated by a minus sign. An age is a number, followed by one of \cq{y} (years), \cq{m} (months), \cq{w} (weeks) or \cq{d} (days). (This syntax is also used by the \cw{-a} option.) The first age in the range represents the oldest data, and will be coloured red in the HTML; the second age represents the newest, coloured green. If the second age is not specified, it will default to zero (so that green means data which has been accessed \e{just now}). For example, \cw{-r 2y} will mark data in red if it has been unused for two years or more, and green if it has been accessed just now. \cw{-r 2y-3m} will similarly mark data red if it has been unused for two years or more, but will mark it green if it has been accessed three months ago or later. } \dt \cw{--address} \e{addr}[\cw{:}\e{port}] \dd Specifies the network address and port number on which \cw{agedu} should listen when running its web server. 
If you want \cw{agedu} to listen for connections coming in from any source, specify the address as the special value \cw{ANY}. If the port number is omitted, an arbitrary unused port will be chosen for you and displayed. \lcont{ If you specify this option, \cw{agedu} will not print its URL on standard output (since you are expected to know what address you told it to listen to). } \dt \cw{--auth} \e{auth-type} \dd Specifies how \cw{agedu} should control access to the web pages it serves. The options are as follows: \lcont{ \dt \cw{magic} \dd This option only works on Linux, and only when the incoming connection is from the same machine that \cw{agedu} is running on. On Linux, the special file \cw{/proc/net/tcp} contains a list of network connections currently known to the operating system kernel, including which user id created them. So \cw{agedu} will look up each incoming connection in that file, and allow access if it comes from the same user id under which \cw{agedu} itself is running. Therefore, in \cw{agedu}'s normal web server mode, you can safely run it on a multi-user machine and no other user will be able to read data out of your index file. \dt \cw{basic} \dd In this mode, \cw{agedu} will use HTTP Basic authentication: the user will have to provide a username and password via their browser. \cw{agedu} will normally make up a username and password for the purpose, but you can specify your own; see below. \dt \cw{none} \dd In this mode, the web server is unauthenticated: anyone connecting to it has full access to the reports generated by \cw{agedu}. Do not do this unless there is nothing confidential at all in your index file, or unless you are certain that nobody but you can run processes on your computer. \dt \cw{default} \dd This is the default mode if you do not specify one of the above. 
In this mode, \cw{agedu} will attempt to use Linux magic authentication, but if it detects at startup time that \cw{/proc/net/tcp} is absent or non-functional then it will fall back to using HTTP Basic authentication and invent a user name and password. } \dt \cw{--auth-file} \e{filename} or \cw{--auth-fd} \e{fd} \dd When \cw{agedu} is using HTTP Basic authentication, these options allow you to specify your own user name and password. If you specify \cw{--auth-file}, these will be read from the specified file; if you specify \cw{--auth-fd} they will instead be read from a given file descriptor which you should have arranged to pass to \cw{agedu}. In either case, the authentication details should consist of the username, followed by a colon, followed by the password, followed \e{immediately} by end of file (no trailing newline, or else it will be considered part of the password). \dt \cw{--title} \e{title} \dd Specify the string that appears at the start of the \cw{<title>} section of the output HTML pages. The default is \cq{agedu}. This title is followed by a colon and then the path you're viewing within the index file. You might use this option if you were serving \cw{agedu} reports for several different servers and wanted to make it clearer which one a user was looking at. \dt \cw{--launch} \e{shell-command} \dd Specify a command to be run with the base URL of the web user interface, once the web server has started up. The command will be interpreted by \cw{/bin/sh}, and the base URL will be appended to it as an extra argument word. \lcont{ A typical use for this would be \cq{--launch=browse}, which uses the XDG \cq{browse} command to automatically open the \cw{agedu} web interface in your default browser. However, other uses are possible: for example, you could provide a command which communicates the URL to some other software that will use it for something.
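Returning to the \cw{--auth-file} format described above: the no-trailing-newline requirement is easy to get wrong with \cw{echo}. A sketch, in which the file name \cw{auth.txt} and the credentials are invented for illustration:

```shell
# A valid --auth-file holds username, colon, password, and nothing
# else. printf(1) adds no trailing newline; echo would append one,
# and that newline would become part of the password.
cd "$(mktemp -d)"
printf '%s' 'fred:s3cret' > auth.txt
wc -c < auth.txt                   # 11 bytes: newline-free
```

You might then start the server with something like \cw{agedu -w --auth basic --auth-file auth.txt}.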
} \dt \cw{--no-eof} \dd Stop \cw{agedu} in web server mode from looking for end-of-file on standard input and treating it as a signal to terminate. \U LIMITATIONS The data file is pretty large. The core of \cw{agedu} is the tree-based data structure it uses in its index in order to efficiently perform the queries it needs; this data structure requires \cw{O(N log N)} storage. This is larger than you might expect; a scan of my own home directory, containing half a million files and directories and about 20Gb of data, produced an index file over 60Mb in size. Furthermore, since the data file must be memory-mapped during most processing, it can never grow larger than available address space, so a \e{really} big filesystem may need to be indexed on a 64-bit computer. (This is one reason for the existence of the \cw{-D} and \cw{-L} options: you can do the scanning on the machine with access to the filesystem, and the indexing on a machine big enough to handle it.) The data structure also does not usefully permit access control within the data file, so it would be difficult \dash even given the willingness to do additional coding \dash to run a system-wide \cw{agedu} scan on a \cw{cron} job and serve the right subset of reports to each user. In certain circumstances, \cw{agedu} can report false positives (reporting files as disused which are in fact in use) as well as the more benign false negatives (reporting files as in use which are not). This arises when a file is, semantically speaking, \q{read} without actually being physically \e{read}. Typically this occurs when a program checks whether the file's mtime has changed and only bothers re-reading it if it has; programs which do this include \cw{rsync}(\e{1}) and \cw{make}(\e{1}). 
Such programs will fail to update the atime of unmodified files despite depending on their continued existence; a directory full of such files will be reported as disused by \cw{agedu} even in situations where deleting them will cause trouble. Finally, of course, \cw{agedu}'s normal usage mode depends critically on the OS providing last-access times which are at least approximately right. So a file system mounted with Linux's \cq{noatime} option, or the equivalent on any other OS, will not give useful results! (However, the Linux mount option \cq{relatime}, which distributions now tend to use by default, should be fine for all but specialist purposes: it reduces the accuracy of last-access times so that they might be wrong by up to 24 hours, but if you're looking for files that have been unused for months or years, that's not a problem.) \U LICENCE \cw{agedu} is free software, distributed under the MIT licence. Type \cw{agedu --licence} to see the full licence text. \versionid agedu version 20211129.8cd63c5
�simon���������������������������simon������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������TODO list for agedu =================== - flexibility in the HTML report output mode: expose the internal mechanism for configuring the output filenames, and allow the user to request individual files with hyperlinks as if the other files existed. (In particular, functionality of this kind would enable other modes of use like the built-in --cgi mode, without me having to anticipate them in detail.) - non-ASCII character set support + could usefully apply to --title and also to file names + how do we determine the input charset? Via locale, presumably. + how do we do translation? Importing my charset library is one heavyweight option; alternatively, does the native C locale mechanism provide enough functionality to do the job by itself? + in HTML, we would need to decide on an _output_ character set, specify it in a <meta http-equiv> tag, and translate to it from the input locale - one option is to make the output charset the same as the input one, in which case all we need is to identify its name for the <meta> tag - the other option is to make the output charset UTF-8 always and translate to that from everything else - in the web server and CGI modes, it would probably be nicer to move that <meta> tag into a proper HTTP header + even in text mode we would want to parse the filenames in some fashion, due to the unhelpful corner case of Shift-JIS Windows (in which backslashes in the input string must be classified as path separators or the second byte of a two-byte character) - that's really painful, since it will impact string processing of filenames throughout the code - so perhaps a better approach would be to do locale processing of filenames at _scan_ time, and normalise to UTF-8 in both the index and dump files? 
          + involves incrementing the version of the dump-file format
          + then paths given on the command line are translated quickly
            to UTF-8 before comparing them against index paths
          + and now the HTML output side becomes easy, though the text
            output involves translating back again
          + but what if the filenames aren't intended to be interpreted
            in any particular character set (old-style Unix semantics)
            or in a consistent one?

 - we could still be using more of the information coming from
   autoconf. Our config.h is defining a whole bunch of HAVE_FOOs for
   particular functions (e.g. HAVE_INET_NTOA, HAVE_MEMCHR,
   HAVE_FNMATCH). We could usefully supply alternatives for some of
   these functions (e.g. cannibalise the PuTTY wildcard matcher for
   use in the absence of fnmatch, or switch to vanilla truncate() in
   the absence of ftruncate); where we don't have alternative code, it
   would perhaps be polite to throw an error at configure time rather
   than allowing the subsequent build to fail.
    + however, I don't see anything here that looks very
      controversial; IIRC it's all in POSIX, for one thing. So more
      likely this should simply wait until somebody complains.

 - run-time configuration in the HTTP server
    * I think this probably works by having a configuration form, or a
      link pointing to one, somewhere on the report page. If you want
      to reconfigure anything, you fill in and submit the form; the
      web server receives HTTP GET with parameters and a referer,
      adjusts its internal configuration, and returns an HTTP redirect
      back to the referring page - which it then re-renders in
      accordance with the change.
    * All the same options should have their starting states
      configurable on the command line too.

 - curses-ish equivalent of the web output
    + try using xterm 256-colour mode. Can (n)curses handle that? If
      not, try doing it manually.
    + I think my current best idea is to bypass ncurses and go
      straight to terminfo: generate lines of attribute-interleaved
      text and display them, so we only really need the sequences "go
      here and display stuff", "scroll up", "scroll down".
    + Infrastructure work before doing any of this would be to split
      html.c into two: one part to prepare an abstract data structure
      describing an HTML-like report (in particular, all the index
      lookups, percentage calculation, vector arithmetic and line
      sorting), and another part to generate the literal HTML. Then
      the former can be reused to produce very similar reports in
      coloured plain text.

 - abstracting away all the Unix calls so as to enable a full Windows
   port. We can already do the difficult bit on Windows (scanning the
   filesystem and retrieving atime-analogues). Everything else is just
   coding - albeit quite a _lot_ of coding, since the Unix assumptions
   are woven quite tightly into the current code.
    + If nothing else, it's unclear what the user interface properly
      ought to be in a Windows port of agedu. A command-line job
      exactly like the Unix version might be useful to some people,
      but would certainly be strange and confusing to others.

 - replacing the index data structure with a layered range tree.
    + The basic idea of the data structure:
       * store a search tree ordered by atime, in which every node
         stores an array of pairs (pathname index, cumulative size)
         describing the data in that node's subtree, ordered by the
         pathname index.
       * Also, alongside each element in a node's array is the index
         of the corresponding location in the sort order in the array
         at each child node.
    + To search the tree for a given (path index, atime) query point:
       * Start by doing a log-time search for the path index in the
         array at the root node.
       * Now, every time you descend to a child node, you can use the
         crosslinks between the arrays to find the corresponding
         search position in the new node's array, in constant time.
       * So your overall search still takes log time, and from every
         node you can read off the total size of stuff in that subtree
         less than the query path index, which is enough to find the
         total size of stuff in the desired quadrant.
    + This also enables a 3-sided query (i.e. specifying both start
      and end pathname index) in a single operation: just start by
      finding _both_ your pathname index endpoints in the root node's
      array, and then as you descend the tree you can read off the
      total size of stuff in each subtree _between_ those two
      endpoints, by subtracting the two cumulative sizes.
       * For the existing kind of query, this is only a minor
         optimisation, and doesn't affect the asymptotic performance
         of the query.
       * But the advantage is that it also lets you do an inverse
         lookup, of the form 'Given a pathname interval (subdirectory)
         and a target *size* X, find an atime T such that the amount
         of data in that subdir older than T is X.' In the current
         agedu data structure that takes log^2 time, because you have
         to do a log-time query for each step of a binary search over
         candidate atimes; but if you do a 3-sided search in this
         range tree system, you can use the subtree sizes instead of
         the tree sort order to decide which direction to go as you
         descend the tree, just like querying an ordinary annotated
         tree.
       * I think this inverse lookup could be helpful for speeding up
         the HTML report generation. The normal query answers the
         question 'Where should the boundary be between these two
         colour values?', and agedu runs that query once per output
         colour. But the inverse query answers 'What colour should
         _this pixel_ in the output bar be?', so for very small bars
         (of which there will be a lot), inverse queries would surely
         get the needed information faster.
          - Also, I quite like the idea of a hybrid strategy:
             + for a bar with fewer pixels than colours, determine the
               colour of its middle pixel by an inverse lookup, then
               recurse into each half
             + but if there are more pixels than colours, determine
               the location of the middle colour by a forward lookup,
               and recurse similarly
             + and at each level of the recursion, pick one of those
               strategies as appropriate.
    + Without having actually implemented it in practice, my guess is
      that this structure as described above is comparable in both
      space and time to the current one. It's still N log N space in
      principle; you still get to optimise to some degree by
      disregarding uninteresting differences between pathname indices
      (in this case, by compressing the arrays at each node, rather
      than by discarding intermediate states of the evolving tree);
      and it still takes log time to answer the existing kind of
      query, namely 'what is the total size of stuff in this subdir
      older than this atime?'.
    + There's also a way to compress the index to a smaller size, at
      the cost of query time, by replacing the array at each tree node
      with a bit-vector that just says which child each logical array
      element came from, and storing a cumulative element count and
      cumulative size per _word_ of the bit-vector instead of a
      cumulative size for every element. This means that at every step
      down the tree you can only compute the cumulative size
      approximately from the data actually stored at that tree node,
      and to fix up the errors you have to look at the two partial
      words at each end of your search and chase each individual
      element down the tree to the bottom to find its exact size.
      Done naively, this gives you O(log^3 n) lookup time (at each of
      the log n steps down the tree, there are log n extra points you
      have to sort out, each of which needs another log n pass to the
      bottom of the tree); you can improve on that by storing extra
      indices (also bit-packed) that allow you to jump multiple levels
      down the tree, permitting an assortment of tradeoffs that let
      you _almost_ get rid of a log factor - you can have O(n) space
      and O(n log^{2+e} n) for e as small as you want (paying for a
      smaller e with a larger constant factor), or conversely
      O(n^{1+e}) space and exactly O(n log^2 n) query time, or the
      in-between compromise of O(n log log n) space and
      O(n log^2 n log log n) query time.
    + source: "A Functional Approach to Data Structures and Its Use in
      Multidimensional Searching", B. Chazelle, SIAM J. Comput. 17
      (1988), 427-462
    + and I _think_ it should in principle be possible to vary the
      frequency of cumulative counts/sizes so as to trade off space
      and time smoothly between that compressed representation and the
      full layered range tree.

 - A user requested what's essentially a VFS layer: given multiple
   index files and a map of how they fit into an overall namespace, we
   should be able to construct the right answers for any query about
   the resulting aggregated hierarchy by doing at most O(number of
   indexes * normal number of queries) work.

 - Support for filtering the scan by ownership and permissions. The
   index data structure can't handle this, so we can't build a single
   index file admitting multiple subset views; but a user suggested
   that the scan phase could record information about ownership and
   permissions in the dump file, and then the indexing phase could
   filter down to a particular sub-view - which would at least allow
   the construction of various subset indices from one dump file,
   without having to redo the full disk scan which is the most
   time-consuming part of all.
[file: agedu-20211129.8cd63c5/LICENCE]

agedu is copyright 2008 Simon Tatham. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
[file: agedu-20211129.8cd63c5/CMakeLists.txt]

cmake_minimum_required(VERSION 3.7)
project(agedu LANGUAGES C)

if(NOT CMAKE_SYSTEM_NAME MATCHES "Windows")
  # On Unix, build the main agedu program.
  set(AGEDU_IPV6 ON CACHE BOOL "Build agedu with IPv6 support if possible")
  set(AGEDU_IPV4 ON CACHE BOOL "Build agedu with IPv4 support if possible")

  add_executable(agedu
    agedu.c du.c alloc.c trie.c index.c html.c httpd.c fgetline.c
    licence.c dumpfile.c)

  include(CheckIncludeFile)
  check_include_file(arpa/inet.h HAVE_ARPA_INET_H)
  check_include_file(assert.h HAVE_ASSERT_H)
  check_include_file(ctype.h HAVE_CTYPE_H)
  check_include_file(dirent.h HAVE_DIRENT_H)
  check_include_file(errno.h HAVE_ERRNO_H)
  check_include_file(fcntl.h HAVE_FCNTL_H)
  check_include_file(features.h HAVE_FEATURES_H)
  check_include_file(fnmatch.h HAVE_FNMATCH_H)
  check_include_file(limits.h HAVE_LIMITS_H)
  check_include_file(ndir.h HAVE_NDIR_H)
  check_include_file(netdb.h HAVE_NETDB_H)
  check_include_file(netinet/in.h HAVE_NETINET_IN_H)
  check_include_file(pwd.h HAVE_PWD_H)
  check_include_file(stdarg.h HAVE_STDARG_H)
  check_include_file(stdbool.h HAVE_STDBOOL_H)
  check_include_file(stddef.h HAVE_STDDEF_H)
  check_include_file(stdint.h HAVE_STDINT_H)
  check_include_file(stdio.h HAVE_STDIO_H)
  check_include_file(stdlib.h HAVE_STDLIB_H)
  check_include_file(string.h HAVE_STRING_H)
  check_include_file(sys/dir.h HAVE_SYS_DIR_H)
  check_include_file(sys/ioctl.h HAVE_SYS_IOCTL_H)
  check_include_file(sys/mman.h HAVE_SYS_MMAN_H)
  check_include_file(sys/ndir.h HAVE_SYS_NDIR_H)
  check_include_file(sys/select.h HAVE_SYS_SELECT_H)
  check_include_file(sys/socket.h HAVE_SYS_SOCKET_H)
  check_include_file(sys/stat.h HAVE_SYS_STAT_H)
  check_include_file(sys/types.h HAVE_SYS_TYPES_H)
  check_include_file(sys/wait.h HAVE_SYS_WAIT_H)
  check_include_file(syslog.h HAVE_SYSLOG_H)
  check_include_file(termios.h HAVE_TERMIOS_H)
  check_include_file(time.h HAVE_TIME_H)
  check_include_file(unistd.h HAVE_UNISTD_H)

  include(CheckSymbolExists)
  check_symbol_exists(fdopendir "sys/types.h;dirent.h" HAVE_FDOPENDIR)
  check_symbol_exists(getaddrinfo "sys/types.h;sys/socket.h;netdb.h"
    HAVE_GETADDRINFO)
  check_symbol_exists(gethostbyname "netdb.h" HAVE_GETHOSTBYNAME)
  check_symbol_exists(lstat64 "sys/types.h;sys/stat.h;unistd.h" HAVE_LSTAT64)
  check_symbol_exists(stat64 "sys/types.h;sys/stat.h;unistd.h" HAVE_STAT64)
  check_symbol_exists(strsignal "string.h" HAVE_STRSIGNAL)

  set(GENERATED_SOURCES_DIR ${CMAKE_CURRENT_BINARY_DIR}${CMAKE_FILES_DIRECTORY})
  include_directories(${GENERATED_SOURCES_DIR})
  configure_file(cmake.h.in ${GENERATED_SOURCES_DIR}/cmake.h)

  # If Halibut is available, build the docs too.
  find_program(HALIBUT halibut)
  if(HALIBUT)
    set(BUILD_MANPAGE ON)
    add_custom_command(OUTPUT agedu.1
      COMMAND ${HALIBUT} --man=agedu.1 ${CMAKE_CURRENT_SOURCE_DIR}/agedu.but
      DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/agedu.but)
    add_custom_target(doc ALL DEPENDS agedu.1)
  else()
    set(BUILD_MANPAGE OFF)
  endif()

  # Installation
  include(GNUInstallDirs)
  if(CMAKE_VERSION VERSION_LESS 3.14)
    # CMake 3.13 and earlier required an explicit install destination.
    install(TARGETS agedu RUNTIME DESTINATION bin)
  else()
    # 3.14 and above selects a sensible default, which we should avoid
    # overriding here so that end users can override it using
    # CMAKE_INSTALL_BINDIR.
    install(TARGETS agedu)
  endif()

  if(BUILD_MANPAGE)
    install(FILES ${CMAKE_BINARY_DIR}/agedu.1
      DESTINATION ${CMAKE_INSTALL_MANDIR}/man1)
  elseif(EXISTS ${CMAKE_SOURCE_DIR}/agedu.1)
    # If we weren't able to build the man page from source, but we are
    # building from a source tarball in which a pre-built man page is
    # provided, we can install that.
    install(FILES ${CMAKE_SOURCE_DIR}/agedu.1
      DESTINATION ${CMAKE_INSTALL_MANDIR}/man1)
  endif()
endif()

if(CMAKE_SYSTEM_NAME MATCHES "Windows")
  # On Windows, build ageduscan.exe.
  add_compile_definitions(_CRT_SECURE_NO_WARNINGS)
  add_executable(ageduscan winscan.c)
endif()

[file: agedu-20211129.8cd63c5/Buildscr]

# -*- sh -*-
#
# bob script to build the agedu tarball.

module agedu

set Version $(!builddate).$(vcsid)

# use perl to avoid inconsistent behaviour of echo '\v'
in agedu do perl -e 'print "$#define AGEDU_VERSION \"$$ARGV[0]\"\n"' $(Version) >> version.h
in agedu do perl -e 'print "\n\\versionid agedu version $$ARGV[0]\n"' $(Version) >> agedu.but

# Build the man page. While we're in there, we also ensure agedu
# builds, and runs its help.
in . do mkdir docbuild
in docbuild do cmake ../agedu
in docbuild do make -j$(nproc) VERBOSE=1
in docbuild do ./agedu --help

in . do cp -R agedu agedu-$(Version)
in . do cp -R docbuild/agedu.1 agedu-$(Version)
in . do tar chzvf agedu-$(Version).tar.gz agedu-$(Version)

in agedu do halibut --html=manpage.html agedu.but
in agedu do halibut --html=tree.html tree.but

deliver agedu-$(Version).tar.gz $@
deliver agedu/manpage.html $@
deliver agedu/tree.html $@

in . do mkdir winbuild
in winbuild do cmake ../agedu -DCMAKE_TOOLCHAIN_FILE=$(cmake_toolchain_clangcl64) -DCMAKE_BUILD_TYPE=Release -DCMAKE_MSVC_RUNTIME_LIBRARY=MultiThreaded -DCMAKE_C_FLAGS_RELEASE="/MT /O2"
in winbuild do make -j$(nproc) VERBOSE=1

# Code-sign this Windows binary, if the local bob config provides a
# script to do so. We assume here that the script accepts an -i
# option to provide a 'more info' URL, and that it signs the file in
# place.
ifneq "$(cross_winsigncode)" "" in winbuild do $(cross_winsigncode) -i https://www.chiark.greenend.org.uk/~sgtatham/agedu/ ageduscan.exe
deliver winbuild/ageduscan.exe $@

[file: agedu-20211129.8cd63c5/.gitignore]

/*.o
/*.d
/agedu
/*.dat
/*.dump
/agedu.1
/Makefile.local
/install-sh
/aclocal.m4
/configure.in
/configure
/Makefile.in
/config.h.in
/Makefile
/depcomp
/missing
/stamp-h1
/compile
/autom4te.cache
/build.log
/build.out
/config.status
/config.log
/config.h
/manpage.html
/tree.html
/.deps
/*.exe
/*.obj