collectl-4.3.1/collectl.conf

# Copyright 2003-2009 Hewlett-Packard Development Company, LP

# Like most Linux configuration files, this specifies a set of user controllable
# parameters. In many cases these are commented out, which simply means those
# are the default values already used by collectl. To change a value,
# uncomment the line and change it. To revert back to the default, all you
# need do is recomment the line.

############################
#  daemon/service handling
############################

# When someone specifies a daemon is to be run, typically but not limited
# to running collectl as a service, this string will cause the associated
# values to be used. They CAN also be overridden via a command line switch.
# In other words, if DaemonCommands is set to '-s cdm' and collectl is
# invoked with -D, it will process subsystems 'cdm'. However, if it is invoked
# with '-D -s mnp' it will process subsystems 'mnp'; there is no combining the
# two sets of values. Be sure to include any switches a daemon is required to
# have, such as -f and either -r or -R.
# NOTE - if things aren't behaving as expected, you can always try running
# collectl in non-daemon mode just to see if there are any error messages. If
# you include the -m switch, you can also look in the collectl log, which is
# stored in the logging directory.
DaemonCommands = -f /var/log/collectl -r00:00,7 -m -F60 -s+YZ

# This defines the location to look for all additional required files if formatit.ph
# is not in the same directory as collectl itself.
#ReqDir = /usr/share/collectl

# E x t r a   L i b r a r i e s
# So far this has only been used during development, but if there are extra
# library locations that should be 'used', put them here.
#Libraries =

# S t a n d a r d   U t i l i t i e s
# Note that by default collectl will look for lspci in both /sbin and
# /usr/sbin, but if listed here will only look in that one place.
#Grep =  /bin/grep
#Egrep = /bin/egrep
#Ps =    /bin/ps
#Rpm =   /bin/rpm
#Lspci = /sbin/lspci
#Lctl =  /usr/sbin/lctl

# I n f i n i b a n d   S u p p o r t
# Collectl will assume open fabric and will attempt to use the perfquery
# utility to get the counters. If not there, it assumes Voltaire and will
# first look in /proc/voltaire/adaptor-mlx/stats and failing that will use
# the get_pcounter utility. Since collectl resets IB counters in the
# hardware, you can disable its collection by commenting out the appropriate
# variable below: PQuery for OFED, PCounter for get_pcounter calls and
# VStat for ALL non-ofed access of any kind.
PQuery =   /usr/sbin/perfquery:/usr/bin/perfquery:/usr/local/ofed/bin/perfquery
PCounter = /usr/mellanox/bin/get_pcounter
VStat =    /usr/mellanox/bin/vstat:/usr/bin/vstat
OfedInfo = /usr/bin/ofed_info:/usr/local/ofed/bin/ofed_info

# D e f a u l t s
# This set of variables are actually all set in collectl and you need not
# change them.

# This parameter controls subsystem selection. The 'core' subsystems are
# selected when the user omits the -s switch OR uses the '+' or '-' to
# add/remove from that list. Note that changing this will also change the
# default for -s displayed in help.
#SubsysCore = bcdfijlmnstx

# Although these can all be overridden by switches, they're assumed to
# always be defined so don't remove or comment any of them out!
# Over time more may be added.
#Interval =  10
#Interval2 = 60
#Interval3 = 120

# These are SFS lustre specific. When using the -OD switch, any partitions
# found to be smaller than LustreSvcLunMax, which is in GB, will be ignored.
# When displaying data in verbose mode, only LustreMaxBlkSize will be
# displayed, but ALL block sizes will be read and recorded.
#LustreSvcLunMax = 10
#LustreMaxBlkSize = 512

# By default, we check at these frequencies to see if lustre or interconnect
# configurations have changed. Things are efficient enough that now we can
# check for lustre changes every polling interval, but I'm leaving the code
# in place rather than remove it in case it's needed again in the future.
#LustreConfigInt = 1
#InterConnectInt = 900

# These apply to disk/partition limits for exception (-o x/X) processing
#LimSVC = 30          # Minimum partition Avg Service time
#LimIOS = 10          # Minimum number of Disk OR Partition I/Os
#LimBool = 0          # generate exception record if EITHER limit exceeded
#LimLusKBS = 100      # Minimum number of Lustre OSS KB/sec
#LimLusReints = 1000  # Minimum number of Lustre MDS Reint operations

# Socket I/O Defaults
#Port =    2655
#Timeout = 10

# Maximum allowable zlib errors in a single day or run.
#MaxZlibErrors = 20

# To disable bogus network data checking, set this to any negative value
#DefNetSpeed=10000

# Collectl will automatically size the frequency of headers in 'brief format'
# to the height of your display window, which it determines using the resize
# utility. If that utility can't be found, it will use the height specified
# in 'TermHeight'. If 'resize' is in your path but you want a fixed/different
# size, comment out the Resize line and uncomment TermHeight, setting it to
# what you want.
#TermHeight = 24
Resize=/usr/bin/resize:/usr/X11R6/bin/resize

# To turn off Time::HiRes/glibc incompatibility checking, the following
# should be enabled and set to 0
#TimeHiResCheck = 1

# These control environmental monitoring and to use it you MUST have ipmitool
# installed (see http://ipmitool.sourceforge.net/). If not in the path shown
# below, you must change it.
Ipmitool =  /usr/bin/ipmitool:/usr/local/bin/ipmitool:/opt/hptc/sbin/ipmitool
IpmiCache = /var/run/collectl-ipmicache
IpmiTypes = fan,temp,current

# passwd file for UID to username mapping during process monitoring
#Passwd = /etc/passwd

# If a cciss device is reset (such as during a lun scan) while collectl is running,
# disk rates will be excessive. If one is seen above the following value, reset ALL
# stats for that disk to 0. To disable, set this to -1.
#DiskMaxValue=5000000

# When collectl reads disk data, it filters out any that don't match the DiskFilter,
# which by default looks for cciss, hd, sd, xvd, dm, emcpower and psv. All others are
# ignored. To change the filter, set the string below to those you want to keep BUT
# you need to know what a perl regular expression looks like or you may not get the
# desired results. CAUTION - white space is CRITICAL for this to work.
#DiskFilter = /hd[ab] | sd[a-z]+ |dm-\d+ |xvd[a-z] |fio[a-z]+ | vd[a-z]+ |emcpower[a-z]+ |psv\d+ |nvme\d+n\d+ /

# Kernel Efficiency Test
# On kernels 2.6.32 forward (and you can't tell how distros patched them) there is a
# read inefficiency in the /proc filesystem for 4 and more sockets, and the only way
# to tell is to test it. If slow, generate a warning that patching the kernel may be
# recommended.
# To bypass the test/message, set the following to 'no'.
#ProcReadTest = yes

collectl-4.3.1/docs/SlowProc.html
The good news is newer RedHat and SUSE distros have been updated to mitigate this problem, specifically RHEL 6.2 and SLES SP1. As for other distros, I just don't have access to verify everything, so if you are running a different distro and can verify this problem has been resolved, I'd appreciate hearing which specific version addresses it so I can publish the news here.
So why should you even care about this? You may have high core counts and be running a kernel that has not yet been patched. While this probably won't have any impact on your running applications, do you ever run top? iostat? sar? or any other monitoring tools? If you're reading this you probably run collectl. Most monitoring tools are fairly light-weight and for a good reason - if you're trying to measure something you don't want the tool's overhead to get in the way. Unfortunately, with this regression it will now!
In the following example, you can see monitoring CPU data takes about 3 seconds to read almost 9K samples and write them to a file on a 2-socket/dual-core system. Very efficient!
time collectl -sc -i0 -c 8640 -f/tmp

real    0m2.879s
user    0m1.908s
sys     0m0.913s
Contrast that with a system exhibiting the slow /proc behavior, where reading only a tenth as many samples takes several times longer:

time collectl -sc -i0 -c 864 -f/tmp

real    0m16.783s
user    0m3.003s
sys     0m13.523s
Since a simple uname command will tell you your kernel version, you might think that's all it takes, but nothing is ever that simple because most vendors patch their kernels and you can't always be sure what code is actually running.
One simple way to tell for sure is to run the very simple test below, which times a read of /proc/stat (which seems to be the most heavily affected), using strace to see how much time is spent in the actual read.
The following is on my 2-socket/dual-core system:
strace -c cat /proc/stat>/dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000251         251         1           execve
  0.00    0.000000           0         3           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         4           open
  0.00    0.000000           0         5           close
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000251                      37         1 total
And here is the same test on a system with the slow /proc behavior - note the time now being spent in read:

strace -c cat /proc/stat >/dev/null

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.014997        4999         3           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0        20        16 open
  0.00    0.000000           0         6           close
  0.00    0.000000           0        12        10 stat
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.014997                      66        27 total
updated Jan 18, 2012
This data comes from /proc/buddyinfo, which reports the number of free memory chunks of each size by node and zone, and looks like this:

Node 0, zone      DMA      5      5      3      4      2      4      3      1      0      0      0
Node 0, zone    DMA32     79     61      4     12      0      0      0      2      0      1      0
Node 1, zone    DMA32    134     57     27     60      0      0      1      1      0      1      0
Node 2, zone   Normal    865    357     37      1      6      0      2      1      0      1      0
Node 3, zone   Normal    651     47     19     10      1      1      1      1      0      1      0
collectl -sb --verbose -oT

# MEMORY FRAGMENTATION SUMMARY (4K pages)
#             1     2     4     8    16    32    64   128   256   512  1024
16:11:26   1296   483   157    85     9     5     7     6     0     4     0
16:11:27   1354   485   163    87     9     5     7     6     0     4     0
16:11:28   1395   480   165    89     9     5     7     6     0     4     0
collectl -sB -oT

# MEMORY FRAGMENTATION (4K pages)
#          Node   Zone      1     2     4     8    16    32    64   128   256   512  1024
16:13:33      0    DMA      5     5     3     4     2     4     3     1     0     0     0
16:13:33      0  DMA32    175    97     3    12     0     0     0     2     0     1     0
16:13:33      1  DMA32    933   389    60    68     0     0     1     1     0     1     0
16:13:33      2 Normal      0     1     8     1     6     0     2     1     0     1     0
16:13:33      3 Normal      1     2    57    10     1     1     1     1     0     1     0
collectl -sb -oT

#         <---Memory-->
#Time       Fragments
16:24:13  qmji9576040
16:24:14  rmji9576040
16:24:15  smji9576040
16:24:16  smji9576040
16:24:17  rmji9576040
collectl -sbcmn -oT

#         <--------CPU--------><------------------Memory-----------------><----------Network---------->
#Time     cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map   Fragments   KBIn  PktIn  KBOut  PktOut
16:44:46    0   0  1029    146  23M 178M   6G   5G 461M 234M lljj9576040      2      8      0       2
16:44:47    0   0  1020    136  24M 178M   6G   5G 461M 234M nljj9576040      2      8      1       2
16:44:48    1   0  1062    371  22M 178M   6G   5G 461M 235M kljj9576040      3     31      2      27
16:44:49    0   0  1009    146  22M 178M   6G   5G 461M 235M kljj9576040     14     13      0       2
updated Feb 12, 2009
Let's say you have multiple devices such as disks and the one you're interested in is misbehaving and reading or writing too slowly. Won't the total disk activity also be low? Similarly, if disk traffic is being reported high on a lightly loaded system, won't this also jump out at you by simply looking at the total disk activity? The same can be said about networks and most other subsystems for which there is summary data, and simply looking at the totals will often alert you to the fact that something is not right. The key thing to keep in mind is that you are looking at totals.
CPU monitoring can be a little trickier as these are reported as averages as opposed to totals, and as the number of cores increases so does the divisor of the calculation. In most cases when you have a system with excessive load, it will affect all CPUs and so will be very visible even as an average, but in some cases it won't. What if you have a 2 core system and see a CPU load of 45% when you're expecting a much lighter load? Looking at individual CPUs you may see one running at near zero load and the second at 90%. Your only clue was that the 45% load was unexpected and so you looked closer. But what if you had a single heavily loaded CPU on a 48 core system? You'd never even realize it. In other words, just pay attention.
Stated slightly differently, summary data is often a starting point to help identify potential trouble areas, and from there you can determine if you need to dig deeper.
So why brief data?
If you have ever tried to look at multiple lines of different text and identify what was changing over time, you should already know the answer - it's really difficult! For example, here's what collectl might show for CPU, Disk and Network data:
collectl.pl --verbose

### RECORD    1 >>> poker <<< (1314712401.002) (Tue Aug 30 09:53:21 2011) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# User  Nice   Sys  Wait   IRQ  Soft Steal  Idle  CPUs  Intr  Ctxsw  Proc  RunQ   Run   Avg1  Avg5 Avg15
     0     0     0     0     0     0     0   100     4  1120    192     0   363     0   0.00  0.00  0.00

# DISK SUMMARY (/sec)
#KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
      0       0      0      0        0       0      0      0

# NETWORK SUMMARY (/sec)
# KBIn  PktIn SizeIn  MultI   CmpI  ErrsI  KBOut PktOut  SizeO   CmpO  ErrsO
     0      1     60      0      0      0      0      0      0      0      0

### RECORD    2 >>> poker <<< (1314712402.002) (Tue Aug 30 09:53:22 2011) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# User  Nice   Sys  Wait   IRQ  Soft Steal  Idle  CPUs  Intr  Ctxsw  Proc  RunQ   Run   Avg1  Avg5 Avg15
     0     0     0     0     0     0     0    99     4  1111    200     0   363     0   0.00  0.00  0.00

# DISK SUMMARY (/sec)
#KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
      0       0      0      0      256      59      5     51

# NETWORK SUMMARY (/sec)
# KBIn  PktIn SizeIn  MultI   CmpI  ErrsI  KBOut PktOut  SizeO   CmpO  ErrsO
     0      2     60      0      0      0      0      3    328      0      0
Now consider the fact that in many cases, seeing network errors or disk merges or even the percentage of time the CPU spent processing interrupts, while important, may not be what you need when trying to identify anomalous behaviors. And that's where brief mode comes in. Here we are identifying those few nuggets of information which will tell us whether or not things are functioning as expected, such that we can display them all on the same line and make it easier to spot change. In fact, during the following run I did a ping -f - see how easy it is to spot the network burst?
collectl

#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut PktOut
   0   0  1124    203      0      0    240      4      0      0      0      0
   0   0  1105    253      0      0     12      2      0      1      0      1
   0   0  1123    206      0      0      0      0      0      3      0      2
   2   1  6051   8584      0      0      0      0    173   2099    297   2860
   3   2  7828  11270      0      0      0      0    222   2770    411   3936
   0   0  1115    204      0      0     92      5      0      5      1      5
   0   0  1121    198      0      0      0      0      0      1      0      1
In summary, just keep in mind that there is no single recipe for how to monitor a system, what format to display the output in, or how to drill deeper. However, as you become more familiar with the types of data and collectl's formats, your ability to utilize collectl will increase.
updated Sept 19, 2011
In most cases this is of minimal interest unless you're trying to track down a specific socket related problem. In the case of a runaway process or someone opening but not closing sockets, this number has been seen to grow quite large and even consume all resources, causing a system crash, but those cases are pretty rare. In any event, during times of strange behavior it can't hurt to have a look at these numbers, if for no other reason than to rule out socket problems.
updated August 30, 2011
As its name implies, colmux is a collectl multiplexor, which allows one to collect data from multiple systems and treat it as a single data stream, essentially extending collectl's functionality to a set of hosts rather than a single one. Colmux has been tested on clusters of over 1000 nodes but one should also take note that this will put a heavier load on the system on which colmux is running.
Colmux runs in 2 distinct modes: Real-Time and Playback. In real-time mode, colmux actually communicates with instances of collectl running on remote systems which in turn are collecting real-time performance metrics. In playback mode colmux also communicates with a remote copy of collectl but in this case collectl is playing back a data file collected some time in the past.
Colmux can also provide its output in 2 distinct formats: single-line and multi-line. In single-line format colmux reports the multiplexed data from all systems on a single line by allowing the user to choose a small number of variables to display, based on both the display width and the number of systems. While it is possible to handle more than a couple of dozen systems (see the example at the bottom of this page), one rarely does so because of the screen width. However, it is also possible to redirect the output to a file for off-line viewing via a text editor or a spreadsheet.
Colmux has been extensively tested with versions of collectl from V3.3.6 forward and there have been some additional enhancements made for V3.5.0, which is the recommended minimum version. You should first make sure all the systems of interest have the latest versions of collectl installed, or at least ones at V3.3.6 or newer.
Colmux also provides the ability for dynamic interaction with the keyboard arrow keys if the optional perl module Term::ReadKey has been installed. To see if this is the case and that colmux can find it, run with -v and you should see the following:
colmux -v
colmux: 3.0  (Term::ReadKey: V2.30)
Restriction: Colmux requires passwordless ssh between it and all hosts it is monitoring.
The inclusion of a playback filename in the collectl command instructs colmux to run in playback mode and the use of colmux's -cols switch tells it to produce output in single-line format. By using various combinations of these switches you can get colmux to run in any of 4 distinct modes as shown in the following table:
                 Real-Time    Playback
Single-line      -cols        -command "-p filename" -cols
Multi-line       default      -command "-p filename"
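To make the table concrete, here are illustrative invocations of the four combinations, reusing the host range and playback path from the examples elsewhere on this page; which column numbers to pass to -cols depends on the collectl command and is best determined with -test, described further down:

# real-time, multi-line (the default format)
colmux -addr 'xx1n[1-5]' -command "-sn"

# real-time, single-line
colmux -addr 'xx1n[1-5]' -command "-sn" -cols 2,4

# playback, multi-line
colmux -addr 'xx1n[1-5]' -command "-p /var/log/collectl/*20101225* -sn"

# playback, single-line
colmux -addr 'xx1n[1-5]' -command "-p /var/log/collectl/*20101225* -sn" -cols 2,4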
Let's discuss these 4 options separately to give a better feel for what they actually mean and when you might use them. Note that the 2 operational modes have nothing to do with the way the data is displayed and the 2 formats have nothing to do with the way the data is collected - in other words a complete separation between form and function.
Real-Time Mode
If you've ever run collectl before, and you probably have if you're looking at these
utilities, you already know the real-time nature of the tool. The difference here is that
with colmux you're actually able to look at multiple systems at the same time.
By default, colmux runs in real-time mode unless you explicitly instruct it to
run in playback mode by including -p in the collectl command.
Playback Mode
The way you tell colmux to run in this mode is to simply point the collectl command at
a file with the -p switch as you would normally do when you want to play back a file.
The main restriction is that the file needs to exist on all the systems you've pointed
colmux at, and therefore wildcards are required for the portion of the filename that
includes the hostname and are often used for the timestamp portion as well.
Typically one simply uses a collectl playback filename in a format something like /var/log/collectl/*20101225*, wildcarding all but the date of interest. If you want, you can put all the collectl files in one directory on the same system colmux is running on, but this method is restricted to only running on the local system.
During playback by colmux, only the data falling in the same time interval will be reported and so the header that reports how many nodes have had their data included becomes more meaningful in case there is missing data.
Multi-Line Format
Once you've decided which systems you want to monitor and what collectl command
you want to execute, you need to decide whether you're interested in single- or
multi-line output. Most users will probably be interested in the multi-line
format, at least at first. Think of the Linux top command, but not limited
to just processes.
Multi-line format reports all data provided by collectl in its original format. Further, it sorts it by a column of the user's choice (the default is column 1) and presents as much data as will fit on the screen. The result is a top-like utility capable of reporting the top consumers of virtually any resource on the cluster, be it the more traditional process statistics or something more exotic such as memory or network consumption.
Since this IS native collectl data it can be virtually anything, including that provided by any custom import modules you may have written. You will also see the identical information in playback mode, though this is presented as scrolling text (there is also a -home switch to display the data in top format if you wish, but unless you include -delay it may scroll by too fast to be of use).
The main consideration with multi-line format is that colmux can only deal with collectl commands that themselves only produce single line output OR multiple lines that are all the same format, noting that data provided by a custom import are considered to be a single device themselves. These include:
While the column number to sort on should also be a consideration, and you can manually select it at startup with -column, you can easily change the sort column once colmux is running by either using the arrow keys (if Term::ReadKey has been installed) or simply typing the desired column number followed by the enter key. This works in both real-time and playback mode.
Below is an example of examining the network traffic on 5 nodes and sorting it by column 2. As you can also see, all 5 nodes have reported data for the interval being displayed and colmux highlights the selected column, though not as the bolded text shown in the examples that follow but rather in reverse video:
colmux -addr 'xx1n[1-5]' -command "-sn" -column 2

# Wed Dec 29 05:42:21 2010  Connected: 5 of 5
#        <----------Network---------->
#Host     KBIn  PktIn  KBOut  PktOut
xx1n1        9     82     42     326
xx1n2        9     77     41     320
xx1n5        8     75     41     318
xx1n4        8     74     41     317
xx1n3        8     71     40     314
Single-Line Format
Unlike multi-line format which only displays output for the top systems which can change
from interval to interval, in single-line format you always see the selected data
for all systems and it is never sorted but rather reported in a fixed
format. Therefore you need to tell colmux which data fields you're interested in when
you first start it up. To determine the correct column numbers you can either run the
desired collectl command manually and start counting columns, noting the first column is
always 1, or you can use colmux's -test switch, which you can also use in multi-line format.
This switch will display the header line of collectl's output including the hostname as column 0,
with the column(s) you have selected highlighted, as well as a list of all the columns
and their numbers for quick reference.
colmux -command "-sc" -test -cols 3,4

>>> Headers <<<
#         <--------CPU-------->
#Host     cpu  sys  inter  ctxsw

>>> Column Numbering <<<
 0 #Host    1 cpu      2 sys      3 inter    4 ctxsw
Once you have decided on the column numbers, there are a couple of other optional switches you may choose to use, including timestamps, data type totals and for very wide displays you can even request the columns be narrower and to divide each value by 1000 or 1024. To preface each line with a timestamp, you actually include the appropriate time format switch with the collectl command itself, rather than using a distinct colmux switch. caution: when including timestamps the column numbering is shifted appropriately and so you may want to use -test to be sure you're specifying the correct columns.
Here is an example of the same command to look at network data, except in this case colmux has been instructed to report data for only columns 2 and 4, to print time stamps at the beginning of each line and to report totals at the far right. As colmux first starts you can see the data being reported as all -1s since those systems have not yet sent any data back:
colmux -addr 'xx1n[1-5]' -command "-sn -oT" -cols 2,4 -coltot

#Time     xx1n1 xx1n2 xx1n3 xx1n4 xx1n5 | xx1n1 xx1n2 xx1n3 xx1n4 xx1n5 |  KBIn KBOut
05:29:48     -1    -1    -1    -1    -1 |    -1    -1    -1    -1    -1 |     0     0
05:29:49      2     2     2     2     2 |     0     0     0     0     0 |    10     0
05:29:50      2     2     2     2     2 |     0     0     0     0     0 |    10     0
05:29:51      9    10     4     8     9 |    41    42     3    41    42 |    40   169
05:29:52      2     2     9     2     2 |     0     0    40     0     0 |    17    40
05:29:53      2     2     2     2     2 |     0     0     0     0     0 |    10     0
05:29:54      2     2     2     2     2 |     0     0     0     0     0 |    10     0
The following screenshot is an example of looking at Infiniband traffic between 16 clients writing to 4 lustre servers and even though the font is small, you can still make out the patterns of the column widths changing. The left half of the display shows network received KB and the right half network transmitted KB. The first 4 columns in each section are the lustre servers and the next 16 columns the clients. As expected during a client write test, the lustre servers show high receive traffic and the clients show high transmit traffic. Look how easy it is to see drops in the client transmission rates even if you can't easily read the numbers. Also notice that the second client isn't doing any transmitting at all, and since it's not displaying -1 we know collectl is running correctly.
Here's an even more dense example showing CPU load on a large cluster which is so wide it takes 3 monitors to display it all. Even though you can't read the output you can still see different patterns as some systems start/stop and others sit idle.
updated March 9, 2015
Getting started using collectl may seem a little challenging to the new user because of its many options, but it shouldn't be. After all, how many people simply run the top command and don't even realize there are a rich set of options available? In that same spirit, you can simply enter the collectl command and get a lot of useful information, but you would also be losing out on a lot. The intent of this tutorial is to give you a better appreciation of what you can do with collectl and hopefully encourage you to experiment with even more options than those described below.
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
  30  30   254     65      8      2   7920     97      0      4      0      2
  10  10   377     65      0      0  32500    282      4     52      2     19
  10  10   332     61      0      0  29312    246      0      3      0      3
   9   9   330     65      0      0  32512    275      3     45      1      9
  11  11   331     53      4      1  29684    270      0      2      0      2
   8   8   352     63      0      0  35004    273      3     33      1      8
  13  12   329    116      0      0  28924    249      0      2      0      2
Average transfer rates:  32051995 bytes/sec, 31300.776 Kbytes/sec

If we compare the write rates to the number of writes we can also infer writes of about 128KB (for example, 35004 KB written / 273 writes ≈ 128 KB per write), which is good to know because that means we're being efficient in the size of the data blocks being handed to the driver. However, if we don't mind using the extra columns, we can include --iosize, which tells collectl to include the average I/O size when using this default display format, also known as brief mode. In verbose mode the I/O sizes are always included.
#<--------CPU--------><---------------Disks---------------->
#cpu sys inter  ctxsw KBRead  Reads   Size KBWrit Writes   Size
   9   8   381     71      0      0      0  30644    276    111
  14  13   325     85      0      0      0  32888    258    127
  11  10   313     80      0      0      0  31064    261    119
  12  11   421    186      0      0      0  32376    276    117
This may also be a good time to mention screen real estate. There is a lot of information that collectl can display and everything takes space! More often than not you don't really care about time and so by default it isn't displayed. However, there may be times you do care and so you can simply add the switch -oT to add time to the display. In fact, sometimes you may want to include the date as well, in which case -oD will do both. You can even show the times in msec by including m with -o, which can be useful when running at sub-second monitoring levels and/or if you want to correlate data to system or application logs which may themselves have finer grained time. Here's an example of the command collectl -scd -i.25 -oDm which shows the cpu and disk loads every quarter second and includes the date and time in msecs:
#                      <--------CPU--------><----------Disks----------->
#Date    Time          cpu sys inter  ctxsw KBRead  Reads KBWrit Writes
20080212 11:22:47.008    2   0   364     84      0      0  31328    284
20080212 11:22:47.258    8   6   392     92      0      0  30832    356
20080212 11:22:47.508    8   6   308     84      0      0  36256    268
20080212 11:22:47.758    2   0   292     44      0      0  31152    196
So what about that CPU load? Given that this is a 2 CPU system we might be interested in seeing how that load is being distributed by running the command collectl -sC, since an uppercase subsystem type, whether for cpu, disk or network, tells collectl to show instance level details:
# SINGLE CPU STATISTICS
#   CPU  USER NICE  SYS WAIT IRQ  SOFT STEAL IDLE
      0     0    0   17    0   0     0     0   83
      1     0    0    4    0   0     0     0   96
      0     0    0   14    0   0     0     0   86
      1     0    0    0    0   0     0     0  100
      0     0    0   20    0   0     0     0   80
      1     0    0    0    0   0     0     0  100
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
  38  37   248    189   7283    111      0      0      1      9      1      8
  24  23   153     81     32      0      0      0      2     32      1      9
Average transfer rates:  872960833 bytes/sec, 852500.813 Kbytes/sec

This in fact confirms that reads are coming from cache and not disk, since no local disk can read at this rate! In general, when doing disk I/O testing one should use file sizes that are larger than cache to force all I/O to come from disk. So repeating the tests with a larger file we now see more realistic read rates:
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
   9   8   773    743  41376    629      0      0      1      8      1      7
   9   8   619    639  31716    476      0      0      2     33      1      8
  16  15   510    554  23016    370      0      0      0      4      0      2
  10  10   572    624  27272    429      0      0      2     27      1      8
  16  15   458    504  19560    306     12      2      0      4      0      2
#<--------CPU--------><-----------Memory----------><----------Disks----------->
#cpu sys inter  ctxsw free buff cach inac slab  map KBRead  Reads KBWrit Writes
   3   0   159     80   2G 395M 189M   1M    0    0      0      0     20      3
   1   0   153     52   2G 395M 189M   1M    0    0      0      0      0      0
  43  42   238     68   2G 395M 340M 152M    0    0      0      0   3060     72
  25  25   376     53   1G 395M 431M 242M    0    0      0      0  29808    273
   6   6   377     59   1G 395M 455M 266M    0    0      0      0  30900    266
  10  10   347     55   1G 395M 492M 303M    0    0      0      0  35004    265
   5   4   389     60   1G 395M 506M 318M    0    0      0      0  27308    262
#<--------CPU--------><-----------Memory----------><----------Disks----------->
#cpu sys inter  ctxsw free buff cach inac slab  map KBRead  Reads KBWrit Writes
   1   1   374     91 171M 397M   2G   2G    0    0      0      0  34624    288
   1   1   368     82 171M 397M   2G   2G    0    0      0      0  31408    260
   2   2   319     56 171M 397M   2G   2G    0    0      0      0  31148    266
   0   0   385     70 172M 397M   2G   2G    0    0      0      0  25844    273
   0   0   167     70 172M 397M   2G   2G    0    0      0      0      0      0
   0   0   173     51 172M 397M   2G   2G    0    0      0      0      0      0
   2   0   181    108 172M 397M   2G   2G    0    0      0      0     12      2
  41  41   148     52   2G 397M 204M  15M    0    0      0      0     72      5
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
   0   0   145     48      0      0      0      0      2     38      2     13
  13  13  6716   3491      0      0      0      0    136    682  21144  14762
  18  18  6802   3426      0      0      0      0    248   1256  39111  27278
  14  14  4680   2420      0      0     28      2    252   1256  40166  28008
   7   7  3105   1520      0      0      0      0    148    752  23256  16228
#<--------CPU--------><----------Network----------><------NFS Totals------>
#cpu sys inter  ctxsw  KBIn  PktIn  KBOut PktOut  read write  meta  comm
  19  19  1672    429     1     11      2     12     0  3885     6     0
  27  27  8466  12909  1652  20875  56112  39495     0 19383     0     4
   9   9  4042   1632   301   3781  10125   7129     0  9508     0     0
   7   7 18677   9074  3557  44897 120375  84729     0     0     0     0
   8   8 18082   8874  3559  44928 120359  84717     0     0     0     0
So in conclusion, you can see there is really quite a lot you can do with just a few basic switches and I haven't even gotten into --verbose, which as they say is an exercise left for the student. So try some simple dt tests yourself or use your own personal favorite load generator, while trying out collectl -sc --verbose or collectl -sm --verbose or even collectl -sn --verbose. You can even put them all together as collectl -scmn --verbose, but then as you'll see you end up using a lot of that valuable screen real estate. As a final bonus, try adding the --home switch which moves the cursor to the home (upper left-hand corner) position of the screen. Think of this as something like the linux top command (collectl also has a --top switch for displaying slab/process data) since each sample is displayed at the top of the screen. That command would then look like collectl -smcn --verbose --home.
enjoy...
updated Feb 21, 2011
With the release of Collectl Version 3.3.1, one can now send collectl data directly to a ganglia gmond in binary format using the custom export gexpr.ph. This results in several benefits for existing users of ganglia:
As of V3.5 of collectl, the experimental status of this capability has been lifted since enough people are currently using it to verify it works as described.
This configuration assumes one has set up ganglia gmonds in a hierarchy, such that those at the bottom of the tree collect system statistics and send them up to a higher level aggregation gmond and ultimately percolate up to the gmetad which writes the data to a round-robin database.
To use collectl as a data source there are 2 alternatives. In the diagram below at the left you simply have collectl send data to a local gmond to supplement whatever data it is already collecting, noting there won't be any way to record any of gmond's data locally. The diagram at the right would replace all gmonds and have collectl do all the data collection, optionally logging data locally, and sending data to an aggregation level gmond. Whichever method you choose, you must ensure the gmond(s) are listening on their udp receive channel and enable the ganglia communications feature in collectl as described in the following sections. There are probably other hybrid configurations which are beyond the scope of this document as well as this author.
[diagrams: left - collectl feeding a local gmond alongside its normal collection; right - collectl replacing the local gmonds and sending directly to an aggregation gmond]
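Whichever layout you choose, the receiving gmond needs a udp receive channel on the port you point collectl at. As a hedged sketch - the port is simply the one used in the examples on this page, so adjust it for your site - the relevant gmond.conf fragment might look like:

udp_recv_channel {
  port = 8108
}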
One last component is configuring the rrd to use the metrics being supplied by collectl and the details of that discussion are beyond the scope of this document as well.
The following example shows collectl gathering data on many system components but only sending cpu, disk, memory and network data to a gmond on the system named gmond using port 8108. It sends that set of data every 20 seconds while writing all the collected data in plot format to a local file in /tmp every 5 seconds.
This module also supports sending its output to a multicast address, which is used if the hostname is an address in the range of 225.0.0.0 through 239.255.255.255. To use this feature you will have to first install IO::Socket::Multicast which in turn requires the module IO::Socket::Interface be installed as well. Note that both these modules may be updated and so you should verify you're actually installing the latest ones.
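A quick way to check whether the multicast module is already installed and visible to perl is a one-liner like the following; if the module is missing, perl reports a "Can't locate" error rather than a version number:

perl -MIO::Socket::Multicast -e 'print "$IO::Socket::Multicast::VERSION\n"'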
collectl -scdfijmntx --export gexpr,gmond:8108,s=cdmn,i=20 -i5 -f /tmp -P
collectl -scd --export gexpr,192.168.253.168:2222,d=1

07:32:11.004 Name: cputotals.user Units: percent Val: 0
07:32:11.004 Name: cputotals.nice Units: percent Val: 0
07:32:11.004 Name: cputotals.sys Units: percent Val: 0
07:32:11.004 Name: cputotals.wait Units: percent Val: 0
07:32:11.004 Name: cputotals.irq Units: percent Val: 0
07:32:11.004 Name: cputotals.soft Units: percent Val: 0
07:32:11.004 Name: cputotals.steal Units: percent Val: 0
07:32:11.004 Name: cputotals.idle Units: percent Val: 99
07:32:11.004 Name: ctxint.ctx Units: switches/sec Val: 173
07:32:11.004 Name: ctxint.int Units: intrpts/sec Val: 1031
07:32:11.004 Name: ctxint.proc Units: pcreates/sec Val: 4
07:32:11.004 Name: ctxint.runq Units: runqSize Val: 238
07:32:11.005 Name: disktotals.reads Units: reads/sec Val: 0
07:32:11.005 Name: disktotals.readkbs Units: readkbs/sec Val: 0
07:32:11.005 Name: disktotals.writes Units: writes/sec Val: 0
07:32:11.005 Name: disktotals.writekbs Units: writekbs/sec Val: 0
collectl -scd --export gexpr,1.2.3.4:5,d=9,g

05:43:19.003 Name: cpu_user Units: percent Val: 0 TTL: 5 sent
05:43:19.004 Name: cpu_nice Units: percent Val: 0 TTL: 5 sent
05:43:19.004 Name: cpu_system Units: percent Val: 1 TTL: 5 sent
05:43:19.004 Name: cpu_wio Units: percent Val: 0 TTL: 5 sent
05:43:19.004 Name: cpu_idle Units: percent Val: 99 TTL: 5 sent
05:43:19.004 Name: cpu_aidle Units: percent Val: 99 TTL: 5 sent
05:43:19.005 Name: cpu_num Units: CPUs Val: 1 TTL: 5 sent
05:43:19.005 Name: proc_total Units: Load/Procs Val: 141 TTL: 5 sent
05:43:19.005 Name: proc_run Units: Load/Procs Val: 0 TTL: 5 sent
05:43:19.005 Name: load_one Units: Load/Procs Val: 0 TTL: 5 sent
05:43:19.005 Name: load_five Units: Load/Procs Val: 0 TTL: 5 sent
05:43:19.006 Name: load_fifteen Units: Load/Procs Val: 0 TTL: 5 sent
This is an example of what you should expect to see, noting that in this case gmond is still configured to collect its standard metrics, which can be disabled in gmond.conf since you no longer need these.
gmond -d 2

loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
udp_recv_channel mcast_join=NULL mcast_if=NULL port=8108 bind=NULL
tcp_accept_channel bind=NULL port=8109
Processing a metric metadata message from cag-dl585-02.cag
***Allocating metadata packet for host--cag-dl585-02.cag-- and metric --cputotals.user--
**** saving metadata for metric: cputotals.user host: cag-dl585-02.cag
Processing a metric value message from cag-dl585-02.cag
***Allocating value packet for host--cag-dl585-02.cag-- and metric --cputotals.user--
**** saving metadata for metric: cputotals.nice host: cag-dl585-02.cag
updated Feb 21, 2011
The mechanism for including custom recording/reporting code into collectl is very similar to that for exporting custom data. One uses the switch --import followed by one or more file names, separated by colons. Following each file name are one or more file-specific arguments which if specified are comma separated as shown below:
collectl --import file1,d:file2
As a reference, a simple module has been included in the same main directory as collectl itself, which is named hello.ph as collectl's version of Hello World. Since it can't read anything from /proc it is hardcoded to generate 3 lines of data with increasing data values. Beyond that bit of hand-waving, everything else it does is fully functional. You can mix its output with any standard collectl data, record to raw or plot files, play back the data and even send its output over a socket.
From time to time additional import modules may be included in collectl which may also be used as reference. For example, the module misc.ph is now also part of collectl. It imports data about the uptime, number of people logged in, the cpu frequency and the number of mounted nfs file systems.
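For instance, combining the two modules just mentioned with standard CPU data is as simple as separating them with a colon - a hedged illustration, assuming both .ph files are somewhere collectl can find them:

collectl -sc --import hello:misc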
It should be noted that although collectl itself does not use strict, which is a long story, it is recommended these routines do. This will help ensure that they do not accidentally reuse a variable of the same name as one collectl uses and step on it.
A couple of words about performance
One of the key design objectives for collectl is efficiency and it is indeed very lightweight, typically using less than 0.2% of the CPU when sampling once every 10 seconds. Another way to look at this is it often uses less than 192 CPU seconds in the course of an entire day. If you care about overhead, and you should, be sure to be as efficient as you can in your own code. If you have to run a command to get your data instead of reading it from /proc, that will be more expensive. If that command has to do a lot of work, it will be even more expensive.
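To get a feel for the difference, here is a small stand-alone perl sketch, independent of collectl, that times 1000 direct reads of /proc/stat against 1000 reads that each fork and exec an external command; exact numbers vary by system, but the exec'd version is typically far slower:

use strict;
use warnings;
use Time::HiRes qw(time);

# time 1000 direct reads of /proc/stat
my $start = time;
for (1..1000) {
    open my $fh, '<', '/proc/stat' or die "can't open /proc/stat: $!";
    local $/;                       # slurp the whole file in one read
    my $data = <$fh>;
    close $fh;
}
printf "direct /proc reads: %.3f secs\n", time - $start;

# time 1000 reads that each fork/exec an external command
$start = time;
for (1..1000) {
    my $data = `cat /proc/stat`;
}
printf "exec'd commands:    %.3f secs\n", time - $start;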
It is recommended you take advantage of collectl's built-in mechanism for measuring its own performance. For example, measuring the performance of the hello.ph module, which does almost nothing since it doesn't even look at /proc data, uses less than 1 second to read 8840 samples on an older 2GHz system, which is the equivalent of a full day's worth of sampling. Monitoring CPU performance data takes about 3-1/2 seconds and memory counters take about 7 seconds, just to give a few examples of the more efficient types of data it collects.
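If you'd rather measure from the outside, one rough approximation is to borrow the timing technique shown earlier on these pages and compare a fixed number of samples with and without your import, attributing the difference to your module; the sample count and paths below are only illustrative:

time collectl -sc -i0 -c 8640 -f /tmp
time collectl -sc -i0 -c 8640 -f /tmp --import hello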
Collectl is relatively big, at least for a perl script, consisting of over 100 internal subroutines, most of which are simply for internal housekeeping, but some of which are more general purpose. It also keeps most of its statistical data in single variables and one dimensional arrays. Clearly hashes could make it more convenient for passing data around, but it was felt that the use of more complex data structures would generate more overhead and so their use has been minimized.
While it is literally impossible to enumerate them all, there are a relatively small number of functions, variables and constants that should be considered when writing your routines to ensure a more seamless integration with collectl. The following table is really only a means to get started. If you need more details of what a function actually does or how a variable is used, read the code.
Function | Definition |
cvt() | Convert a string to a maximum number of characters, appending 'K', 'M', etc as appropriate. Can also be instructed to divide counters by 1000 and sizes by 1024. |
error() | Report error on terminal and if logging to a message file write a type 'E' message. Then exit |
fix() | When a counter turns negative, it has wrapped. This function will convert it to a positive number by adding back the size of a 32-bit word OR a user specified data width (see the short sketch after this table). |
getexec() | Execute a command and record its output to a raw file when operating in collectl mode, prepended with the supplied string |
getproc() | Read data from /proc, prepending a string as with getexec except in this case you can also instruct it to skip lines at the beginning or end. See the function itself for details |
record() | Only needed if not using getproc(), which will call it for you. It writes a single line of data to a raw file when in record mode or calls the appropriate print routines in interactive mode. |
Variable | Definition |
$datetime | The date/time stamp associated with the current set of data, in the user requested format, based on the use of -o. See the constant $miniFiller which is a string of spaces of the same width. |
$intSecs | The number of seconds in the current interval. This is not an integer. |
Constants | Definition |
$miniFiller | A string of spaces, the same number of characters as in the $datetime variable |
$rate | A text string that is set to /secs and appended to most of the verbose format headers, indicating rates are being displayed. However, if the user specifies -on with the collectl command to indicate non-normalized data, it is set to /int to indicate per-interval data is being reported. |
$SEP | This is the current plot format separator character, set to a space by default, but can be changed with --sep so never hard code spaces into your plot format output. |
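As a stand-alone illustration of the counter-wrap arithmetic fix() performs - this is not collectl's actual code, just the idea - consider a 32-bit kernel counter that rolled over between two samples:

use strict;
use warnings;

my $last  = 4294967000;            # previous sample of a 32-bit counter
my $now   = 120;                   # current sample - the counter wrapped past 2^32
my $delta = $now - $last;          # negative, so a wrap occurred
$delta += 2**32 if $delta < 0;     # add back the size of a 32-bit word
print "counter grew by $delta\n";  # prints 416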
Function | Definition |
Analyze | Examine performance counters and generate values for current interval |
GetData | Read performance data from /proc or any other mechanism of choice |
GetHeader | During playback only, supply the header for additional initialization |
Init | One time initializations are performed here |
InitInterval | Initializations required for each processing cycle |
IntervalEnd | Optional routine, called at end of each processing cycle if defined |
PrintBrief | Build output strings for brief format |
PrintExport | Build output strings for formatting by gexpr, graphite and lexpr, which are 3 standard collectl --export modules |
PrintPlot | Build output string in plot format |
PrintVerbose | Build output string in verbose format |
UpdateHeader | Add custom line(s) to all file headers |
There are also several constants that must be passed back to collectl during initialization. See Init() for more details.
Analyze($type, \$data)
This function is called for each line of recorded data that begins with the qualifier string that has been set in Init. Any lines that don't begin with that string will never be seen by this routine. You should also be sure that string is unique enough that you aren't passed data you don't expect.
GetData()
This function takes no arguments and is responsible for reading in the data to be recorded and processed by collectl and as such you should strive to make it as efficient as possible. If reading data from /proc, you can probably use the getproc() function, using 0 as the first parameter for doing generic reads. If you wish to execute a command, you can call getexec() and pass it a 3 which is its generic argument for capturing the entire contents of whatever command is being executed.
If you want to do your own thing you can basically do anything you want, but be sure to call record() to actually write the data to the raw file and optionally pass it to the analysis routine later on.
In any case, each record must use the same discriminator that Analyze is expecting so collectl can identify that data as coming from this module. You may also want to look at the data gathering loop inside of collectl to get a better feel for how data collection works in general.
To make sure you're collecting data correctly, run collectl with -d4 as shown below for reading socket data, which uses the string sock as its own discriminator. The Analyze routine then needs to look at the second field to identify how to interpret the remainder of the data line.
collectl -ss -d4

>>> 1238001124.002 <<<
sock sockets: used 405
sock TCP: inuse 10 orphan 0 tw 0 alloc 12 mem 0
sock UDP: inuse 8 mem 0
sock RAW: inuse 0
sock FRAG: inuse 0 memory 0
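To tie GetData and Analyze together, here is a rough sketch of the pattern being described. The routine names, the 'myd' discriminator and the choice of /proc/sys/fs/file-nr are all made up for illustration, and the exact getproc() argument order is an assumption - hello.ph remains the authoritative reference for the naming conventions and for the constants Init must return:

use strict;
use warnings;

our $intSecs;                      # collectl global: seconds in the current interval
my ($filesLast, $filesRate) = (0, 0);

# one-time setup; return 1 for success, -1 to have the module unhooked
sub myInit
{
    my ($optionsref, $keyref) = @_;
    $filesLast = 0;                # previous-value counters start at 0 so the
                                   # first (never reported) interval won't warn
    return 1;
}

# read the raw data each interval; every recorded line must begin with a
# discriminator unique to this module ('myd' here) so Analyze gets it back
sub myGetData
{
    # ASSUMPTION: the argument order here is illustrative only - check
    # getproc() itself; conceptually this records one line of
    # /proc/sys/fs/file-nr prefixed with 'myd'
    getproc(0, '/proc/sys/fs/file-nr', 'myd');
}

# called once for every recorded line beginning with 'myd'
sub myAnalyze
{
    my ($type, $dataref) = @_;
    my ($filesNow) = $$dataref =~ /(\d+)/;    # first numeric field: open file handles
    $filesRate = ($filesNow - $filesLast) / $intSecs;
    $filesLast = $filesNow;
}

1;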
GetHeader(\$header)
This function is called when one needs to know what is stored in the header of a file being played back - it is only called if collectl doesn't find it and is therefore optional. While it is impossible to know how a module will use GetHeader, it is often used to retrieve instance numbers, such as how many nvidia GPUs might have been monitored or their type (see nvidia.ph), both of which would have been written to the header using a call to UpdateHeader.
Since standard collectl processing is to always play back what the user requests, even if that data hadn't actually been collected, the same holds true here. If one has gotten to this point and it is determined there is no data, the API does not provide a failure return code as Init does. Rather, one would simply end up reporting 0s for all values.
Init(\$options, \$key)
This function is called once by collectl, before any data collection begins. If there are any one-time initializations of variables to do, this is the place to do them. For example, when processing incrementing variables one often subtracts the previous value from the current one and this is the ideal place to initialize their previous values to 0. Naturally that will lead to erroneous results for the first interval, which is why collectl never includes those stats in its output. However, if you don't initialize them to something you will get uninitialized variable warnings the first time they're used.
Upon completion, return 1 to indicate success. Returning -1 indicates failure and results in the imported module's function calls being removed from collectl's call stack, so they will no longer be called.
InitInterval()
During each data collection interval, collectl may need to reset some counters. For example, when processing disk data, collectl adds together all the disk stats for each interval which are then reported as summary data. At the beginning of each interval these counters must be reset to 0 and it's at that point in the processing that this routine is called.
IntervalEnd()
As described earlier, if this routine exists it is called at the end of an interval processing cycle. This makes it possible to do any post processing that may be required before the start of the next interval. In many cases this is not necessary.
PrintBrief($type, \$line)
The trick with brief mode is that multiple types of data are displayed together on the same line. That means each imported module must append its own unique data to the current line of output as it is being constructed, without any carriage returns. Further, since there are 2 header lines and brief format supports the ability to print running totals when one presses return during processing, there are a number of places one needs to have their code called from.
PrintExport($type, \$ref1, \$ref2, \$ref3, \$ref4)
What about custom export modules and how does this affect them? The good news is that at least the standard 3 modules, lexpr, gexpr and graphite, all support --import. In other words they too have callbacks that you must respond to if your code is being run at the same time as one of these.
Again, see hello.ph for an example, but suffice it to say you need to do something when called, even if only a null function is supplied.
lexpr can write its output to the terminal and so the easiest way to test this is to just run collectl and have it display on the terminal. However, the output of gexpr and graphite is binary and so the easiest way to test this code is to tell them not to open a socket (though you must supply an address/port for gexpr, even if invalid) and print the data elements they are about to send to the terminal by running with a debug value of 9, noting this is gexpr's & graphite's own internal debugging switch and not collectl's. The 8 bit tells them to not open the output socket and the 1 bit tells them to display their output nicely formatted on the terminal.
collectl --import hello --export gexpr,1.2.3.4:5,d=9

Name: hwtotals.hw Units: num/sec Val: 140
Name: hwtotals.hw Units: num/sec Val: 230
Name: hwtotals.hw Units: num/sec Val: 320
PrintPlot($type, \$line)
This type of output is formatted for plotting, which can get quite complicated based on whether you are writing to a terminal, multiple files or a socket. Fortunately all that headache is handled for you by collectl. All you need to do is append your summary or detail data to the current line being constructed, similar to the way brief data is handled. Since it has to handle both headers as well as data, there are 4 types included in the call.
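Continuing the hypothetical sketch from the GetData/Analyze section above, the data portion of such a callback might be as small as appending the interval's value with the separator constant rather than a hard-coded space:

sub myPrintPlot
{
    my ($type, $lineref) = @_;
    our $SEP;                                        # collectl's plot separator (see table above)
    # a real module would also inspect $type to emit header text vs data
    $$lineref .= sprintf("%s%d", $SEP, $filesRate);  # honor --sep; never hard-code spaces
}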
PrintVerbose($printHeader, $homeFlag, \$line)
Like PrintBrief, this routine is in charge of printing verbose data but is much simpler since it doesn't have to insert code into the middle of running strings.
UpdateHeader(\$line)
updated Feb 21, 2011