-*-Mode: outline-*-

* Introduction

This is a snapshot of q-tools, which currently includes a system-wide
profiler (q-syscollect), a gprof-style profile visualizer (q-view),
and a call-graph visualizer (q-dot).  The current snapshot has many
rough edges and I plan to round them off over time.  However, since
the tools can already be quite useful and since I won't be able to
work on them for a (short) while, I thought it better to release
them rather than sit on them.  Also, since q-tools is released
under the GPL, everybody is welcome to submit improvements.

Quick example:

======================================================================
$ q-syscollect -k -t 10   # collect kernel activity for 10 seconds
$ q-view kernel-cpu0.info # display the profile for CPU 0
Flat profile of CPU_CYCLES in kernel-cpu0.hist#1:
 Each histogram sample counts as 1.00057m seconds
% time      self     cumul     calls self/call  tot/call name
 86.76      8.64      8.64         -         -         - cpu_idle
 13.10      1.30      9.95     1.12G     1.17n     1.17n default_idle
  0.02      0.00      9.95         -         -         - sys_select
  0.02      0.00      9.95         -         -         - e1000_intr
    :
Call-graph table:
index %time      self  children         called     name
                                                       <spontaneous>
[6]   100.0      8.64      1.30         -          cpu_idle
                 1.30      0.00     1.12G/1.12G        default_idle [7]
----------------------------------------------------
                 1.30      0.00     1.12G              cpu_idle [6]
[7]    13.1      1.30      0.00     1.12G          default_idle
----------------------------------------------------
                                                       <spontaneous>
[0]     0.0         -      0.00         -          do_IRQ
                    -      0.00     25.2k/25.2k        note_interrupt [5]
                    -      0.00     25.2k/25.2k        lsapic_noop [4]
                    -      0.00     25.2k/25.2k        handle_IRQ_event [1]
    :
======================================================================

* Prerequisites

q-tools has only been tested on Itanium 2 (McKinley, Madison and Montecito).
The tools may work on Itanium 1, but even if they do, only limited functionality
will be available (in particular, call-count sampling isn't supported on
Itanium 1, due to hardware limitations).

The following software must be present for q-syscollect:

 - Linux kernel v2.6.5 or later
	- CONFIG_PERFMON must be enabled
	- CONFIG_KALLSYMS must be enabled to get kernel symbol names
 - libpfm 3.2-060512 or newer. You can get a copy at http://perfmon2.sf.net.
          (Debian package libpfm3-dev; package pfmon is also recommended)
 - glibc v2.3.2 or later (the only thing needed here is sched_setaffinity())

The following software must be installed for q-view/q-dot:

 - guile v1.6 (e.g., Debian package guile-1.6)

Note: Since guile has good support for large numbers, it is perfectly
      OK to analyze profile data collected on a 64-bit machine with a
      32-bit machine.

* Building

After downloading the tarball, extract it with

 $ tar xzvf q-tools-VERSION.tar.gz

build it with:

 $ cd q-tools-VERSION; make

and then install as root with:

 $ make install

By default, q-tools are installed under /usr/local.  To use a different
target directory, use:

 $ make install prefix=PREFIX

where PREFIX is the desired target prefix (e.g., /usr).

If the above commands all succeeded, you should then be able to run
the commands given in the "Quick example" and get similar results.

* q-syscollect

This program uses statistical techniques to collect both the
execution-time profile and the call-graph of the programs running on a
particular machine.  The duration for which profiles are collected
can be specified via the -t argument, which takes a time in seconds,
or by specifying a command to execute.  For example,

 $ q-syscollect gcc test.c -o test

would profile _all_ programs (including kernel-mode execution) while
"gcc" is compiling test.c.  Profiling can be limited to user-level
and/or kernel-level execution with the -u and the -k options.

The collected profiles are stored in separate files in a subdirectory
called ".q".  The filenames in this directory have the format:

  PROGNAME-pidPID-cpuCPU.TYPE#VERSION

where:

  PROGNAME:	name of the program this profile is for
  PID:		the process id this profile is for
  CPU:		the CPU this profile is for
  TYPE:		one of "info", "hist", or "edge"; "info" contains
		general profile information, "hist" the execution-time
		histogram, and "edge" the call-graph profile
  VERSION:	a sequential version number, which gets incremented whenever
		there would otherwise be a filename collision; for example,
		with the NPTL thread-library, each thread in a multi-threaded
		program gets a separate version number

A sample filename might be:

  emacs-pid22148-cpu0.info#0

Note that there are separate files for each CPU that a program executed
on.  In other words, if there are lots of CPUs and the program
migrates often from one CPU to another, you'll get lots of profiles!
There are currently no tools to merge profiles, but that's obviously
one area of future enhancements.
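To illustrate the naming scheme, here is a small Python sketch that
splits such a filename back into its components (this helper is purely
illustrative and not part of q-tools):

```python
import re

# Matches the PROGNAME-pidPID-cpuCPU.TYPE#VERSION scheme described above.
PATTERN = re.compile(
    r'(?P<prog>.+)-pid(?P<pid>\d+)-cpu(?P<cpu>\d+)'
    r'\.(?P<type>info|hist|edge)#(?P<version>\d+)$')

def parse_profile_name(name):
    """Split a q-tools profile filename into its components."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError('not a q-tools profile filename: %s' % name)
    d = m.groupdict()
    d['pid'] = int(d['pid'])
    d['cpu'] = int(d['cpu'])
    d['version'] = int(d['version'])
    return d

print(parse_profile_name('emacs-pid22148-cpu0.info#0'))
```

(The kernel profiles from the quick example use the shorter
kernel-cpuCPU.TYPE form and would not match this pattern.)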

In addition to the normal profile files, the .q directory also contains
hidden files of the form:

	.FILENAME.crc32.CRC

where:

  FILENAME:	the name of a file that is in some way needed in order
		to analyze the profile data; for the kernel, this
		is a copy of /proc/kallsyms; for normal programs, each
		shared library and program file gets such an entry
  CRC:		the CRC32 checksum of the file's contents

By default, the .q directory only stores links to the underlying files
(except for /proc/kallsyms, which is always copied to .q to ensure
availability of the proper symbols even when the machine gets rebooted
with a different kernel).  However, you can force the copying of such
files by setting an environment variable like so (assuming Bourne-shell
syntax):

	Q_COPY_METHOD=copy
	export Q_COPY_METHOD

This is useful when you expect to collect profiles on one machine and
analyze them on another.  By forcing copying, you can simply copy the
entire .q directory to the analyzing machine and be assured that you
got the right set of files to analyze the data with.

Storing the checksum as part of these filenames serves two purposes:
first, it ensures that filename collisions do not occur (i.e., we
don't have to store the entire path of a file) and, second, it ensures
that we have to maintain only one copy of each unique file which keeps
disk-space consumption in check.  (Of course, with a good checksum, it
would be sufficient to use just the checksum as the filename, but
including the original filename can be helpful to get an idea what all
those files are for.)
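The checksum naming can be mimicked with Python's zlib module.  This
is just a sketch: whether q-syscollect renders the CRC in hexadecimal
or decimal is an assumption here, so inspect your own .q directory for
the exact format:

```python
import zlib

def crc_name(filename, data):
    """Build a ".FILENAME.crc32.CRC" style name for a file's contents.

    NOTE: rendering the CRC as 8 hex digits is an assumption of this
    sketch, not something the q-tools sources guarantee.
    """
    crc = zlib.crc32(data) & 0xffffffff
    return '.%s.crc32.%08x' % (filename, crc)

# Identical contents always map to the same name, which is what keeps
# only one copy of each unique file around.
print(crc_name('kallsyms', b'abc'))
```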

** Blind-spot-free profiling

Starting with version 0.2, q-syscollect supports the -i option, which
enables blind-spot-free profiling.  When turning on this option,
q-syscollect is able to profile execution time even in kernel code
which has interrupts disabled (masked).  In fact, it can even monitor
time spent in very low-level handlers, such as the software TLB-miss
handler or the time spent executing in firmware.

There are several caveats to using this option at the moment:

 - Profiling granularity is reduced to basic-blocks.
	If you just use q-view to analyze the profile, this won't affect
	you.  However, if you're interested in knowing exactly which
	instruction causes stalls, then turning on -i won't help you
	because accuracy is limited to a (dynamic) basic-block.

 - A kernel-patch is needed to get quantitatively accurate results.
	If you just want to get a rough idea of whether the kernel is
	spending lots of time in blind spots (such as
	interrupt-handlers), just turning on -i for q-syscollect
	should be sufficient.  However, if you want to get
	quantitatively accurate results, you should apply the patch
	from the q-syscollect/kernel-patches/ directory that is
	closest to your kernel version and turn on
	CONFIG_IA64_INTERRUPTION_PROFILING.  Otherwise, some of the
	time spent in interrupt handlers could be misattributed.
	Applying this patch slows down kernel execution a bit.
	Microbenchmarks indicate that the slow-down is on the order of
	10-20 cycles, so usually this is in the noise.

 - Only a flat profile is collected.
	Blind-spot-free profiling uses the same hardware as q-syscollect
	normally uses for collecting the call-counts.  Thus, when turning
	on the -i option, q-syscollect will not collect call-counts and
	you'll only get a flat profile, not a call-graph with call-counts.
	In the future, we may multiplex the hardware such that this
	deficiency is eliminated.

	Having said that, note that the call-counts which q-syscollect
	obtains are valid even for a profile obtained with -i.  In
	other words, you can combine the call-counts of a q-syscollect
	run without -i with the flat profile of a q-syscollect run
	with -i and get a valid profile, assuming that the measured
	workload didn't change significantly.

 - Profiling frequency must not be too high.
	The hardware is limited to collecting a single sample while
	interrupt-delivery is masked.  To get accurate results, it is
	recommended to choose the sampling rate lower than the inverse
	of the maximum expected time during which interrupts are
	masked.  For example, if you expect that interrupts are
	disabled for at most 10ms, then the sampling rate should be no
	greater than 1/10ms = 100 Hz.  Note that, by default,
	q-syscollect uses a sampling rate of 1000 Hz.  Use the -C
	option to set the desired code-sampling rate.
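The rate calculation from this last caveat is simple arithmetic; the
following sketch just restates the 10ms example from above:

```python
def max_safe_rate_hz(max_masked_ms):
    """Highest safe code-sampling rate in Hz, given the longest interval
    (in milliseconds) during which interrupts may stay masked.  The
    hardware can buffer only one sample while interrupts are masked, so
    the sampling period must exceed that interval."""
    return 1000.0 / max_masked_ms

# Interrupts masked for at most 10ms => sample at no more than 100 Hz
# (note the default of 1000 Hz would be too high in that case):
print(max_safe_rate_hz(10))
```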

Here is an example of the difference the -i option can make.  In a
test-program that does nothing other than repeatedly send itself a
signal, which it handles in an empty signal-handler, the first few
lines of an ordinary q-syscollect profile might look like this:

% time      self     cumul     calls self/call  tot/call name
 25.05     12.44     12.44     88.8M      140n      140n _spin_unlock_irq
 17.00      8.44     20.87     59.8M      141n      141n _spin_unlock_irqrestore
  8.54      4.24     25.11      231M     18.4n     18.4n __copy_user
  8.36      4.15     29.26         -         -         - break_fault
  7.52      3.73     33.00     89.0M     42.0n     42.0n __do_clear_user
  3.84      1.91     34.90     29.3M     65.1n      307n setup_sigcontext
  2.85      1.42     36.32     29.8M     47.6n      381n setup_frame

This profile looks rather suspicious, since it's unlikely that so
much time would be spent in routines which unlock a spinlock.

If we collect the profile by adding "-i", the first few lines of the
profile look like this:

% time      self     cumul     calls self/call  tot/call name
  6.49      3.69      3.69         -         -         - __copy_user
  6.16      3.50      7.18         -         -         - __do_clear_user
  6.08      3.45     10.63         -         -         - recalc_sigpending_tsk
  5.90      3.35     13.98         -         -         - rse_clear_invalid
  4.21      2.39     16.37         -         -         - break_fault
  4.00      2.27     18.65         -         -         - __dequeue_signal
  3.30      1.87     20.52         -         -         - setup_sigcontext

This looks a lot more reasonable already.  If, in addition, we apply
the kernel patch to get quantitatively accurate blind-spot-free
profiles, we get this result:

% time      self     cumul     calls self/call  tot/call name
  7.17      4.05      4.05         -         -         - __copy_user
  6.63      3.75      7.80         -         -         - break_fault
  6.31      3.57     11.37         -         -         - __do_clear_user
  5.80      3.28     14.65         -         -         - recalc_sigpending_tsk
  4.75      2.69     17.34         -         -         - rse_clear_invalid
  3.89      2.20     19.54         -         -         - __dequeue_signal
  3.23      1.83     21.37         -         -         - ia64_leave_kernel

The profile got refined some more (e.g., the time attributed to
break_fault is higher now) but the difference is much less dramatic
compared to the "no -i" vs. "-i" profile.  Still, to get accurate
results, we definitely recommend using the kernel patch.

* q-view

This program is a Scheme script (in the guile dialect) which takes the
profiles generated by q-syscollect (or Hans Boehm's qprof) and
generates gprof-style text output.  The syntax for invoking this program
is simply:

 $ q-view PROFILENAME

where PROFILENAME may be the name of any ".info" file generated by
q-syscollect.  A detailed explanation of each field is given in the
last part of the output produced by q-view.

Example:

 $ q-view .q/emacs-pid22148-cpu0.info#0

* q-dot

This program works exactly like q-view, except that it produces a
graphical version of the call-graph in the form of a "dot" file.  The
dot program can then be used to translate the call-graph into a number
of formats.  For example, the following command would translate the
call-graph for the above-mentioned emacs profile into a PostScript
file and render it with the GhostView (gv) program:

 $ q-dot .q/emacs-pid22148-cpu0.info#0 | \
    dot -Tps -Gcenter,rotate=90,size=10.75 | \
     gv

* q-grab-mappings

This is a utility which reads the list of mapped files for the process
with process id PID from /proc/PID/maps and then dumps the mappings in
the .info-format used by q-tools.  This is useful in combination with
the -m option of q-syscollect when profiling a short-lived program.
For example, to measure start-up overhead of a short-lived program,
you can invoke the program repeatedly (e.g., 1,000 times) while
running q-syscollect.  With the -m option, this will result in an
"unknown" .info file that has the 1,000 invocations aggregated.  You
can then run the program one more time and stop it once it has mapped
all its libraries.  Usually, this is most easily done with gdb by
setting a breakpoint at "exit".  When the program is stopped, look up
its PID, then do:

	 $ q-grab-mappings PID

and append the output to the "unknown" .info file.  You can now run
other tools such as q-view or q-dot and get a useful profile.
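For the curious, the core of what such a tool has to do is parse
/proc/PID/maps and keep the file-backed mappings.  Here is a
hypothetical sketch of just that parsing step (it does not reproduce
the exact .info output format that q-grab-mappings emits):

```python
import re

def file_mappings(maps_text):
    """Extract (start, end, path) tuples for file-backed mappings from
    the text of a /proc/PID/maps file.  Anonymous mappings, which have
    no pathname field, are skipped."""
    result = []
    for line in maps_text.splitlines():
        # Fields: start-end perms offset dev inode [pathname]
        m = re.match(r'([0-9a-f]+)-([0-9a-f]+) \S+ \S+ \S+ \S+\s+(/.+)$',
                     line)
        if m:
            result.append((int(m.group(1), 16), int(m.group(2), 16),
                           m.group(3)))
    return result

sample = ('2000000000000000-2000000000040000 r-xp 00000000 08:01 123'
          '  /lib/ld-2.3.2.so\n'
          '6000000000000000-6000000000004000 rw-p 00000000 00:00 0')
print(file_mappings(sample))
```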

Caveat: when invoking the short-lived test program, you should try to
avoid running any other short-lived programs at the same time, since
those would pollute the data gathered in the "unknown" profiles.
