Subject: "Bounty proposal: `preload'ing the desktop"


WHO

   Behdad Esfahbod
   http://behdad.org/
   behdad behdad.org


NAME

   preload - an intelligent readahead agent


SPEC VERSION

   1.0, June 2005


SUBMITTED TO

   Google Summer of Code, June 2005


MENTORING ORGANIZATION

   Fedora Core


SYNOPSIS

   preload is an intelligent agent that monitors applications that a
   user runs, and from analyzing this data, predicts what applications
   the user might run in the near future, and fetches those binaries
   and their dependencies (shared objects) into memory for faster
   startup times.  When run by the root user, preload takes into account
   probabilities that individual users may login now, and proceeds with
   prediction for faster login times.  preload is desktop-environment
   agnostic.


GOALS

   The goal here is to make application startup times shorter.  We
   exploit the fact that most users have a very static set of
   applications that they use on a regular basis.  For me for example,
   these applications are gnome-terminal, firefox, gaim, xmms, and skype.
   I start gnome-terminal and firefox as soon as I login and start
   gaim and xmms soon after.  Add to these the GNOME desktop
   processes like nautilus, gnome-panel, panel applets, and shared
   objects they depend on, and you get a few hundred megabytes of
   binary objects that can be happily prefetched into page cache
   when GDM is waiting for me to login.  preload makes this happen.
   That should drastically decrease the login time, and may have
   positive effects later on too.


PROJECT DETAILS

   There are two main parts to preload, the data gathering agent,
   and the predictor agent.  The two agents may or may not be
   implemented in the same process.

   The data gathering agent is started when a user logs in.  A good
   place to start it is to drop it into /etc/X11/xinit/xinitrc.d.
   When started, it will gather information about running
   applications regularly, say once a minute, load permitting.  The
   list of running applications is taken from the output of
   xlsclients(1), and for each application, the list of its maps is
   fetched from /proc.  The latter can be performed lazily and
   results would be cached:  There's no need to fetch maps for xmms
   every single minute, only once per run.  Anyway, this data is
   gathered and sorted out and dumped into a database (a binary file
   format) for use by the predictor agent.  There are a bunch of
   ways to make the database compact in size.  Data acquisition can
   be stopped when screensaver is detected to be running, to not
   credit the running applications too much during idle times.

   The predictor agent is an intelligent agent that upon reading the
   gathered data and the list of currently running applications,
   predicts which application is most probably going to be run first
   in the coming minutes.  We model the problem as a Bayesian
   system, and assign probabilities to each application, and then
   considering these probabilities and shared objects that each
   application depends on, we sort the shared objects in a priority
   list to be readahead, and perform that.  Enough care is taken for
   the agent to not get in the way of the user.  It monitors user
   activity, system load, data and page cache memory size, etc, before
   deciding to jump in.

   It is worth noting that the predictor agent has an exponentially
   fading memory, such that if a user's habits change and he starts
   using epiphany instead of firefox, after a few days the predictor
   understands that although firefox has been running every single
   session for a few months now, but that's part of history and
   epiphany is to be expected more often.

   The problem can be seen as an stochastic Markov chain whose
   states are members of the power set of the set of all applications
   that the user may run.  So for example the Markov chain would tell
   us that most of the time the user starts at the state corresponding
   to the set of running applications {gnome-session}, and from that
   state almost always goes to the state corresponding to the set of
   running applications {gnome-session, gnome-settings-daemon,
   metacity, gnome-volume-manager, gnome-panel, nautilus, eggcups,
   wnck-applet, gweather-applet, stickeynotes-applet,
   multiload-applet, clock-applet, ...}, and from that state, most
   of the time goes to the state that has the same set of
   applications union {firefox, gnome-terminal}, or some other times
   simply {xmms}...  If we can build this Markov chain, we can
   proceed with predictions, but training this Markov chain based on
   our training data is something better not tried.  This model is
   too complex for the problem and will surely overfit.  To overcome
   this shortcoming, we make independency assumptions and model
   every two applications in a four-state Markov chain.  So for
   example we have a Markov chain for the pair of applications
   firefox and gnome-terminal, another for firefox and gnome-panel,
   another for firefox and xmms, and each of these Markov chains
   models the correlation between the state of the two applications
   involved.  When we want to predict the probability that firefox
   is run, we simply multiply (using log-probabilities, we add) the
   individual probabilities from individual Markov chains that
   firefox forms with all other involved applications.  For example,
   if gnome-terminal is run but not firefox, the Markov chain
   modeling these two applications will tell us that most of the
   time when gnome-terminal is launched, firefox is launched soon
   after...

   The above description is a simplification of the model I have in
   mind.  In the more sophisticated model, it is not only a single
   probability that is assigned to edges in the Markov chains, but
   the time that takes for the state to change is fit with a
   probability distribution, so we can have an idea whether the
   state change is going to happen soon, and we take this into
   consideration when predicting.  The probability distribution that
   we fit most probably would be a memory-less distribution, Poisson
   is a good candidate.  Finally, the whole model is subject to
   change, should experiments show that it falls short.

   When preload is run as the root user (by GDM or another DM, or
   during the boot process), it takes into consideration the
   probabilities that each user may show up and login, and from
   there continues with normal prediction.  Again, being a Bayesian
   system, it doesn't pick one user, but all active users, with
   probabilities.  So for example if both active users of the
   machine run GNOME desktop and firefox, these applications and
   their dependencies will be fetched first, and after that
   applications used by the most active user of the machine.

   To summarize, preload is a generic readahead agent.  It helps the
   page cache warm up in a learnable manner.  A GNOME-specific
   readahead agent can be written without digging into machine
   learning, by simply looking at what gnome-session is going to
   run, but the approach proposed here works for all applications
   and desktops and even a mixture of them.  I believe that an agent
   like what is proposed here can drastically decrease the login time,
   should it have had a few seconds to fetch stuff before the user
   logs in (but that is almost always the case, be it the time
   the user is typing user/password, or simply the login screen is
   sitting there waiting for the user to come back.)  However whether
   it has any measurable effects after the login peek is passed and
   the cache is warmed up is something to wait and see.  But then again,
   decreasing the login time is enough a motivation to experiment.
   The bad news is that simply monitoring what applications are
   currently running is too small information to predict what the user
   is going to run next, but this level of genericity is one of the
   design goals and should not be sacrificed.


PROJECT SCHEDULE

   A prototype will be coded using Bash and Python scripts, using
   the basic model described here.  This should not take more than
   two weeks.  For the next five weeks the model will be improved,
   until the desired level of prediction is achieved.  Alternate
   models will be examined in this period.  After working out the
   model and not later than eight weeks from the beginning of the
   project,  an efficient implementation in C will be started and
   finished in the remaining time.


TARGETED BIO

   I am a Graduate student in Computer Science.  Since I passed a
   graduate Machine Learning course last Fall, I have been thinking
   about using ML techniques in Open Source software.  When thinking
   about ways to reduce the login time, I came up with the ideas
   presented here, and I think they are worth giving a try.  I have
   been programming since 1990, and developing Open Source software
   since 2000.
