

                   RRR Sorority Division presents 

                               PIMPPA 
 
                        v0.5.8 / 07.mar.2006


                    http://pimppa.sourceforge.net/


1. Legal junk
-------------
See file COPYING


2. What is this anyway
----------------------
Imagine the unfortunate situation where much more stuff
(and even more crap, spam, etc) comes from the Net than you are 
bothered to handle manually. Shall we say, several hundred
files daily, totalling hundreds of thousands of files
in the long run. A direct net connection and access to an
adequate news server easily provide such conditions.

PIMPPA is designed to help in such a case, to minimize the required
human interaction, striving for complete "hands free" operation,
where you could once a week (for example) drop by and check out
the already filtered and polished catch. The system is suited 
for all types of files, but it was first designed with pictures 
in mind. :)

PIMPPA is able to automatically fetch and process files from
newsgroups and FTP sites, while performing the necessary
decoding, duplicate discarding and further file processing tasks,
like classifying: files downloaded by pimppa tools can be 
automatically deposited to appropriate directories based on 
their filenames.

PIMPPA also provides nifty command line interfaces for 
file access and backup operations. In addition, there's a
Gnome GUI.


3. Requirements
---------------
See file SETUP.

From the file you can see that PIMPPA depends on a lot of external
stuff. That's because there is no reason to reinvent the wheel. I think
it is better to use the best applications available for specific tasks
(read: I'm lazy). When the external programs get better, 
PIMPPA gets better. If completely new programs surface 
(performing similar tasks) they could be easily integrated 
into the system.


4. Install & Upgrade
--------------------
See file SETUP. "Third floor, second door on the left". 
Isn't bureaucracy such a nice thing? In this case the standard
GNU file INSTALL is not of much use. :(


4.1 Uninstall
-------------
There's a script "uninstall_db" in "extras/" to destroy all
pimppa related stuff from MySQL databases. Use that, and
then remove the pimppa installation, e.g. use "make uninstall"
in the main directory of the source distribution. Distributions
other than the source may need their own uninstallation procedure.


5. How to use
-------------
Here are some examples and hints.


5.1 Getting started on newsgroup sucking
----------------------------------------
(If you dislike hot command line action, you might
want to use the GUI "bowser" instead of following these
instructions. You'll still have to do things in the
same order: Settings -> Preferences -> Add some fileareas, 
add/modify news server, add some newsgroups. After 
that you can probably leech.)

To leech stuff from the news, you should first tell
the system the hostname of your primary newsserver.

shell> pnewsrv -s <your.ultrasmutnews.com> -i 1

Then we need to set up some destination (incoming!)
fileareas to receive the downloaded files.
Use the command:

shell> pnewarea -i -n <area_name> -p <directory_path>

As you now have a couple of fileareas, assign some newsgroups
to them. Any number of newsgroups may be assigned to a single
area.

shell> pnewgrp -n <newsgroup_name> -d <area_name>

Ok, after this tedious configuration phase we can finally
leech.

shell> pleech -v
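
The configuration steps above can also be sketched as one small script.
This is only a sketch: the helper names "run" and "setup_news", the
DRY_RUN variable and the example hostname, area and group are made up
for illustration. Only the pnewsrv, pnewarea and pnewgrp switches come
from this file. The "mkdir -p" is there because pnewarea does not create
the directory for you (see chapter 7).

```shell
# Dry-run wrapper (hypothetical helper): with DRY_RUN=1 the command is
# only printed, otherwise it is executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# setup_news <server> <area_name> <directory_path> <newsgroup_name>
# Registers the primary news server, one incoming filearea and one
# newsgroup that is directed to that area.
setup_news() {
    server=$1
    area=$2
    dir=$3
    group=$4
    run pnewsrv -s "$server" -i 1
    run mkdir -p "$dir"
    run pnewarea -i -n "$area" -p "$dir"
    run pnewgrp -n "$group" -d "$area"
}
```

Running "DRY_RUN=1 setup_news news.example.com incoming /stuff/incoming/ alt.test"
just prints the four commands, so you can check them before doing it
for real and following up with "pleech -v".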

If it succeeds, you should have several files decoded and
added to the destination fileareas. Most of them will be crap, 
but let's pretend that there's some famous Goatlord-stuff
(pictures) called "blah*.jpg" there you wish to keep.

There's not yet any location for Goatlord's material,
so you would want to create a new filearea for it,
a "non-incoming" area which is meant only for quality stuff.

shell> pnewarea -n goatlord -p /stuff/goatlord/

Now you have a new filearea called "goatlord". Enter the
directory where you got the "blah*" files and execute
"pmv blah* goatlord". Files will be moved, and in future all
files fetched by "pleech" matching a special assign pattern
(see 9.2) created for "blah*" will be automatically moved to the 
"goatlord" filearea, and not to the newsgroup's default
destination.

If the rest of the files you got were junk, you can delete them
(or to be really sure that you'll never see them again in
the future, pmv them to 0DOOM). See chapter 7 for details. 

Now it might be a good idea to make "crond" execute 
"pleech -q" nightly. :) In that case, remember to set
the crontab variable HOME correctly to point to a directory 
where there's a .my.cnf to provide you access to the db.
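
A crontab for this could look roughly like the output of the sketch
below. The function name, the 03:30 schedule and the example home
directory are arbitrary choices, and it assumes that "pleech" is on
the PATH that cron uses on your system.

```shell
# pleech_cron_entry <home_dir>
# Prints a crontab fragment: HOME points to the directory holding
# .my.cnf, and "pleech -q" runs every night at 03:30 (arbitrary time).
pleech_cron_entry() {
    printf 'HOME=%s\n' "$1"
    printf '30 3 * * * pleech -q\n'
}

pleech_cron_entry /home/goatfan
```

Paste the two printed lines into your crontab with "crontab -e".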


5.2 Files from other sources
----------------------------
You can just normally "mv" the files to some directory 
assigned as a PIMPPA-filearea. Then use "padopt" to add them 
to the PIMPPA database. After that, those files are treated just
like any that were sucked from the news. Note that adoption will
do some duplicate checking and may discard files.

But suppose you download a lot of unsorted crap to a single
directory, which is not a PIMPPA filearea. Then you can use

shell> padopt -m <area_id>

where <area_id> is the default destination location for
these files. WARNING: all unsuitable files in the
current directory will be deleted (dupes, invalidly named, etc)
and the rest are moved according to known assign patterns, or to
the default area.


5.3 Viewing & backing up files
------------------------------
See chapter "Scripts" for quick descriptions of some
viewing interfaces.

When your harddisk starts to burst with all the stuff you have
downloaded, it might be time to store it elsewhere. See
chapter "Utilities" for a util called "pbackup".


5.4 Manual duplicate checking
-----------------------------
While surfing the web, you encounter some tasty files, 
named "blood_*.jpg". Unfortunately, you can't remember if
you already have them. (It might be hard sometimes).
No problem, the util "pf" is just for that purpose. Use

shell> pf -p blood%

and you get the result rather quickly even if you have over
100000 files on your PIMPPA system.

If the files are online, you could view them just as speedily
by using "pv_name", e.g.

shell> pv_name blood%
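
If you keep typing shell wildcards out of habit, a tiny helper can
translate them to the SQL style that "pf" and "pv_name" expect. This is
a sketch with a made-up function name. Note that a literal "_" in a
filename is itself a single-character wildcard on the SQL side, so the
match may be slightly broader than the shell pattern was.

```shell
# glob_to_sql <pattern>
# Translates shell wildcards to SQL LIKE wildcards: '*' -> '%', '?' -> '_'.
glob_to_sql() {
    printf '%s\n' "$1" | tr '*?' '%_'
}

glob_to_sql 'blood_*.jpg'
```

It could then be used as: pf -p "$(glob_to_sql 'blood_*.jpg')"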


5.5 Changing preferences
------------------------
PIMPPA stores most of its preferences in table p_misc,
reminiscent of the infamous registry on windoze. Utilities
look up (key,value) pairs in the database and use them
if found. Otherwise compile time defaults are used.
Commandline values override both.

You can set preferences using "pcfg -k <key> -d <value>"
and view them by just "pcfg". You can also use the GUI,
see the next chapter.
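
The lookup order described above (commandline first, then p_misc, then
the compile time default) can be pictured with a few lines of shell.
This only illustrates the precedence, it is not how the utilities are
actually written, and the function name "resolve" is made up.

```shell
# resolve <commandline_value> <p_misc_value> <compile_time_default>
# Prints the first non-empty value, in order of precedence.
resolve() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    else
        printf '%s\n' "$3"
    fi
}

# Example with CFG_NEWSWAITSECS (compile time default 2):
resolve "" "" 2     # nothing set anywhere: default is used
resolve "" 5 2      # key found in p_misc: database value is used
resolve 9 5 2       # given on the commandline: overrides both
```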

Keys suitable for modification are:

CFG_BOWSERSINCE

    Value: Number. On startup, bowser displays the files that
    arrived within this many days. Default: 1

CFG_FILETYPES

    Value: String. Extended RegExp pattern of allowed filetype 
    extensions. When downloading newsgroup articles, we try to 
    match any of these extensions from the Subject: line, and 
    find out the filename that way. Example:
    
    \.jpg|\.png|\.gif|\.mpg|\.avi|\.asf|\.rar|\.r[0-9][0-9]|\.s[0-9][0-9]

CFG_MINASSIGNNAMELEN

    Value: Number. Filename must have this many characters
    before the extension for the assign pattern (see 9.2) to be 
    created/noted at all. Short patterns tend to go wrong often. 
    Default: 4 (means: don't create assign pattern for e.g. file a10.jpg)
    
CFG_NEEDNUMBERS

    Value: Number. Require this many numbers in all filenames,
    before the extension. Files having less will be skipped. 
    (example: blah123.r01 == 3 numbers) 
    Default: 0. My suggestion: 2

CFG_NEEDOTHERS

    Value: Number. Require this many non-number characters in all
    filenames, before the extension. Files having less will be skipped.
    (example: blah123.r01 == 4 others)
    Default: 0. My suggestion: 2

CFG_NEWSWAITAFTER

    Value: Number. After leeching this many articles, wait 
    CFG_NEWSWAITSECS (to allow the server to catch its breath). 
    Default: 10

CFG_NEWSWAITSECS
    
    Value: Number. After leeching CFG_NEWSWAITAFTER messages,
    wait this many seconds. Default: 2
    
CFG_NOSPACE
    
    Value: Boolean. Do NOT accept any filenames containing spaces. 
    1 == TRUE, 0 == FALSE. Default: 0. My suggestion: 1.

CFG_STRICTMD5
    
    Value: Boolean. In case of MD5 collision, delete the newer file. 
    1 == TRUE, 0 == FALSE. Default: 1.

CFG_TMPDIR
    
    Value: String. The directory where pimppa keeps its temporary
    files, like uucoded material downloaded from the news. 
    This directory should usually have at least several hundred
    megabytes of free space. Default: "/tmp/"

CFG_TOLOWERCASE
 
    Value: Boolean. Convert incoming filenames to lowercase. 
    1 == TRUE, 0 == FALSE. Default: 1

CFG_VIEWER

    Value: String. Use this viewer executable/script to view files 
    in bowser and from scripts. It must view all files in the current 
    directory. To use e.g. feh, set it to "feh -F -Z -d -S filename -r ."
    Default: gqview
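
To get a feeling for what the filename related keys mean, here is a
rough shell approximation. The real checks live in the pimppa sources
and may differ in details (for example in case sensitivity), and all
function names below are made up. FILETYPES is just the example
pattern from CFG_FILETYPES above.

```shell
# Part of the filename before the last extension,
# e.g. "blah123" for "blah123.r01".
stem() {
    printf '%s\n' "${1%.*}"
}

# Number of digits before the extension (compare with CFG_NEEDNUMBERS).
count_numbers() {
    digits=$(stem "$1" | tr -cd '0-9')
    printf '%s\n' "${#digits}"
}

# Number of non-digit characters before the extension
# (compare with CFG_NEEDOTHERS).
count_others() {
    others=$(stem "$1" | tr -d '0-9\n')
    printf '%s\n' "${#others}"
}

# Example CFG_FILETYPES pattern, as an extended regexp.
FILETYPES='\.jpg|\.png|\.gif|\.mpg|\.avi|\.asf|\.rar|\.r[0-9][0-9]|\.s[0-9][0-9]'

# Succeeds if a Subject: line contains one of the allowed extensions.
subject_has_filetype() {
    printf '%s\n' "$1" | grep -Eq -- "$FILETYPES"
}

count_numbers blah123.r01
count_others blah123.r01
```

With these, "a10.jpg" has the stem "a10", which is 3 characters and
thus shorter than the CFG_MINASSIGNNAMELEN default of 4.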


5.6 Advanced: Examples for a collector
--------------------------------------
In the wide world there might be just a few series of files you want,
but you can't be bothered to seek them out personally. There are
two types of REGEXP rules that can be used (they are stored in the
p_rules table and can be defined in "bowser" via the Preferences menu).
One type defines patterns that must be matched in filenames in
order to accept/reject the files, and the other type defines
patterns to be matched in message subjects, for similar
purposes.

For usage instructions, see section 9.1.


5.7 Advanced: Using suckkillfile
--------------------------------
You can use a suckkillfile with pimppa. In a manual installation,
pimppa will typically look for the file in "/usr/local/share/pimppa/".

When leeching article headers, suck will download only those
headers which pass the killfile. Others are discarded, and
pimppa won't ever see them. To put it another way, suckkillfile 
puts a lot of potential spam-prevention power at your disposal.

See "man suck" for details of how to use the killfile.

    
6. GUI
------
PIMPPA has a Graphical User Interface called "bowser", 
which lets you download newsgroups and perform various 
operations on your fileareas, files, assign patterns, 
file types, newsgroups and other settings. The GUI usage 
should be self-explanatory. Most operations (like View/Delete/Move) 
are accessed by pressing the right mouse button on a list window.

NOTES:

1) GUI uses "gqview" to view files. It can be changed
   from Settings/Preferences/Miscellaneous. 
2) Warning: "View", "Delete" and "Kill" operations automatically
   affect all selected items! They do not ask for confirmation.
3) When downloading newsgroups from multiple servers,
   a selected group is always leeched from all servers it's
   defined for.

Most of the really powerful operations of PIMPPA are 
done using plain shell utilities and scripts - without 
fancy user interfaces. :) So after you get bored with 
the GUI (it's a bad hack), take a closer look at the
commandline stuff.


6.1. Setting preferences from GUI
----------------------------------
All fields in prefs reflect their respective SQL tables
and table contents.

With a little thinking and reading this file you should 
be able to get 'em right. Remember that you have to
restart the GUI for some of the "Miscellaneous" preferences
to take effect.

NOTES:

1) When adding fileareas with the GUI, remember
   to include "/" at the end of the directory path.

2) Remember to set all areas which newsgroups are directed
   to as INCOMING.


7. Utilities
------------
    There's plenty of these little bastards. The most important 
    are probably pleech, pmv, prm and padopt.
            
    Some of these utilities have been integrated to the GUI, 
    but unfortunately with no ability to adjust their parameters.


    padddir (obsolete at the moment, see Note)
    -------
    This adds the contents of the current directory to filearea 0.
    It can be used to prevent the stuff there from bothering you
    ever again (they will be skipped if re-encountered in future).

    Usage:  padddir [-v]             
    
            -v              Verbose execution

    Note: Perhaps a better way to do this would be to move
    the files to some 'dump' filearea, add them there with
    'padopt' and then kill them with 'prm'. That would produce 
    a couple of assign patterns instead of keeping all the filenames 
    in the database. Result is more general: all similarly named 
    files will be deleted in the future. You can also move
    unwanted files to 0DOOM. This keeps the MD5 checksums of
    the files as well (preventing renamed versions of the
    same files surfacing again).


    padopt
    ------
    This goes through all the PIMPPA fileareas and adds the
    files that are found in the dirs, but not in the database, 
    to the database.
    
    Usage:  padopt [options]
   
            -a <pattern>    SQL area name pattern to operate on, def=all
            -m <area_id>    Move files from current directory (unless
                            dest. for file is known, move to <area_id>)
            -v              Verbose execution
    
    -m switch moves files matching assign patterns from current 
    directory (which shouldn't be any filearea!) to fileareas and
    the rest to area_id. WARNING: Unsuitable files and duplicates 
    are deleted!

    Note: padopt updates assign patterns.


    passign
    -------
    This browses through all your files and automatically recreates
    the "assign patterns" for them. See section 9.2 for explanation.

    Usage:  passign [options]

            -c              Delete all previous patterns. Does not
                            affect patterns with destinations 0 or -1.
            -f <filename>   Show current assign for <filename>
            -r              Refresh (recreate) all assign patterns
            -0              Delete all assign patterns with target area 0.
            -n              Delete all assign patterns with target area -1.
            -t <area_id>    With -f, set a new target area id instead

    Note: Assign patterns are not created or used for 
    contents of AREA_INCOMING or AREA_NOASSIGN areas.


    pbackup
    -------
    The backup util.

    Files which have "file_backup==0" in database [or the value specified
    with "-r" on the commandline] will be included in the backup.
    Backed-up files will be given a new file_backup ID [or the one
    specified with "-r"] ... 

    Files will be included until media size is reached. Media
    size can be specified on the commandline (in Megabytes).
    In normal operation the next backup will start from the
    area where the last backup operation ended.

    A shadow directory of the backed-up files is created
    in a specified location ("-p path") along with a filelist.
    Default is "/tmp/test/". You can then create an ISO-image 
    out of the created directory structure, stream it to 
    tape or whatever.
    
    e.g. after pbackup
    shell> cd /tmp/test
    shell> mkisofs -J -a -d -D -L -R -r -f -o /tmp/cd.img -V JUNK03 .
    shell> cdrecord -data /tmp/cd.img
    
    And to get rid of the backed-up files from HD, use

    shell> pclean -i <backup_id>
   
    Note1: Without "-n", this utility tries to fill the backup media 
    to the brim. In any case, it can't know which files should 
    go together. For example, disk archives belonging to the 
    same product may end up on different backups.

    Note2: After using this utility, it's suggested to do at least
    "du -L" in the destination directory to see that everything
    went ok.
   
    Note3: Contents of INCOMING areas will not be backed up.
    
    Usage:  pbackup [options]

            -n             Be nice, stop at first file not fitting on media 
            -p <path>      Path to make the shadow directory into
            -r <ID>        Specify an explicit backup ID number
            -R             Randomize area order
            -c <area_id>   Skip this file area, can be entered many times
            -s <size>      Media size in megabytes (1MB==1024*1024 bytes)


    pcfg
    ----
    A simple command line tool to configure pimppa, that is,
    to modify p_misc table. You can also use "bowser" or
    "mysql" directly.
    
    Usage:  pcfg [options]

            -k <key>       MUST, the configuration key
            -d <data>      MUST, the value associated with this key
    
    Without options it prints the current setup. See 5.5
    for known keys.


    pchkfn
    ------
    Checks if a filename can be accepted into the system (including
    rules, duplicate & assign pattern check).
    
    Usage:  pchkfn [options] <filename>

            -a <area_id>    An INCOMING area whose destination
                            regexp patterns are used
            -s              Operate as kill program for suck
            -n              Nazi mode, returns nonzero if no positive 
                            destination area for file is already known

    NOTE: -s is obsolete. PIMPPA won't call "pchkfn -s" anymore,
    though you could use it if you run suck by hand with
    your own killfile, with PROGRAM=pchkfn -s.

    Returns 0 if the filename is ok.


    pclean
    ------
    Performs various delete operations. Use with caution!

    Usage:  pclean [options] 
    
            -a <area_id>    Purge duplicates from <area_id>
            -d              Delete files found in db from the current dir
            -f              Delete files which failed integrity check 
            -i <backup_id>  Delete files associated with backup id
            -n <area_id>    Delete files from <area_id> which have no numbers
                            in filenames.
            -o              Delete all offline files without backup id
            -p              'Purify' current dir
            -s              Simulate only, don't delete anything
            -t              Remove trash (files with file_area=-1)
            -z              Zuper slaughter, immense
            -v              Verbose execution
    
    -a : Deletes those files from <area_id> which exist on
         other areas. Works on filename basis only.
        
    -d : Deletes ALL files from the current dir which are found in the
         database. Don't use this on actual pimppa filearea dir.
        
    -f : Physically deletes all files which have failed the integrity
         check. Files will be marked as FLAG_OFFLINE and replaced later
         if the file is re-encountered e.g. by "pleech".

    -i : To be used after a successful backup operation. It deletes all
         files associated with a given backup_id, and marks them as
         FLAG_OFFLINE in the database.

    -n : Delete files from <area_id> which have no numbers in filenames

    -p : Checks the current directory (which shouldn't be a filearea!)
         against your current file validity rules and assign patterns.
         Files which have negative assign patterns or violate the
         rules (e.g. NEEDNUMBER, see "src/pimppa.h") will be deleted.

    -s : Simulate execution only, don't change anything. Most useful
         with '-v'.
        
    -t : Deletes file-entries from database having file_area=-1. Such files
         can occur if you move files to 0DOOM instead of deleting them.

    -z : Deletes offline files from areas marked AREA_INCOMING,
         creates negative assign patterns for ALL of those files,
         and deletes the files from the database.


    pdesc
    -----
    Used to give text descriptions to files.
    
    Usage: pdesc <files> <"Desc">

           <files>          The files you wish to describe in current dir
           <"Desc">         The description you wish to give them

    e.g. 
    shell> pdesc goat*.jpg "Happy Goats"


    pf
    --
    "Pimppa Find". Searches for files from the database. By
    default it looks for filename-based patterns.
    Requires SQL style wildcards (use '%' for '*' and '_' for '?').

    Usage:  pf [-dpv] <pattern>

            <pattern>       Filename pattern
            -d              Try to match file descriptions instead
            -p              Print pathnames as well
            -v              Verbose, print file descriptions

    The printing format is '[path/][filename] [backup_id if nonzero]'.
   
   
    prm
    ---
    Deletes files from the filearea matching current dir. Both
    the physical files and the records in the p_files -table 
    will be deleted. A new assign pattern will be created 
    for each file having -1 as the destination area, so that 
    all similar files will be killed when encountered in 
    the future.

    Usage:  prm [-nv] <files>
  
            -n              Nice kill. Don't create negative assign
                            patterns or delete the file entry from database, 
                            just mark the file as offline and delete the
                            actual file.
            -v              Verbose execution

    Use prm with caution. :---)

    Note: If you wish to keep the MD5 checksums of the files,
    preventing renamed copies of the same files from entering
    your system again, you should use "prm -n" or move the files
    to area 0DOOM, e.g. "pmv <files> 0DOOM".


    pleech
    ------
    This is the command line downloading tool. It fetches files from
    the newsgroups or FTP sites you have defined. In case of news,
    promising articles are downloaded, then decoded, duplicates are
    re-checked, and suitable files are assigned and added to their 
    appropriate fileareas if they match an assign pattern. Otherwise 
    they will be entered to the default destination area. FTP-downloading
    is similar.

    By default, pleech tries all defined newsgroups on all the
    servers you've defined them on.
    
    Usage:  pleech [options]
    
            -a <n>          Wait After n messages, def=10
            -f <urlfile>    Read (ftp-url, target area name) pairs from file,
                            leech the url recursively
            -g <pat>        Newsgroup name SQL pattern to match, def=all
            -H <host>       NNTP server name SQL pattern, def=all
            -i              Insert msg subjects into database as file descs
            -k              Keyword hunt. Accept only messages whose subject
                            line matches a keyword pattern in p_rules.
            -l              Lenient article selection while leeching news
                            (leech everything, select files after decoding)
            -n              Nazi behaviour. Try to skip all dl'ed files which 
                            don't match an already defined assign pattern.
            -q              Quiet operation (don't print newsleech BPS etc)
            -r              Restart newsgroup(s)
            -s              Sloppy download mode (take & accept everything)
            -v              Verbose output
            -w <secs>       Wait secs after [-a n] messages, def=2

    pleech depends on "suck" and "uudeview" for news, and "fget"
    for ftp. They must be on the command execution path.
    NOTE: fget *NEEDS* to be patched (patch in "patches/" dir)!
    [The ftp support is most likely deprecated anyway.]

    Newsleech example:

    shell> pleech -v -g %supreme.moose%
    Verbosely downloads all active newsgroups containing 
    string "supreme.moose" in the group name.
   
    FTP use:
    
    With -f, the urlfile is a text file containing one "URL AREA_NAME"
    pair per line, e.g.

    ftp://ftp.gigaturd.org/pub/pics incoming
    ftp://user:pass@grindcore.com/core incoming2


    pmarkoff
    --------
    Goes through your fileareas and marks offline files
    (not available in their directories). Such files are 
    skipped by most operations, like viewing and backing up.

  
    pmd5sum
    -------
    This is used to create/print/lookup/scan MD5 checksums of
    the files in system. Some of the operations below affect 
    only online files.

    Usage: pmd5sum <options>

    -b <id>   Create md5sums for backup id <id>
    -c        Create md5sums for online files
    -l <sum>  Look up files matching a given hex MD5 sum
    -p <pat>  Print sums of files matching an SQL filename pattern
    -s        Scan whole database for collisions
    -v        Verbose operation
    -w        Wipe DB from MD5-colliding files (in the case of
              collision, the oldest file is kept).

    Other usage is self-explanatory except that of -b. To use it,
    the backed-up files must be reachable at their original paths.
    You can either move the files back online or, more easily, if
    you made the backup with pbackup and all your fileareas reside
    under "/stuff", do something like "mv stuff old ; ln -s /cdrom /stuff",
    then do "pmd5sum -b <id>" and after that replace the
    softlink "/stuff" with the original directory.

    
    pmv
    ---
    PIMPPA mv. Moves files from the filearea matching current directory 
    to some destination area. The matching area will be found out
    by using getcwd() and p_areas -table.

    Usage:  pmv <files> <destination_area>

    Example:
    shell> pmv fish*.jpg aquarium
    
    would move fish*.jpg from current filearea to a filearea called 
    'aquarium'. The actual destination directory location is 
    fetched from the database.

    NOTE: Moving files around with just "mv" will mess up the database,
    e.g. pimppa won't know the files have changed their location.


    pnewarea
    --------
    Adds a new filearea to the database. Note that you must create
    the actual directory yourself with "mkdir". 

    Usage:  pnewarea -n <name> -p <dirpath> [options]

            -a             Mark area as AREA_NOASSIGN
            -c             Context number for this area, default: 0==global
            -i             Mark area as AREA_INCOMING
            -I <id>        Suggest id number for the new area
            -n <name>      Area name, MUST
            -t             Mark area as AREA_NOTRANS
            -T             RegExp dupecheck pattern, default $
            -p <dirpath>   Area directory path, MUST

    Example:
    shell> pnewarea -n pasture -p /stuff/pasture/
    
    would add a new filearea with name "pasture" and 
    directory path as "/stuff/pasture/". 


    pnewgrp
    -------
    Adds a new newsgroup to the system. Every group needs some 
    filearea as a default destination for the decoded files.

    Usage:  pnewgrp -d <dest_area> -n <group_name>

            -d <dest_area>  Destination area name, MUST
            -n <group_name> Newsgroup name, MUST
            -i <server_id>  Newsserver, default=1

    -i is used to tell which server this group is
    downloaded from. If you wish to download same group
    from different servers, just re-add it with all
    the server id's you want. Server 1 is the primary (default) 
    server. 


    pnewsrv
    -------
    Tool to add/modify newsservers known to the system.

    Usage:  pnewsrv -s <server_name> [options]

            -s <server_name>   Server name, MUST
            -u <user>          Optional username
            -p <pass>          Optional password
            -i <server_id>     Suggest newsserver id

    <server_id> is the unique ID of this server. To set a
    newsgroup to be leeched from a certain server, use 
    this number. Server 1 is the primary (default) server.

    
    ptest
    -----
    File integrity checker. Searches for untested files and
    tests them. Sets the database flag "file_integ" accordingly.
    The utils used for integrity checking of different filetypes 
    are defined in PIMPPA SQL table "p_types" and can be 
    modified from the GUI "bowser". If column "type_testokstr"
    is an empty string, the test command return value will
    be used instead. 0 == FILE OK, anything else == FAILED.

    Usage:  ptest [options]

            -n              Nazi behaviour. Delete files which fail
                            the integrity check. Use with CAUTION.
            -v              Verbose execution

    Note: "sql/example_types.sql" -file has example configs 
    for some integrity checkers and file transformers.
    

    ptrans
    ------
    File transformer. This can perform tasks like file optimization,
    conversion or perhaps add/remove spam (yuk). It searches for
    integrity-test passed files which have not yet been transformed 
    and tries to transform them. The utils for transformation are 
    defined in PIMPPA SQL table "p_types" and can be modified from 
    the GUI "bowser". 
    
    If some filearea has "area_flags" set to AREA_NOTRANS
    or AREA_INCOMING, the contents of the area will be skipped.

    Usage:  ptrans [-v]

            -v              Verbose execution

    Notes: 
    1) Successful transformation will usually modify the
       md5-checksum of the file.
    2) "sql/example_types.sql" -file has example configs 
       for some integrity checkers and file transformers.


8. Scripts
----------
These are just shell scripts, so you can easily edit 
them with a text editor to suit your needs.

Note: the viewing scripts are currently for picture material,
and default to "gqview" as the picture viewer. To change it,
modify p_misc key CFG_VIEWER (see instructions in 5.5).

All the viewing scripts affect all files anywhere on the 
PIMPPA system (unless they are offline or on incoming areas!). 


    p_areas
    -------
    Lists all your fileareas, their paths and their area ID numbers.
    
 
    p_con
    -----
    Prints out the contents for a given backup volume id
    in "Area | Megs" format.

    
    p_contexts
    ----------
    Prints out the contexts you have defined along with their id's and descs.
    

    p_groups
    --------
    Prints all newsgroups and their target areas.
    
    Usage: p_groups <options>

            -a      Print only active groups
            -d      Print only disabled groups

    p_gtog
    ------
    Toggle active/disabled status of newsgroups matching
    given SQL newsgroup name pattern. Disabled newsgroups
    are not leeched by pleech or bowser.

    Examples: 
    shell> p_gtog %humbug% (toggles all groups with humbug in the name)
    shell> p_gtog alt.test (toggles just group "alt.test")
   
    
    p_leechctx
    ----------
    Downloads all newsgroups whose target areas belong to the
    given context.

    Usage: p_leechctx <context_id>
    

    p_loc
    -----
    Finds out backup id's containing files from a given filearea,
    specified by area name (or by a pattern containing SQL wildcards).


    p_maint
    -------
    Useful to run daily from crond after "pleech". It just 
    performs "padopt", "ptest" and "ptrans". 


    p_prunejpeg
    -----------
    This script can be activated to be automatically run after news 
    decoding to delete too small jpegs. See Section 9.7 for details.


    p_rename
    --------
    Renames files on the database and on the disk.

    Usage: p_rename <oldname> <newname>
 
    Script submitted by A.E.
    
 
    pv_desc
    -------
    Views files matching a given SQL format file description. 


    pv_last
    -------
    Views files that arrived since the last run of this script
    (Bowser's "Extras/View since last" just executes this script.)


    pv_name
    -------
    Views files matching a given SQL format filename pattern.

    E.g. to display your kitten collection:
    shell> pv_name pussy%


    pv_since
    --------
    Views files which are newer than given number of days.


    pv_sql
    ------
    Views files by any suitable SQL WHERE statement. MySQL 
    manual is a suitable starting point if you don't know SQL.


    rc2sql
    ------
    Converts and inserts a suck .rc file (newsgroup list)
    to 'p_groups' table.


    viewdeep
    --------
    Actually has not much to do with the PIMPPA system. If you have
    used 'wget' to mirror some website, but can't be bothered to click
    around the zillion directories, you can for example use
    'viewdeep "*.jpg"' to get quick access to all jpg files in the
    current dir and all its subdirs.


9. Some behaviour notes
-----------------------
How it works? What it eats? 


9.1 RegExp file classification rules
------------------------------------

A) Filename based rules (r_type == 0)

In table "p_rules" you can specify (with "bowser") filename
matching RegExp patterns, and tell pimppa to move the matching
files to areas of your choice (-1 meaning skip/discard, in r_target). 
The rules will be checked before the assign patterns (as the rules are 
human generated and assign patterns usually made by pimppa,
and I trust the user more). 

With this mechanism, you can for example always redirect
jpegs to one area and gifs to another, or delete
particularly annoying and re-occurring file series.

Note that pleech option "-n" can be used to specify operation 
where only such files are downloaded/kept that match positive,
known rules.

Examples:

1) Rule (r_rule="goat.*jpg",r_context=0,r_target=1) would 
   accept all files named "goat*jpg" globally (r_context=0) and 
   send them to filearea having area_id 1.
2) Rule (r_rule=".*hairy.*",r_context=1,r_target=-1) would 
   skip/discard (target=-1) all files having string "hairy" in the
   filename, if they were meant for any area belonging to context 1.

B) Keyword rules (r_type == 1)

The other rule type is the subject keyword rule. These rules
define patterns that must be matched on the message's subject
line for the message to be accepted. This rule type can also
be used to automatically discard matching messages; in
that case set a negative value for r_target (0DOOM). The pleech 
option "-k" can be used to specify operation where only such 
messages are accepted that match some specified *positive* 
target rule.

Examples: 

3) Rule (r_rule=".*bear.*",r_type=1,r_target=1) would
   accept all messages that have "bear" in the subject line,
   providing that the rest of the checks pimppa does pass as well.
4) Rule (r_rule=".*llama.*",r_context=0,r_type=1,r_target=-1) 
   would globally reject all messages having "llama" on
   the subject line.

Both rule types follow contexts. Context 0 is the 
global context (all areas). Otherwise the rule will be
applicable only on newsgroups connected to filearea
having the same context.

See also section 12: "context".
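
The rule lookup described above can be sketched roughly as follows.
This is only an illustration, not pimppa's actual code: the names
Rule and find_rule are made up, whole-string matching is assumed 
(pimppa may match differently), and rule 3 is given r_context=0 
because the example does not state one. Preferring a context-specific 
rule over a global one follows the "context" glossary entry.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    r_rule: str     # the regexp itself
    r_context: int  # 0 = global context
    r_type: int     # 0 = filename rule, 1 = subject keyword rule
    r_target: int   # area_id; negative means skip/discard


def find_rule(rules, text, r_type, area_context):
    """Return the rule that applies to text, or None if nothing matches.

    Assumptions: whole-string matching, and a rule with a matching
    nonzero context wins over a global (context 0) rule.
    """
    matching = [
        r for r in rules
        if r.r_type == r_type
        and r.r_context in (0, area_context)
        and re.fullmatch(r.r_rule, text)
    ]
    # Stable sort: context-specific rules first, table order otherwise.
    matching.sort(key=lambda r: r.r_context == 0)
    return matching[0] if matching else None


# The four example rules from this section.
rules = [
    Rule("goat.*jpg", 0, 0, 1),
    Rule(".*hairy.*", 1, 0, -1),
    Rule(".*bear.*", 0, 1, 1),    # r_context assumed to be 0
    Rule(".*llama.*", 0, 1, -1),
]
```

With these rules, find_rule(rules, "goat123.jpg", 0, 5) picks rule 1 
(target area 1), while "hairy_goat.png" is discarded only when the 
destination area belongs to context 1.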


9.2 Assign patterns
-------------------
If pimppa didn't find a matching rule from those you've set, 
"pleech" (and bowser Leech, which uses the same routines) uses
assign patterns to decide where to store files whose filename 
matches some pattern in the database. 

For each filename in the system there can be a (pattern, context, dest_area)
triplet in "p_assign" -table, telling where similarly named files should 
be moved in the future. (See chapter 12: "context"). Naturally all files 
belonging together should map to the same pattern for this scheme 
to do any good.

The pattern is constructed from the filename as follows:

1) All numbers are converted to character '0'.
2) Letters after the last number and before the last dot ('.')
   are considered as indexes, and converted to '1' IF there's
   no more than two of them.
3) After last '.', all alphabetical letters stay intact.

E.g. filename       =>  pattern
     --------           -------
     "ab-103-h.jpg" => "ab-000-1.jpg"
     "ab-115-z.jpg" => "ab-000-1.jpg"
     "ab-ccc-1.jpg" => "ab-ccc-0.jpg"
     "ab-1-1ab.jpg" => "ab-0-011.jpg"
     "ab-1-def.jpg" => "ab-0-def.jpg"
     "ab-01a-2.jpg" => "ab-00a-0.jpg"

The assign patterns are by no means foolproof. One reason
is different files being created around the world with same 
names. However, it's fairly good with really big series 
having some uncommon filename prefix like "gwo-bah-???.zip".
But it fails with files named imaginatively like 
"image001.jpg" which surface on every corner.
    
Example: If you have a pattern "bozo_000.png" pointing to area 5, 
"pleech" would send files named "bozo_123.png" and "bozo_124.png" 
to area 5, but files "bozo_abc123.png" and "bozo_100.jpg" would 
end up on the default destination area.
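
The three construction steps can be sketched like this. It is a 
reading of the rules above, not the real pimppa routine: the name
assign_pattern is made up, only ASCII letters are treated as index 
letters, and digits in the extension are also turned into '0' 
(the rules do not say what happens to them).

```python
import re


def assign_pattern(filename):
    """Build an assign pattern from a filename (illustrative sketch)."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        # No dot at all: treat the whole name as the stem.
        stem, ext = filename, ""

    # Step 1: every digit becomes '0'.
    stem = re.sub(r"[0-9]", "0", stem)
    ext = re.sub(r"[0-9]", "0", ext)  # assumption, see above

    # Step 2: one or two letters after the last number are indexes.
    last_digit = stem.rfind("0")  # all digits are '0' by now
    if last_digit >= 0:
        tail = stem[last_digit + 1:]
        if 1 <= len(re.findall(r"[A-Za-z]", tail)) <= 2:
            tail = re.sub(r"[A-Za-z]", "1", tail)
        stem = stem[:last_digit + 1] + tail

    # Step 3: letters after the last dot stay as they are.
    return stem + dot + ext


for name in ("ab-103-h.jpg", "ab-1-1ab.jpg", "ab-1-def.jpg"):
    print(name, "=>", assign_pattern(name))
```

Run on the filenames of the table above, the sketch gives the same 
patterns, and "bozo_abc123.png" or "bozo_100.jpg" do not map to 
"bozo_000.png".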

PIMPPA utils like "pmv", "padopt" and Bowser automatically 
update and create assign patterns, and the whole pattern 
database can be reconstructed with "passign".

Assign patterns are not created or used for areas marked
as AREA_NOASSIGN or AREA_INCOMING.

NOTE: Special destination area (a_dest) values:

-1  :  A negative assign pattern destination area id will cause
       all matching files to be quietly discarded by "pleech" and 
       Bowser in the future. In English: KILL the matching files. 
 0  :  Destination 0 means that the particular pattern is 
       disabled and won't be used. The pattern won't be replaced
       by pimppa when matching files are moved or adopted. 
       Sorting those files will be left to the user.
>0  :  Some normal filearea.

The value 0 must be set by hand, e.g. in case you notice some
particular pattern causing incorrect classifications all the time.


9.3 Miscellaneous
-----------------
By default, "pleech" and bowser leech convert 
all filenames to lowercase, and discard all

1) duplicate files (see 9.4, RegExp dupecheck) 
2) MD5 -checksum colliding files (see 9.5)

For better spam avoidance, you should probably configure 
pimppa to discard

3) files that have no numbers in the filenames        (CFG_NEEDNUMBERS)
4) files that have only numbers before the extension. (CFG_NEEDOTHERS)
5) files that have whitespace in the filenames        (CFG_NOSPACE)

Usually such files are renames, spam or just plain nuisance. 
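
As a rough illustration, the three optional filename checks could 
look like this. The function name and its keyword arguments are 
invented for the example; only the CFG_* names come from pimppa. 
The sketch assumes "numbers in the filename" means the part before 
the extension.

```python
import os
import re


def filename_rejection(filename, need_numbers=True, need_others=True,
                       no_space=True):
    """Return the name of the setting that rejects filename, or None."""
    stem, _ext = os.path.splitext(filename)
    if need_numbers and not re.search(r"[0-9]", stem):
        return "CFG_NEEDNUMBERS"   # no numbers in the name
    if need_others and re.fullmatch(r"[0-9]+", stem):
        return "CFG_NEEDOTHERS"    # only numbers before the extension
    if no_space and re.search(r"\s", filename):
        return "CFG_NOSPACE"       # whitespace in the name
    return None


for name in ("holiday.jpg", "1234.jpg", "my pic 01.jpg", "ab-103-h.jpg"):
    print(name, "->", filename_rejection(name))
```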

Use "pcfg" or "bowser" to change the settings to your
liking. See also 5.6, Changing Preferences.

Some additional behaviour options may be added in the future,
if useful ones come to mind. I'll happily receive all 
suggestions and ideas!


9.4 Restricting dupechecking with RegExps
-----------------------------------------
By default, pimppa leech checks by filename that incoming
files do not already exist in any filearea. 
If they do, the incoming counterparts are called duplicates 
and deleted (unless the already existing files have failed 
the integrity check - in that case they are replaced with
the new ones).

You may wish to tighten the duplicate checking to check 
only from particular areas.

Example case.

You get "goat100.jpg" from "alt.binaries.pictures.animals",
and later a file with the same name from "alt.worship.goatlord". 
Now there's a good chance these are not the same file. You
might prepare for cases like this by restricting the dupechecking
as follows:

Group: "alt.binaries.pictures.animals"
        => default destination area "0animaltmp"
            => set "0animaltmp" to dupecheck from areas "cats", "dogs", "sheep"
Group: "alt.worship.goatlord"
        => default destination area "0occulttmp"
            => set "0occulttmp" to dupecheck from area "weirdstuff" only

The dupechecking is set for the destination areas, not for the 
sources themselves. (E.g. many newsgroups may map to the same 
destination area and follow the same dupecheck patterns).

To set RegExp duplicate checking, just set a proper area_id
RegExp pattern for any incoming area (modify 'area_targets' -column). 

Examples:

$               Default, check from all areas
^1$             Dupecheck only from area with ID 1
^7$|^10$|^15$   Dupecheck from areas 7,10,15.

You can set these from "bowser" or directly by "mysql".
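
To see which areas a given 'area_targets' pattern would select, 
the matching can be imitated outside the database. This sketch 
assumes the pattern is searched for anywhere in the decimal area_id 
(unanchored regexp matching); the function name is made up.

```python
import re


def dupecheck_areas(area_targets, area_ids):
    """Return the area ids whose decimal form matches the pattern."""
    pattern = re.compile(area_targets)
    return [a for a in area_ids if pattern.search(str(a))]


areas = [1, 7, 10, 15, 100]
print(dupecheck_areas("$", areas))              # every area matches
print(dupecheck_areas("^1$", areas))            # only area 1
print(dupecheck_areas("^7$|^10$|^15$", areas))  # areas 7, 10 and 15
print(dupecheck_areas("1", areas))              # also hits 10, 15, 100
```

The last call shows why the '^' and '$' anchors are worth typing: 
without them, "1" would also select areas 10, 15 and 100.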

NOTE: RegExp duplicate checking also affects assigning by
leech operations: only those assign target areas are seen valid which
are matched by RegExp. If there is no match, default destination 
area is used. If assign target is negative (0DOOM), the file 
will be deleted (is this behaviour wise?), no matter what 
RegExp says.


9.5 MD5 -based dupecheck
------------------------
For each file entered to pimppa system, a 128bit MD5sum
will be calculated. The sum is compatible with RFC 1321. 

If STRICT_MD5 is used (the default), pimppa utilities
will delete all incoming files which have an md5sum
colliding with some existing md5sum. This gives a really
good duplicate discarding system, though some innocent
files might be deleted because of false checksum collisions. 
After going through my database I didn't find such a case,
but found over a hundred "valid" collisions (which were renames: 
exactly the same file, but with a different filename).
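
For the curious, an RFC 1321 checksum and a strict duplicate test 
can be reproduced with a few lines. This is an independent sketch 
using Python's hashlib, not the code pimppa runs; the function 
names and the in-memory set of known sums are assumptions standing 
in for the file_md5sum column.

```python
import hashlib


def md5_hex(path, chunk_size=65536):
    """Return the MD5 checksum of a file as 32 hex characters."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that big files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_md5_duplicate(path, known_sums):
    """True if the file's checksum collides with an already known one."""
    return md5_hex(path) in known_sums
```

A renamed copy of an existing file gets the same checksum, so it is 
caught here even though the filename check lets it through.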


9.6 Downloading & file skipping
-------------------------------
How does the download work in normal, non-lenient operation? 
This might be useful knowledge if you wish to understand how 
(and why) pimppa discards incoming stuff.

For efficiency, newsgroups are downloaded in two phases. 
First, only headers are taken, then the articles themselves.

Header phase: suck will output only those headers which 
pass the user-specified suckkillfile (default: accept all).

Then each header is considered by pimppa. If no filename
matching P_FILETYPES is found, we skip this article. Otherwise
we have parsed a filename. First we check that this filename passes
the current requirements for filenames (example: NOSPACE) and
that it's not a duplicate. Only then do we check the regexp 
rules for subject lines. After that, we check that there is no 
filename regexp rule or assign pattern pointing to a 
negative filearea (=> kill this file). 

If the header passes this mechanism and all parts exist, 
the respective full articles are downloaded in the next phase.

Article phase: Accepted articles are downloaded.

Decode phase: Articles are decoded. After decoding, pimppa
re-checks that the filenames are not duplicates or 
on their way to oblivion (assign or rule pointing to negative 
area). This is done here again because the subject line does 
not always have the same filename as was included in the 
uuencoded data of the message itself. Next, a prune script
is run (if it's defined for the file type, see type_prunecmd) to 
decide if the file should be deleted or not. Finally, MD5 checksum 
of each file is matched globally and the file is discarded if a 
duplicate is found.
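
The order of the header phase checks matters, since the first 
failing check decides why an article is skipped. A toy model of 
that ordering is sketched below. Everything in it is hypothetical: 
the header dictionary, the check names and the stand-in predicates 
only mirror the sequence described above, not pimppa's internals.

```python
def first_failed_check(header, checks):
    """Run the checks in order; return the first failing name or None."""
    for name, passes in checks:
        if not passes(header):
            return name
    return None


known_files = {"goat100.jpg"}    # stand-in for the filename dupecheck
killed_names = {"spam001.jpg"}   # stand-in for negative rules/assigns

HEADER_CHECKS = [
    ("filename found", lambda h: h["filename"] is not None),
    ("filename requirements", lambda h: " " not in h["filename"]),
    ("not a duplicate", lambda h: h["filename"] not in known_files),
    ("subject rules", lambda h: "llama" not in h["subject"]),
    ("not killed by rule or assign",
     lambda h: h["filename"] not in killed_names),
    ("all parts exist", lambda h: h["parts_complete"]),
]

header = {"subject": "goat pics - goat101.jpg (1/1)",
          "filename": "goat101.jpg", "parts_complete": True}
print(first_failed_check(header, HEADER_CHECKS))  # None: download it
```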


9.7 Prune scripts
-----------------
The prune scripts are a heavy, possibly content-based way of spam 
avoidance. They are defined per file type and can remain undefined. 
These shell scripts are executed after news decoding, before adding 
the file to the database. The script is given the full path of the 
decoded file as an input, and the script can use whatever measures 
it pleases to decide if the file is acceptable or not. It can look 
at the image statistics, dimension, content, and so on, depending 
on what you want -- and are able to code. 

There is an example prune script provided for jpegs, 'p_prunejpg' 
which deletes too small images (params at the start of the script). 
It needs to be manually entered to "p_types" table to be used.

The prune scripts can be defined to be used from Preferences
in "bowser", the field is "Filetypes/type_prunecmd". The script
must be on command execution path.


10. MySQL table explanation
---------------------------
Main database is "pimppa" and it's owned by user "pimppa". Some
of these can be modified from "bowser" preferences.

"p_areas" is the table containing all your fileareas.

    area_id         Unique area ID number
    area_name       Area name. Should be logical and quick to type.
    area_path       The directory path for the files of this area.
    area_flags      Properties of this area (hints for utils)
    area_context    The group of fileareas this area belongs to
    area_targets    RegExp pattern for dupechecking incoming (see 9.4)

"p_assign" contains the assign patterns - where "pleech", "bowser"
and "padopt -m" should deposit certain files. A negative destination
area makes the utils delete the incoming file. Zero destination means
that this pattern is disabled.

    a_pattern   The filename pattern to match
    a_context   Context where this pattern is valid
    a_dest      Destination area_id for all matching files

"p_contexts" contains information about defined contexts, i.e.
groupings of the areas.

    a_name      A short name identifier for the context, unique
    a_id        A numeric context identifier, primary key
    a_desc      A free form description of this context

"p_files" contains all the files you have.

    file_id     Unique file ID number
    file_name   File name, unique per filearea.
    file_size   File size in bytes
    file_area   Area ID where this file should be
    file_integ  File integrity check status
    file_trans  File transformation status
    file_date   The date when you got this file
    file_backup The ID of the backup, 0 if none.
    file_desc   Optional ASCII text description of this file
    file_flags  Is there something special with this file (offline?)
    file_md5sum 128 bit MD5 checksum for this file
    
"p_groups" contains information about newsgroups.

    g_name      Unique name of the newsgroup, e.g. "alt.binaries.test"
    g_last      Last msg read -pointer
    g_flags     Newsgroup flags (can be |= GROUP_DISABLED)
    g_dest      Destination filearea id
    
"p_misc" is a really general table for various PIMPPA-utils
to store their status information and configuration.

    misc_key    Identifying unique key for the info
    misc_data   The actual data associated with the key

"p_rules" contains user-specifiable RegExp file/message classification rules

    r_rule      The regexp rule itself (eg. ".*jpg") matching some filenames
    r_context   Context where this rule is valid
    r_type      0 == FILENAME PATTERN, 1 == SUBJECT KEYWORD PATTERN
    r_target    Area_id where the matching files should go
    
"p_servers" has all the newsservers you want to use.

    s_id        Unique server id
    s_name      The server hostname (e.g. news.hypermecha.com)
    s_user      Username on the server (empty=none=default)
    s_pass      Password on the server (empty=none=default)
    s_flags     Server flags

"p_types" contains the information how to handle various filetypes.
If type_testokstr is an empty string, the test command return 
value will be used. 0 == FILE OK, anything else == FAILED. For
the prune command, return value of 0 means that file should be 
kept, and 1 that it should be deleted. Other values will be
interpreted as error (operationally as 0). Be careful. PIMPPA will 
provide both the test and prune commands the file path as input.

    type_ext        File extension of this type, e.g. "JPG".
    type_testcmd    Command to use to test a file of this type 
    type_transcmd   Command to use to transform a file of this type
    type_testpos    Position of the success string for testcmd (deprecated)
    type_testokstr  The actual "all correct" string given by testcmd.
    type_transto    Destination file type when transforming.
    type_prunecmd   Command to use to decide if a file should be deleted

Trick: set "type_prunecmd" to "false" to delete all files of this type. :P
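
The prune command return values described above can be illustrated 
with a small wrapper. This is a sketch under assumptions: the 
function names are made up, and the file path is passed as the last 
command line argument (the text only says the path is given "as 
input").

```python
import subprocess


def prune_verdict(returncode):
    """Map a prune command return value to keep / delete / error."""
    if returncode == 0:
        return "keep"
    if returncode == 1:
        return "delete"
    return "error"  # operationally handled like 0: the file stays


def should_delete(prune_argv, path):
    """Run the prune command on path and tell if the file should go."""
    result = subprocess.run(list(prune_argv) + [path])
    return prune_verdict(result.returncode) == "delete"
```

On a typical Unix system should_delete(["false"], path) is True 
for any path, which is exactly the trick mentioned above, because 
"false" always returns 1.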


11. Feedback & discussion
-------------------------
PIMPPA has been maintained by myself with the help of
various contributors over the net. For discussion related to 
this software or general file-hoarding, please use the users
mailing list freely. And no need to write like I do. Relax. :D

<pimppa-users(-at-)lists.sourceforge.net> 

For anything related to development, please send your 
messages to the developer mailing list. Any bug-reports, 
patches, comments or suggestions are welcome. Especially 
if you have an idea about some new functionality/tool/script 
that would benefit PIMPPA, don't hesitate. And if you know how
to actually do it, all the better! 

<pimppa-devel(-at-)lists.sourceforge.net>. 

If you like, you can also contact the head honcho directly,
<iwronsky(-at-)users.sourceforge.net>. PGP key at 
the end of this file for the paranoid.


12. Glossary
------------

    "context"
    ---------
    A group of fileareas. Assign patterns and classifying rules 
    can be specified to function only in specific contexts. This
    happens by setting a nonzero context number for a filearea and
    the same number for your rule(s). Patterns and rules specified 
    for context 0 (global) are always matched, unless a >0 context 
    rule exists.

    Note that if you use contexts (default is "all are global"), 
    you probably shouldn't have any filearea belonging to 
    context 0, because if you kill files there, you also 
    invalidate those filenames for the other contexts. 


    "0DOOM"
    -------
    This is a point-of-no-return -filearea which should exist
    on all systems. Its area_id is "-1" and area_path "/dev/null".
    All files moved or assigned to 0DOOM will never be seen again.

    MD5 -checksums of the moved files will be kept but the 
    files assigned there in the future won't leave a trace.


    "assign pattern"
    ----------------
    When downloading, pimppa decides based on assign patterns 
    where to store files whose filename matches an assign pattern 
    in the database. Section 9.2 tells how the patterns currently operate.
   

    "filearea"
    ----------
    PIMPPA is structured so that every file is on a certain 
    filearea. Fileareas are created with "pnewarea". All files of 
    similar content should be on a certain filearea, so you can 
    easily find them. It's quite like a normal directory, 
    except that some additional info of the filearea contents
    is kept in the p_files database to make the backup, duplicate-
    check, lookup, etc, operations possible and fast.


    "Incoming" (AREA_INCOMING)
    --------------------------
    Fileareas to which "raw" material from newsgroups is decoded.
    Incoming area contents will not be backed up or transformed. 
    
    In ideal operation pimppa is like a sorting network,
    the files can be seen travelling like this:

    Newsgroup  Filter 1        Filearea     Filter 2         Filearea
   
    group1-| 
           |->-[autofilter]->- Incoming1 ->-[humanfilter]->- Quality1
    group2-|      |  |                         |                |
                  |  `------------>--------------->-------------'
                  |                            |
                  Discard                      Discard

    etc. Due to the assign patterns, recognized files can be
    moved to correct fileareas without human interaction. Also
    the duplicate checking (filename and md5check) and certain
    requirements for filenames allow some incoming files to 
    be discarded automatically.
  
    Filter 1 mentioned above is described in 9.6.


    "transform"
    -----------
    Operation which can be performed (once!) for a file of
    a certain filetype. This operation can be a conversion
    of ".GIF" to ".PNG", an optimizing of a JPEG, or whatever.
    You can specify the transformation command and result
    filetype (which can be same as source type) in 'p_types'-
    table.

    Some example transformations are in "sql/example_types.sql"


    "offline"
    ---------
    Files which exist on some filearea but not in its respective
    directory are called offline. The files may have been backed up 
    and deleted, or just lost. Most PIMPPA utilities skip offline 
    files. Files which are present are called "online". 


    "PIMPPA"
    --------
    PIMPPA is a fabulous content seeking/devouring creature 
    or monster in the forgotten Scandinavian mythologies.
    

13. PGP
-------
This is my PGP public key. Do you trust it hasn't been
tampered with by MAN IN THE MIDDLE? Hell no. Just send 
normal mail like everyone else. ;)


-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.0.1 (GNU/Linux)
Comment: For info see http://www.gnupg.org

mQGiBDawXNERBADWkpWZoDuB//oUsvlmCKNufobFKdTX1SGfqDffoHgOO+01LVU9
QZl+snoSoAz7TkRnEep34mdRcx8IRe/xuLi88jEiI2zVLCHoYFBDB6lpxfSGYjJq
IfqCxv1GKMTigykLI5oxyLgYyrONL00uZzPRuTnSrwZqWuqqoHANlxed1QCg/3Xy
FLgvmPRZkBVZ7cjAV+ZlNEEEAKNFmfPGTwKd5X0MBSI4aIUXKO3G+Narjn9vVDoi
LsU41LR3VO50UeWvnbSEi93YTDwXpFuKK4nMKG1Z5T8pZPQ6NqcOu6NxBm6GHMoR
L6ZI+oRgbdAG2shTlfbnZ10qyPhTg0VDf+zfD7sx/vK7jo5uywrmiTUvQvEli4ps
6KCFBADGSC6Uf/5lNB7+4VAko0j1G6m3cnOekvl6/XX+FihPob9NECZbiJR3lvzR
ayuJaOSoX/zNJEHDzfOu1qzjOvbWBWY+JBkJ1QYvpB1A+Y84QFcBJRlrvTcXxgfv
urMkptlO3yUY7UPKIt7+52XTAesfWZkmkgC9cPQ5gD2l1FjyZbQhSWdvciBXcm9u
c2t5IDxpd3JvbnNreUB5YWhvby5jb20+iEsEEBECAAsFAjd9HuMECwMBAgAKCRAg
/QqD5XhjksYfAKD8QRg9EPsiLacwNPTGppusFCCvpQCg6k0tB7FVJaLjunbWKItZ
LuPHvFm5Ag0ENrBc0hAIAPZCV7cIfwgXcqK61qlC8wXo+VMROU+28W65Szgg2gGn
VqMU6Y9AVfPQB8bLQ6mUrfdMZIZJ+AyDvWXpF9Sh01D49Vlf3HZSTz09jdvOmeFX
klnN/biudE/F/Ha8g8VHMGHOfMlm/xX5u/2RXscBqtNbno2gpXI61Brwv0YAWCvl
9Ij9WE5J280gtJ3kkQc2azNsOA1FHQ98iLMcfFstjvbzySPAQ/ClWxiNjrtVjLhd
ONM0/XwXV0OjHRhs3jMhLLUq/zzhsSlAGBGNfISnCnLWhsQDGcgHKXrKlQzZlp+r
0ApQmwJG0wg9ZqRdQZ+cfL2JSyIZJrqrol7DVekyCzsAAgIH/AzQXkgMRpwsbEVh
XSEH/5kbN4Ls9LbFMkPelmaODl2W2wjmWa+7loBFnKn+9WHh77/GLMzHGPYoTzZv
wp6bAYbcq4cu20qdW2tTIfUXJz+ey3r5rwFR5y5qkiBqfFczepY0biUcUI7dWt/Q
LUyN6oVyVAjclmfvA/JWi7LmMRl6Jo1doKXLYhHOuFXkoGqIExrO9EUKTMGsa0Lm
uJVv6kb0v9EAyiJU/zzMvKotPtdzzPqz2m+0mt/XsMhfbT6xl2XkmESvQhgev6Yh
DYpzVSZOZeZ7Etzpp2eDwfP4AU23ge6KFO7g33cSEJilBe7x3ZkiTb5Hgqs3FnWi
t9EqwDmIPwMFGDawXNIg/QqD5XhjkhECewUAn3P1gtt1Y3DZRRWvJ9TgNCtc+qcp
AJ9p6oceyAzcCw87KNm3kW7u6gBK6g==
=Hq/M
-----END PGP PUBLIC KEY BLOCK-----


<EOF>
