(Does anybody read TODO files but their author?)

##########
# If you are interested in writing a module, the Skeleton module is the best
# place to start.  Please submit modules and/or patches to Bob McElrath
# <mcelrath+filterproxy@draal.physics.wisc.edu>.  See also the mailing lists
# and sourceforge project page: http://www.sourceforge.net/projects/filterproxy
# 
# I have devoted considerable thought to the idea of making modules a blessed
# class.  I am undecided as to whether this will provide a net benefit, or just
# slow things down, and make handling of configuration a pain.  If you are
# writing a module, or considering writing a module, please think about this
# and let me know your thoughts.
########## 

[Perl WARNING] Use of uninitialized value in anonymous hash ({}) at /usr/lib/perl5/site_perl/5.6.0/HTML/Mason/Request.pm line 146.
    bug report sent to mason mailing list.
XSLT:
    http://www.nwalsh.com/docs/tutorials/xsl/xsl/frames.html
        Tutorial.  XSLT doesn't seem to have regex matching capability.
        Rather, it transforms structure.
    http://www.zvon.org/xxl/XSLTreference/Output/index.html
        Another reference
        under XLab->string functions it has some rudimentary string matching
        functions which might be useful for ad-stripping.
    Make stylesheets display something about what they do.

Add an M$ HTML-UNfuckifier (hardcode - not a rule) - convert to proper HTML entities.
    Maybe make a module for this?
    http://www.perl.com/language/misc/demoroniser.html
    http://www.time.com/time/2001/inventions/clothing/index.html    
            M$ seems to have invented some new characters.  Again.
           &#154;F      I don't think this is the HTML entity they intended.  (It's not "degree")

http://bobby.cast.org/bobby/bobbyServlet?URL=http%3A%2F%2Fdraal.physics.wisc.edu%2FFilterProxy%2F&output=Submit&gl=wcag1-aaa#g245
                                                        how to make my web page more friendly to
                                                        unsighted people.  (important if XSLT and fp
                                                        become widely used by the blind)
http://www.opengl.org/                                  mangled.
    dunno what to do about this one.  Sucking in <td> destroys the document.
    Do I want to create a site rule for this one?
http://freewarepalm.net/doc/doc_ebook.shtml             mozilla shows compressed data! (and headers)
    (may have to go up a page and click on the e-book link.  Looks like leading 0 in page is causing trouble)
    I can't reproduce this anymore...
http://www.cnn.com/2001/TECH/science/12/06/physics.reut/index.html
                                                        flash ad...
                                                        bah, the page expired.  Fuck cnn.
rule YAHOO_JAVA: 
    YAHOO_JAVA: strip regex /ADVERTISEMENT/ inside tagblock <font> inside tagblock <table> add encloser <table>
    changing order of regex and tagblock <font> slows it down by 1000x.
http://resources.cisco.com/app/tree.taf?asset_id=75234  mangled.

Backreferences ($1) in regex matcher.

Connect to proxy as a web server (not proxied)...causes an infinite fork bomb.
    See John Waymouth's messages.
    Is this fixed?  I can't reproduce it anymore...

Add to webpage:
    http://www.mozilla.org/unix/customizing.html
    http://home.c2i.net/dark/linux.html#fuzzy       netscape fonts
    http://www.geocities.com/pratiksolanki/         same info as the customizing page?

When rewrite markup is later stripped, it does not show properly, and is
inserted in the wrong place.
    http://dailynews.yahoo.com/h/nm/20011012/pl/attack_congress_security_dc_25.html

Show filtering fails on error (Mozilla spinner just spins...)

"Reload this page without any filtering" javascript bookmark.

Requesting config pages with proxy off doesn't work. (-> search.netfuck.com)

Test using FilterProxy with another proxy that requires auth.  Does "Host" logic
in Header.pm to remove Authorization: header break it?

FilterProxy on sourceforge: /home/groups/f/fi/filterproxy/htdocs/

Error messages are wrong (trying to connect to localhost on chani...gives 500 Timeout)
should be Connection denied or something.  FIXME: workaround enabled.  I think
these bogus error messages are from perl 5.6's IO::Socket.  (See IO::Socket::new)
    TODO: patch IO::Socket and submit.

Don't decompress if there are no content filters.

Option to use Digest Authentication.  (In FilterProxy/Auth.html)

Latest Mozilla 5/24/2001 no longer closes each connection, BUT:
    1) It sends the request, gets a Proxy-Authenticate, (closes connection), sends
        Proxy-Authorization with duplicated request.  (every time)
    2) It closes all connections after successfully loading a page.

Rewrite: merge all tag name regexes into one, and look for any tag that matches.  
Then see which rule matched (re-match tagname against each rule's tag regex),
then check attribs, then apply predicates for rules that succeed.  -- Only 
need to traverse the file ONCE!
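The one-pass merge above could look something like this minimal sketch (the
rule structure, names, and patterns are made up for illustration):

```perl
use strict;

# Hypothetical rule structures -- names and patterns are illustrative.
my @rules = (
    { name => 'banner', tagre => qr/^img$/i },
    { name => 'applet', tagre => qr/^(?:applet|object)$/i },
);

# Merge every rule's tag-name regex into one alternation, so the document
# can be scanned once with a single cheap match.
my $merged_src = join '|', map { $_->{tagre} } @rules;
my $merged = qr/$merged_src/;

sub match_rules {
    my ($tagname) = @_;
    return () unless $tagname =~ $merged;   # the one-pass filter
    # Re-match the tag name against each rule's own regex to find which
    # rules actually fired; the caller then checks attribs and predicates.
    return grep { $tagname =~ $_->{tagre} } @rules;
}
```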

Send patch for pop_header, Proxy-Connection to Gisle Aas (libwww maintainer)

Chunky processing.  (Rewrite and Compress)  Give Rewrite a window in which to
apply rules (1k), then process in 2*window sized chunks (in case a match falls
on a boundary).  This will slow things down (because everything is processed
twice), but also speed things up (since 'add' can't search beyond the window).
And it will speed things up if the processor is faster than the network.  Will
this be a net speed-up or slow-down?  (Can always set the chunk size large --
then it will only affect really large pages.)
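A sketch of the overlapping-chunk idea above: each chunk is twice the window
size and the scan advances by one window, so any match shorter than the window
that falls on a chunk boundary still appears whole in some chunk.  (chunked_scan
and its callback interface are made up for illustration.)

```perl
use strict;

# Process $data in overlapping chunks: each chunk is 2*$window bytes and
# the scan advances by $window per step, so a match shorter than $window
# that straddles a boundary is still seen whole in one of the chunks.
sub chunked_scan {
    my ($data, $window, $callback) = @_;
    my $len = length $data;
    for (my $pos = 0; $pos < $len; $pos += $window) {
        # $pos is the absolute offset of this chunk within $data.
        $callback->(substr($data, $pos, 2 * $window), $pos);
    }
}
```

Matches found in the overlap region get reported twice (once per chunk), so
the caller has to dedupe by absolute offset.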

Proxy-Authorization business: Proxy-Authentication-Info header? (rfc2617)

Implement close of connection to server when client closes connection.

Web page: screenshots of pages before/after filtering.  Must move web pages to
different site (sourceforge?) because my DSL line won't handle the bandwidth of
all these images.  ;)  Candidates:

    http://news.cnet.com/news/0-1006-200-5044903.html?tag=st.cn.1.lthd

SSL Support for libwww-perl: IO::Socket::SSL, Crypt::SSLeay

Incorporate HTTP::Daemon, HTTP::Request, HTTP::Response, HTTP::Message, HTTP::Status
    LWP::Protocol::http.  Modify as necessary to support basic HTTP/1.1 for requests.
    See Also a generic perl server: http://seamons.com/net_server.html

Obey 'Cache-Control: no-transform' (rfc2616 13.5.2)  FilterProxy probably breaks 
digest authentication currently.

Deflate compression (Compress.pm) seems not to work, try url:
    http://groups.yahoo.com/group/http-wg/messages/8730?expand=1
    Yahoo now uses gzip?  Not sure if this bug still exists?

Make a function HTTP::Headers::pop_header (to corresponding push_header)
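The pop_header above might be doable entirely through the documented
HTTP::Headers interface (header() in list context returns all values of a
field, remove_header() deletes them, push_header() appends); a sketch,
untested against HTTP::Headers itself:

```perl
use strict;

# Sketch of a pop_header to mirror push_header, written only against the
# public interface of a headers object.
sub pop_header {
    my ($h, $field) = @_;
    my @vals = $h->header($field);           # all values, in order
    return undef unless @vals;
    my $last = pop @vals;                    # value to remove and return
    $h->remove_header($field);
    $h->push_header($field, $_) for @vals;   # put the rest back, in order
    return $last;
}
```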

Check out Thread::Signal, which may allow use of SIGCHLD handler again.

Take a look at Devel::leak and try to track down memory leaks.

Use Image::DeAnim module from CPAN rather than current code for DeAnim.

Transfer-Encoding (rather than Content-Encoding).  The client must send a 'TE'
header indicating gzip capability.  This is implemented in FilterProxy::Compress
(there is a checkbox for it on the Compress config page), but I know of no
browser that will both send the TE header and properly decode the
Transfer-Encoding header.  I have filed bug 59464 for Mozilla:
    http://bugzilla.mozilla.org/show_bug.cgi?id=59464
See also bugs 68414 and 35956, and this apache module:
    http://www.mozilla.org/projects/apache/gzip/
If you read this and know of a browser that uses it, please tell me.

When config has changed, change If-Modified-Since headers to be the 
date/time of the last config change.  (so Netscape doesn't grab cached,
filtered files when the config changes) (Need Header module...)
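The comparison itself could be as simple as epoch-second arithmetic.  A
sketch (want_full_fetch is a made-up name; parsing the header date into an
epoch, e.g. with HTTP::Date::str2time, is assumed to happen elsewhere):

```perl
use strict;

# $ims_epoch:    the client's If-Modified-Since, as epoch seconds
#                (undef if the header is absent)
# $config_mtime: when the filter config last changed, as epoch seconds
# Returns true when If-Modified-Since should be dropped so the origin
# server sends a full body that can be re-filtered with the new config.
sub want_full_fetch {
    my ($ims_epoch, $config_mtime) = @_;
    return 1 unless defined $ims_epoch;          # client has no cached copy
    return $config_mtime > $ims_epoch ? 1 : 0;   # config newer than cache?
}
```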

Add an interface for stripping data out of html files by interpreting the
structure, e.g. for feeding to a script, database, palm device, etc.  For
example, strip just today's weather out of a file and do something with it
so I can put it on my Palm.  (part of Rewrite module?)  See for example the
ShowTimes program for Palm and its 'getdata.pl' script.

stream filter framework, convert Compress to a stream filter.  HTTP::Daemon can
handle 'Chunked' encoding.

allow Compress to configure its compression level.

Central server for site-based filtering preferences?

"Variables" for use by modules.  For (Andover filter), or list of ad servers.
Should these automatically be joined by '|' when used?  Are there uses for
variables that won't end up in a regexp?

Other fork models: prefork.  Check out NetServer::Generic, which implements a
preforking server.

Use instead of open() to load a new page:
    window.location.href = "http://www.netscape.com/"; 
(old) javascript FAQ: http://www.irit.fr/ACTIVITES/EQ_HECTOR/pouilly/HTML/javascript.html
    escape nasty characters in a url: javascript:escape("...");

Use http://chani:8888/MODULE/?page
    to request show-me-what-was-filtered?
    send to MODULE::Config?

find an API to get hostname rather than `hostname`.
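For the hostname item above, the core Sys::Hostname module should cover it
without shelling out:

```perl
use strict;
use Sys::Hostname;   # core module; no backticks, portable across platforms

my $host = hostname();
print "hostname: $host\n";
```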

TODO-by-module, and ideas:
--------------------------

Profile Agent module:
  1) acts as a Netscape roaming profile server, so that it stores bookmarks
     and netscape preferences.  
  2) Allows web-based access to a user's bookmarks (so friends can see them).
  3) Has a "search engine" which indexes content on bookmarked pages (and
     pages linked to from a bookmarked page, on the same server), so that you
     can find things in your bookmarks by a search engine search.  
  4) Allows "classification" of bookmarks in a more sophisticated manner
     (preferably by keyword, rather than tree), and then can generate
     yahoo-like indexes by keyword.  (or by searches for a keyword).  
     
  For instance, I might bookmark the homepage for xmms (http://www.xmms.org/)
  which I would then classify by adding the keywords (mp3, linux, audio,
  music, eyecandy, earcandy, X11, software).  Then when I do a search for
  "software" using this module's interface, I get all items which have the
  keyword, including xmms.  If I search for "linux software", I get all things
  with these keywords, etc.  You get the idea.  (You could make a yahoo-like
  index, or filesystem-like path by joining keywords "/linux/software/mp3"
  note this is the same as "/software/linux/mp3")  (Does anyone else but me
  have thousands of bookmarks, and occasionally think "I saw a piece of
  software that does X", and then spend 2 hours manually searching your
  bookmarks?)

  For bonus points, add a web spider that will search documents linked from
  the bookmarked page, and add them to the search engine's database.  (This
  way you could find info by searching that you've never seen, but is closely
  related to something you've bookmarked).  Add an intelligent AI to help
  distinguish relevant material.

  For bonus bonus points, add the capability for the spider to use Netscape's
  "What's Related" (or similar) interface to find things similar to the page
  bookmarked, and index them too.

  For bonus bonus bonus points, make sure this doesn't get exploited by
  advertisers.

  This could be an entire Ph.D. project on software agents.  Any takers?

Header module:
  Pass user info to Header module, let Header do authentication.

Rewrite module:
  An alternate way to write Rewrite would be to "tree" the entire file before
  applying filters.  That is, go through it and fill a data structure with the
  location of every tag and its length (using HTML::Parser?).  Then compile all 
  filters, and traverse the tree once.  Filtering time is proportional to 
  number of times the file is traversed, so this may be faster.  (maybe not, 
  treeing is slow)

  Yet another way would be to take advantage of HTML::TokeParser, and use
  HTML::TokeParser::get_tag("img").  Don't know if this would be any faster.
  (Is not faster because TokeParser still uses HTML::Parser, which calls a
  function for EACH TAG...so the time to traverse is proportional to the number
  of tags.  I am not able to use HTML::Parser on any reasonable documents in
  less than several seconds.)

  * Add insert command
  * Add regex commands for s///, tr///, y/// (sslsllllooooowww....)
  * profile for speed.  (Rewrite is ssslllloooowwww....)
  * make sharing of filtering rules easier.  (how to send a rule to a friend?
      how to document/comment what a rule does?)
  * rule to get rid of M$ FrontPage ? characters
  * Add a "maxdistance" option -- enclosing block can be at most maxdistance
      away. (a la Proximitron)
  * Allow multi-line rules: (for readability)  -- already possible? (no)
      strip tag <a href=doubleclick.net>
          -growto tagsurrounding </table|form/>
  * Boolean operators in tag specifier?
      strip tag <img width=1 | height=1>
    which is different from:
      strip tag <img /width|height/=1>
  * Extend tag matcher to match comments (try not to make it examine *every*
    comment:
      tag <!-- /Begin Ad/ -->
  * Match { ... } addbal;
    tag <img src = /doubleclick.net/> inside tagblock <table width=468>
      -growto ... -addbal -addalt
    tagblock <table width=468 height=60> {
      containing tag <img>                        # how to require more than one
      containing tag <a href = /doubleclick.net/> # 'containing' (AND case) separator?
    } -addalt -addbal
    tagblock <table width=468 height=60> containing {
      ( 
        tag <img> AND 
        tag <a href = /doubleclick.net/>
      )
      OR tag <a href=/flycast.com/>
      OR regex /Begin Ad/
    } -addalt -addbal
    regex /doubleclick.net/ inside {
      attrib </img|script|i?frame/ src>
      OR attrib <a href>
    } -addbal -growto tag </table|(no)?script/>
  * Conditionals: containing, inside, before, after
  * Booleans: AND, OR
  * Boolean grouping: by ()
  * Block: delimited by {}
  * Modifiers: -addalt, -addbal, -growto, -check
    -check checks that document structure is preserved (all tags balanced, 
    all closers accounted for) -- what about case where tag closer is absent?
    (i.e. <p>, <hr> etc)
    rewrite attrib <font size=1> as size=2
    strip attrib <font size=-1>
    regex (grab args as qr{} and apply it) (use tr// to get rid of M$ chars)
    tagblock (replaces tag -tagblock)
    tag (replaces tag -tagonly)
    tagsurrounding (replaces tag -ifencloses)
  * Take a look at HTML::TableExtract
  * For each piece of the document to be stripped check that it has no open 
    tags in it.  i.e. visorcentral.com -- stripping <a> leaves open <font> due to:
        <font><a href=...>name</font></a>
  *!! Speed up: instead of using substr(...) = ... to rewrite portions,
    copy the document exactly once.  At each failed match, copy the part of
    the document up to that point to a 'newdoc' variable.  Could be bad for
    really large documents.  But then substr(...)=... is even worse on
    large docs.  (Benchmarked this, it's about 100 times slower than the 
    current implementation)
  * implement a way to remove a frame, and grow other frames into its space. (netzero.com)
  * Give "add" a max size (as % of page) that it can grow to, to prevent sucking
    in whole pages.
  * Keep list of "unfound" alt contents, and their position.  If the position 
    in the page is the same, add it via growto.  (other ads sucked in -> alt 
    content separated by ad.  i.e. <script></script><layer></layer></noscript></nolayer>
  * MS FrontPage de-stupid-ifier.  See demoronizer.  Need regex tr/// to do this in Rewrite.

Status pages:
  Check out IPC::Shareable (and related) for this.
  Be able to provide the following information via served web pages:
    * Number of proxy connections open
    * IP addresses of clients with open (and closed) connections
    * Amount of data transferred to each client
    * Proxy-authenticification administration (via header module configuration?)
    * History of URL's loaded.
    * Boolean flags from FilterProxy.html (filtering, debug, etc), allow POSTing
      of new values.
  Requires: 
    * Ability to communicate with group leader process 
    * Ability to communicate with other child processes
    * .html file (Parse::ePerl) must be given this data in %ENV
  Methods to do this:
    * Have each child keep a two-way pipe open with parent.  Children
      write data to this, parent collects it.  Any child may request
      data from parent (have data given to children in a Data::Dumper
      format, which can then be eval'ed by the child?)
    * Kids only write data to pipe when parent requests it.  (for speed)
      Each child must keep internal statistics.  When a status page is
      requested, child asks parent for statistics, parent gathers it
      from kids, feeds it to serving kid.

Proxomitron module:
  Proxomitron is a windoze program which does many of the same things as
  FilterProxy.  See: http://members.tripod.com/Proxomitron/.  One person
  has expressed interest in loading its filters.  (are there others?
  Will someone volunteer to write this?)  It would involve:
    * Interpreting Filter configuration file:
        <field> = <value>
    * Implement each of the fields.  (Name, Active, URL, Limit, Match,
      Replace, others?)  Note that implementing match and replace will
      involve converting glob expressions to regexps.  Is there an
      existing perl module to do this?
    * applying the filter.  Looks like this may be as simple as a s///g
      inside the filter method.
  This is somewhat less than elegant, since the Proxomitron module will
  have to run for every URL, and then compare the URL to its own
  internal list of filters (and what url they filter), duplicating
  functionality already provided by FilterProxy.
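A naive glob-to-regexp converter would handle the simple cases; Proxomitron's
matching language is richer than shell globs, so a real loader would need more
than this sketch:

```perl
use strict;

# Naive glob -> regex conversion: '*' matches any run of characters,
# '?' matches exactly one character, everything else is taken literally.
sub glob2regex {
    my ($glob) = @_;
    my $re = '';
    for my $c (split //, $glob) {
        if    ($c eq '*') { $re .= '.*' }
        elsif ($c eq '?') { $re .= '.'  }
        else              { $re .= quotemeta $c }
    }
    return qr/^$re$/s;   # anchor both ends, like glob matching does
}
```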

Mapper module:
  works at orders 10,-10 and re-writes url's.  for instance: 
  Get "printer-friendly" version of articles at various news sites.  
  Block requests to known advertiser's domains that may have slipped through 
    Rewrite/BlockBanner.  (variables useful here)
  Example: http://ww.byte.com/column/(\w+) -> http://www.byte.com/printableArticle?doc_id=$1
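The Mapper core could be little more than an ordered list of (pattern,
template) pairs with backreference substitution.  A sketch (map_url and the
rule below are illustrative, not the real byte.com rule):

```perl
use strict;

# Ordered (pattern => template) pairs; first match wins.
my @map = (
    [ qr{^http://www\.byte\.com/column/(\w+)$}
        => 'http://www.byte.com/printableArticle?doc_id=$1' ],
);

sub map_url {
    my ($url) = @_;
    for my $rule (@map) {
        my ($pat, $tmpl) = @$rule;
        if (my @caps = $url =~ $pat) {
            # Fill $1, $2, ... in the template from the captures.
            (my $out = $tmpl) =~ s/\$(\d+)/$caps[$1 - 1]/g;
            return $out;
        }
    }
    return $url;   # no rule matched: pass the URL through unchanged
}
```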

Mirror/Cache module:
  Would work just like netscape's cache, except on the proxy side.  I'd like 
  a more sophisticated interface than netscape's "cache size".  For instance,
  cache all images from a site, only cache certain content-types, etc.

  Any cache implementation should look at the cache control headers
  (Cache-Control, Pragma, Last-Modified, Expires, etc) to determine if it
  can cache the content.  You'd have to store the Last-Modified and
  Expires timestamps with each cache file.  Then when a page is reloaded,
  and a piece of content is already cached, you can either return the
  cached object to the client (depending on expires, last-modified etc
  data stored with the object), or send a request to the server with an
  If-Modified-Since header, the value being the timestamp of when the
  cache object was created.  If the object has not been modified, the
  server will return a 304 (Not Modified) response.  If it has been
  modified, it should return the new object to you, which you can then
  pass to the client.
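  The decision logic described above, reduced to a sketch (cache_action and
  the entry fields are made-up names; real code would also have to honor
  Cache-Control and Pragma):

```perl
use strict;

# $entry is what was stored with the cached object (undef if not cached):
#   stored  => epoch seconds when the object was cached
#   expires => epoch seconds from the Expires header (may be undef)
# Returns 'fetch', 'serve_cache', or 'revalidate'.
sub cache_action {
    my ($entry, $now) = @_;
    return 'fetch' unless $entry;                 # nothing cached yet
    return 'serve_cache'
        if defined $entry->{expires} && $now < $entry->{expires};
    # Stale, or no Expires: send If-Modified-Since: <stored time>.
    # A 304 means serve the cached copy; a 200 means replace it.
    return 'revalidate';
}
```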

  The Netscape option "compare document to network" controls how often
  netscape sends requests for already-cached objects with
  If-Modified-Since headers.  All those 304 responses cause a lot of lag
  over modems, though: modems suffer from latency problems, so the more
  requests you send, the slower it is (even if each request is really
  small).  Also FilterProxy really needs HTTP/1.1 compliance in outgoing
  requests.  That way it can send Connection: keep-alive headers, and send
  more than one request to the server on a connection.

  I have also thought about a module that could scan incoming html
  documents for images, etc, and load them into the cache *before*
  FilterProxy receives requests from the client for these objects, and
  when the client finishes processing the html and requests the images in
  a page, they'll already be in the cache...



Resources/Info:
------------------------------------------------------------------------

Netscape roaming profile server info:
  http://help.netscape.com/products/client/communicator/manual_roaming2.html
  Basically it involves setting up:
    1) Authentication
    2) HTTP MOVE and PUT commands
    3) Setting up server directories for users
        (this is a good idea anyway, for storing user's configs)
  Check out Apache::Roaming module.
  Also Ht://dig (HtDig, and CPAN module HtDig::Database for accessing it)
    ht://dig is in c++ and would have to be run separately.

Accessing url's containing "localhost" through the proxy will fetch them from the
server the proxy is running on...possibly gaining "localhost" access remotely.

Daemon Contest: (*sigh* I never received the prize for this...)
    http://webreference.com/perl/dcontest.html

Other proxies
    http://www.junkbusters.com/ht/en/ijbfaq.html
    http://www.junkbusters.com/ht/en/cookies.html
    http://internet.junkbuster.com/cgi-bin/show-proxy-args
    http://www.cis.ohio-state.edu/htbin/rfc/rfc2109.html
    http://squid.nlanr.net/Squid/
    http://www-math.uni-paderborn.de/~axel/
    Junkbuster's list: http://www.waldherr.org/blocklist
    http://stockholm.ptloma.edu/httpi/  (minimalist perl HTTP server...may be useful)
    http://proxys4all.cgi.net/public.shtml (list of anonymizing proxies)
    http://webcleaner.sourceforge.net/  (Bastian's webcleaner -- very similar to 
        FilterProxy, but in python)

Roaming profile info:
    http://www.linuxworld.com/linuxworld/lw-1999-06/lw-06-penguin_3.html
    http://www.xs4all.nl/~vincentp/software/mod_roaming.html
    http://www.esat.kuleuven.ac.be/~vermeule/roam/put

Javascript info:
    http://hotwired.lycos.com/webmonkey/98/29/index3a.html?tw=programming   (image preload)

Bugs:
URL's which don't display correctly:
    http://www.spacelinks.com/SpaceCareers/Jobs-23April.htm?cnb takes WWAAAAAAAY too long.  Good case
                                                        for profiling analysis.  (lots of 1x1 gif's)
    Try to download staroffice with proxy.  (may be netscape cache interaction)

    http://mcelrath.net/ Doesn't load nameplanet logo in Netscape 4.5
        dead site?

http://www.law.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?pagename=law/View&c=Article&cid=ZZZXFR3C6SC&live=true&cst=1&pc=5&pa=0&s=News&ExpIgnore=true&showsummary=0
    Mangled.

http://msnbc.com/news/637877.asp
http://www.osnews.com/story.php?news_id=141
http://www.macosrumors.com/
http://www.novadev.8m.com/
http://www.osnews.com/story.php?news_id=161
http://www.theonion.com/onion3736/freedoms_curtailed.html
http://plaza.powersurfr.com/bert/evidence.htm
    ads
http://yahoo.com                                hoseage
http://www.tamara.com                           porn popups
http://sanfrancisco.bcentral.com/sanfrancisco/stories/1997/01/13/editorial1.html
                                                lots of blank space at top
http://dailynews.yahoo.com/h/nm/20011012/pl/attack_congress_security_dc_25.html
                                                ad box at bottom.
<a href="http://rd.yahoo.com/M=211313.1640698.3177032.1472244/D=news/S=7666459:T/A=746725/R=0/*https://www.consumerinfo.com/cb/yahooccms/form_online_a1.asp?sc=14851001">
http://lhd.zdnet.com/db/dispnewsitem.cgi?DISP?1401
                                                ad at top.



