XmlDiff TUTORIAL
================

:Author: Sylvain Thnault
:Organization: Logilab
:Version: $Revision: 1.4 $
:Date: $Date: 2003-10-08 09:34:12 $

.. contents::

Synopsis
--------
::

    xmldiff [Options] from_file to_file
    xmldiff [Options] [-r] from_directory to_directory

    Options:
      -h, --help
	 display this help message and exit.
      -V, --version
	 display version number and exit

      -H, --html
	 input files are HTML instead of XML
      -r, --recursive
	 when comparing directories, recursively compare any
	 subdirectories found.

      -x, --xupdate
	 display output following the Xupdate xml specification
	 (see http://www.xmldb.org/xupdate/xupdate-wd.html#N19b1de).
      -e encoding, --encoding=encoding
	 specify the encoding to use for output. Default is UTF-8

      -n, --not-normalize-spaces
	 do not normalize spaces and new lines in text and comment nodes.
      -c, --exclude-comments
	 do not process comment nodes
      -g, --ext-ges
	 include all external general (text) entities.
      -p, --ext-pes
	 include all external parameter entities, including the external DTD
	 subset.

      --profile=file
	 display an execution profile (run slower with this option),
	 profile saved to file (binarie form).



Detailed example
----------------

if you process two files file1 and file2 which respectively contain: ::

    <memory>
      <mailbox path="/var/spool/mail/almaster"/>
      <server-socket port="7776" recipe="pia.PDA"/>
      <server-socket port="7777" recipe="proxy.Web proxy"/>
      <email_addr mine="yes">almaster@logilab.org</email_addr>
      <junkbuster-method value="18" />
      <spoken-languages>
       <language name="italian" code="it" />
       <language name="english" code="fr" />
       <language name="english" code="en" />
      </spoken-languages>
    </memory>

and

::

    <memory>
      <box path="/var/spool/mail/almaster"/>
      <server-socket port="7776" recipe="pia.PDA"/>
      <server-socket port="7797" recipe="proxy.Web proxy"/>
      <email_addr mine="yes">syt@logilab.org</email_addr>
      <junkbuster-method val="18">
       <newson/>
      </junkbuster-method>
      <spoken-languages new="new attribute">
       <language name="english" code="fr" />
       <language code="it" name="italian" />
      </spoken-languages>
      <test>
       <!-- this is an append test -->
	hoye!
      </test>
    </memory>


executing *xmldiff file1 file2* will give the following result: ::

    rename_node, /memory[1]/mailbox[1], box]
    [insert-after, /memory[1]/junkbuster-method[1],
    <spoken-languages new="new attribute">
      <language code="it" name="italian"/>
    </spoken-languages>
    ]
    [insert-after, /memory[1]/spoken-languages[1],
    <test>
      <!-- this is an append test -->
    hoye!
    </test>
    ]
    [update, /memory[1]/email_addr[1]/text()[1], syt@logilab.org]
    [rename_node, /memory[1]/junkbuster-method[1]@value, val]
    [append-first, /memory[1]/junkbuster-method[1],
    <newson/>
    ]
    [move-first, /memory[1]/spoken-languages[2]/language[2], /memory[1]/spoken-languages[1]]
    [update, /memory[1]/server-socket[2]@port, 7797]
    [remove, /memory[1]/spoken-languages[2]]



This give you a list of primitives to apply on file1 to obtain file2
(you should obtain file2 after the execution of all this script!). See
[4] and [5] for more information.
The script above tell you the 9 actions to apply on file1:

* insert after the node /memory/spoken-languages[0] the below xml subtree::

      <test>
      <!-- this is an append test -->
      hoye!
      </test>

* rename node /memory/mailbox[0] to "box"

* append a node <newson> to the node /memory[0]/junkbuster-method[0]

* append an attribute named "new" with value "new attribute" to the
  node /memory/spoken-languages[0]

* update attribute /memory/server-socket[1]@port value to "7797" 

* update text /memory/email_addr/text()[0] to "syt@logilab.org"

* rename attribute /memory/junkbuster-method[0]@value to "val"

* move the attributes "code" and "name" from
  /memory[0]/spoken-languages[0]/language[1] to 
  /memory[0]/spoken-languages[0]/language[0]
  and rename them to LogilabXmldiffTmpAttr:code and
  LogilabXmldiffTmpAttr:name 

* move the attributes "code" and "name" from
  /memory[0]/spoken-languages[0]/language[0] to 
  /memory[0]/spoken-languages[0]/language[1]
  and rename them to LogilabXmldiffTmpAttr:code and
  LogilabXmldiffTmpAttr:name 

* remove node /memory/spoken-languages/language[2]

* rename attributes LogilabXmldiffTmpAttr:code and
  LogilabXmldiffTmpAttr:name of /memory/spoken-languages/language[0]
  to name and code

* rename attributes LogilabXmldiffTmpAttr:code and
  LogilabXmldiffTmpAttr:name of /memory/spoken-languages/language[1]
  to name and code
    
Note all xpath are relative to the file1 with previous steps applied.

if you would have typed "xmldiff -x file1 file2", you would have
obtained the same thing described as an Xupdate output (see [3]). ::

    <?xml version="1.0"?> 
    <xupdate:modifications version="1.0"
     xmlns:xupdate="http://www.xmldb.org/xupdate">

      <xupdate:rename name="/memory[1]/mailbox[1]" >
    box
      </xupdate:rename>

      <xupdate:insert-after select="/memory[1]/junkbuster-method[1]" >
	<xupdate:element name="spoken-languages">
	  <xupdate:attribute name="new">
	new attribute
	  </xupdate:attribute>
	  <language code="it" name="italian"/>
	</xupdate:element>
      </xupdate:insert-after>

      <xupdate:insert-after select="/memory[1]/spoken-languages[1]" >
	<xupdate:element name="test">
	  <!-- this is an append test -->
    hoye!
	</xupdate:element>
      </xupdate:insert-after>

      <xupdate:update select="/memory[1]/email_addr[1]/text()[1]" >
    syt@logilab.org
      </xupdate:update>

      <xupdate:rename name="/memory[1]/junkbuster-method[1]@value" >
    val
      </xupdate:rename>

      <xupdate:append select="/memory[1]/junkbuster-method[1]"  child="first()" >
	<xupdate:element name="newson">
	</xupdate:element>
      </xupdate:append>

      <xupdate:remove select="/memory[1]/spoken-languages[2]/language[2]" />

      <xupdate:append select="/memory[1]/spoken-languages[1]" >
	<xupdate:element name="language">
	  <xupdate:attribute name="code">
	fr
	  </xupdate:attribute>
	  <xupdate:attribute name="name">
	english
	  </xupdate:attribute>
	</xupdate:element>
      </xupdate:append>

      <xupdate:update select="/memory[1]/server-socket[2]@port" >
    7797
      </xupdate:update>

      <xupdate:remove select="/memory[1]/spoken-languages[2]" />

    </xupdate:modifications>


Warnings
--------

* This version of xmldiff doesn't process the DTD, CDATA and
  PROCESSING INSTRUCTIONS nodes, so if there is a difference between two
  document in one of those nodes, xmldiff won't see it.

* Furthermore, xml namespaces are disabled:
  <xsl:transform  xmlns:xsl="..."/> and 
  <xslt:transform xmlns:xslt="..."/> 
  are seen as different nodes  
 
* Comparing document bigger than 200Ko can take a few minutes (during
  tests, it took at about 25 seconds to diff two versions of a 130Ko
  document on a Celeron 533 box with 256Mo RAM)

* The execution time is scaled to the number of differences between
  the documents to compare
 
* Finally, a few assumptions have been made to obtain the faster
  algorithm: 

  - there is an ordering <_l on the labels in the shema such that a node
    with a label l1 can appear as the descendent of a node with a label l2
    only if l1 <_l l2

  - for any leaf x from T1, there is at most one leaf y from T2 which
    can be mapped with x (internally, 2 node may be mapped together if
    their lcs (longest common subsequence) ratio is greater than 0.6)


References
----------

1. "Tree-to-tree correction for document trees"
   by D.T. Barnard, G. Clarke, N. Duncan 
   Queen's university, Kingston, Ontario K7L 3N6 (Canada), 1995
   The "ezs" algorithm

2. "Change detection in hierarchically structured information"
   by S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom 
   Stanford University, 1996 
   The Fast Match / Edit Script algorithm (fmes), used by default 

3. http://www.xmldb.org/xupdate/xupdate-wd.html#N19b1de
   XUpdate update language

4. http://www.w3.org/TR/2000/REC-xml-20001006
   XML 1.0 W3C recommendation

5. http://www.w3.org/TR/xpath
   XML path language 1.0 W3C recommendation


Feedback
--------

xmldiff discussion should take place on the xml-logilab mailing list.
Please check http://lists.logilab.org/mailman/listinfo/xml-projects for 
information on subscribing and the mailing list archives.
