DRBD 0.8 Roadmap
----------------

1 Drop support for linux-2.4.x.
  Do all size calculations on the basis of sectors (512 bytes), as
  is common in Linux-2.6.x.
  (Currently they are done on a 1 KB basis, for 2.4.x compatibility.)
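
  A minimal sketch of the convention (illustrative only, not the
  actual DRBD code): keep all sizes in 512-byte sectors internally
  and convert only at the edges.

    /* illustrative only: 1 KB = 2 sectors of 512 bytes */
    static inline unsigned long long kb_to_sectors(unsigned long kb)
    {
            return (unsigned long long)kb << 1;
    }

    static inline unsigned long sectors_to_kb(unsigned long long sectors)
    {
            return (unsigned long)(sectors >> 1);   /* rounds down */
    }
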
  90% DONE

2 Drop the Drbd_Parameter_Packet.
  Replace the Drbd_Parameter_Packet by 4 small packets:
  Protocol, GenCnt, Sizes and State.
  The receiving code for these small packets is sane compared
  to the huge receive_params() function we had before.
  40% DONE

3 Authenticate the peer upon connect by using a shared secret.
  Configuration file syntax:  net { cram-hmac-alg "sha1";
  shared-secret "secret-word"; }
  This uses challenge-response authentication.
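
  Roughly, the exchange could look like this sketch (hmac_sha1(),
  the packet names and the framing helpers are assumptions for
  illustration, not the actual implementation). Each side challenges
  the other in the same way, so authentication is mutual:

    #define CHALLENGE_LEN 64
    #define DIGEST_LEN    20   /* SHA-1 */

    /* hypothetical helpers: send_pkt()/recv_pkt() frame one DRBD
     * packet; hmac_sha1() computes the HMAC over the challenge */
    int drbd_auth_peer(int sock, const char *secret, size_t slen)
    {
            unsigned char challenge[CHALLENGE_LEN];
            unsigned char response[DIGEST_LEN], expected[DIGEST_LEN];

            get_random_bytes(challenge, sizeof(challenge));
            send_pkt(sock, P_AUTH_CHALLENGE, challenge, sizeof(challenge));

            /* the peer answers with HMAC(shared-secret, challenge) */
            if (recv_pkt(sock, P_AUTH_RESPONSE, response, sizeof(response)))
                    return -1;
            hmac_sha1(secret, slen, challenge, sizeof(challenge), expected);

            return memcmp(response, expected, DIGEST_LEN) ? -1 : 0;
    }
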
  99% DONE

4 Consolidate state changes into a central function that makes
  sure that the new state is valid. Replace set_cstate() with
  a force_state() and a request_state() function. Make all
  state changes atomic, and consolidate the many different
  cstate error states into a single "NetworkFailure" state.
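
  A minimal sketch of the intended structure (the lock, the type and
  the field names are assumptions):

    enum cstate { Unconnected, Connected, NetworkFailure /* , ... */ };

    static int is_valid_transition(enum cstate os, enum cstate ns);

    /* validated, atomic state change; refuses invalid transitions */
    int request_state(struct drbd_conf *mdev, enum cstate ns)
    {
            int rv = 0;
            spin_lock_irq(&mdev->state_lock);
            if (is_valid_transition(mdev->cstate, ns))
                    mdev->cstate = ns;
            else
                    rv = -EINVAL;
            spin_unlock_irq(&mdev->state_lock);
            return rv;
    }

    /* atomic as well, but without the validity check */
    void force_state(struct drbd_conf *mdev, enum cstate ns)
    {
            spin_lock_irq(&mdev->state_lock);
            mdev->cstate = ns;
            spin_unlock_irq(&mdev->state_lock);
    }
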
  50% DONE

5 Three configuration options, to allow a more fine-grained definition
  of DRBD's behaviour after a split-brain situation:

  In case the nodes of your cluster see each other again after
  a split-brain situation in which both nodes were primary
  at the same time, you have two diverged versions of your data.

  In case both nodes are secondary, you can control DRBD's
  auto-recovery strategy with the "after-sb-0pri" option. The
  default is to disconnect.
     "disconnect" ... No automatic resynchronisation, simply disconnect.
     "discard-younger-primary"
                      Auto sync from the node that was primary before
                      the split brain situation happened.
     "discard-older-primary"
                      Auto sync from the node that became primary
                      as second during the split brain situation.
                      If discard-younger-primary and discard-older-primary
		      can not find a decissions, they fall back to
                      discard-least-changes.
     "discard-zero-changes"
                      Auto sync from the node that modified
                      blocks during the split brain situation, but only
		      if the target not did not touched a single block.
                      If both nodes touched their data, this policy
		      falls back to disconnect.
     "discard-least-changes"
                      Auto sync from the node that touched more
                      blocks during the split brain situation.
     "discard-node-NODENAME"
                      Auto sync _to_ the named node.

  In case one of the nodes is already primary, the auto-recovery
  strategy is controlled by the "after-sb-1pri" option.
     "disconnect" ... always disconnect
     "consensus"  ... discard the version of the secondary if the outcome
                      of the "after-sb-0pri" algorithm would also destroy
                      the current secondary's data. Otherwise disconnect.
     "violently-as0p" Always take the decission of the "after-sb-0pri"
                      algorithm. Even if that causes case an erratic change
		      of the primarie's view of the data.
	              This is only usefull if you use an 1node FS (i.e.
		      not OCFS2 or GFS) with the allow-two-primaries
		      flag, _AND_ you really know what you are doing.
		      This is DANGEROUS and MAY CRASH YOUR MACHINE if you
		      have a FS mounted on the primary node.
     "discard-secondary"
                      discard the version of the secondary.
     "call-pri-lost-after-sb"
                      Always honour the outcome of the "after-sb-2sc"
                      algorithm. In case it decides that the current
                      secondary has the right data, it tries to make
                      the current primary secondary, if that fails
                      it calls the "pri-lost-after-sb" helper program
                      on the current primary. That helper program is
                      expected to halt the machine.

  In case both nodes are primary you control DRBD's strategy by
  the "after-sb-2pri" option.
     "disconnect" ... Go to StandAlone mode on both sides.
     "violently-as0p" Always take the decission of the "after-sb-0pri"
                      algorithm. Even if that causes case an erratic change
		      of the primarie's view of the data.
	              This is only usefull if you use an 1node FS (i.e.
		      not OCFS2 or GFS) with the allow-two-primaries
		      flag, _AND_ you really know what you are doing.
		      This is DANGEROUS and MAY CRASH YOUR MACHINE if you
		      have a FS mounted on the primary node.
     "call-pri-lost-after-sb"
	              Honor the outcome of the "after-sb-0pri" algorithm
                      and calls the "pri-lost-after-sb" program on the
		      other node. That helper program is expected to
                      halt the machine.
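
  A combined example, following the config syntax used above (values
  chosen for illustration only):

    net {
      after-sb-0pri discard-younger-primary;
      after-sb-1pri consensus;
      after-sb-2pri disconnect;
    }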

  Defaults:
  after-sb-0pri = disconnect;
  after-sb-1pri = disconnect;
  after-sb-2pri = disconnect;

  DRBD-07 was:
  after-sb-0pri = discard-younger-primary;
  after-sb-1pri = consensus;
  after-sb-2pri = disconnect;

  NB: To allow the user to resolve such situations manually,
      the "drbdadm connect" command (this maps to the "drbdsetup net"
      command) gets a short-lived flag called "--discard-my-data".
  99% DONE

6 It is possible that a secondary node crashes a primary by
  returning invalid block_ids in ACK packets. [This might be
  caused either by faulty hardware or by a hostile modification
  of DRBD on the secondary node.]

  Proposed solution:

  Have a hash table (hlist_head style), add the collision
  member (hlist_node) to drbd_request.

  Use the sector number of the drbd_request as the key to the hash;
  each drbd_request is also put into this hash table. We still use
  the pointer as block_id.

  When we get an ACK packet, we look up the hash table with the
  block_id, and may find the drbd_request there. Otherwise it
  was a forged ACK.

  Note: The actual key to the hash should be (sector & ~0x7).
        See item 9 for more details.
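
  A sketch of the lookup (the table size, field and function names
  are assumptions):

    #define TL_HASH_SIZE 2048
    static struct hlist_head tl_hash[TL_HASH_SIZE];

    static struct hlist_head *tl_hash_slot(sector_t sector)
    {
            /* the actual key is (sector & ~0x7), see item 9 */
            return tl_hash + ((unsigned long)(sector >> 3) % TL_HASH_SIZE);
    }

    /* on request submission */
    static void tl_hash_add(struct drbd_request *req)
    {
            hlist_add_head(&req->collision, tl_hash_slot(req->sector));
    }

    /* on ACK reception: the block_id is only trusted if the
     * corresponding request is found in the hash table */
    static struct drbd_request *find_request(u64 id, sector_t sector)
    {
            struct hlist_node *n;
            struct drbd_request *req;

            hlist_for_each_entry(req, n, tl_hash_slot(sector), collision)
                    if ((unsigned long)req == (unsigned long)id)
                            return req;
            return NULL;    /* forged (or long gone) ACK */
    }
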
  99% DONE

7 Handle split brain situations; Support IO fencing;

  New commands:
    drbdadm outdate r0

    When the device is configured, this works via an ioctl() call.
    Otherwise it modifies the meta-data directly by
    calling drbdmeta.

  remove option: on-disconnect

  New meta-data flag: "Outdated"

  introduce:
  disk {
    fencing [ dont-care | resource-only | resource-and-stonith ];
  }

  handlers {
    outdate-peer "some script";
  }

  If the disk state of the peer is unknown, drbd calls this
  handler (yes, a call to user space from kernel space). The handler's
  return codes are:

  3 -> peer is inconsistent
  4 -> peer is outdated (this handler outdated it) [ resource fencing ]
  5 -> peer was down / unreachable
  6 -> peer is primary
  7 -> peer got stonithed [ node fencing ]
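
  A sketch of how the kernel side might run the handler and map its
  exit code (the helper invocation details and all names are
  assumptions):

    static int drbd_outdate_peer(struct drbd_conf *mdev)
    {
            char *argv[] = { "/bin/sh", "-c", mdev->outdate_peer_cmd, NULL };
            char *envp[] = { "HOME=/", "PATH=/sbin:/usr/sbin:/bin:/usr/bin",
                             mdev->env_resource,  /* "DRBD_RESOURCE=r0"   */
                             mdev->env_peer,      /* "DRBD_PEER=nodename" */
                             NULL };
            int ret = call_usermodehelper(argv[0], argv, envp, 1 /* wait */);
            int code = (ret >> 8) & 0xff;   /* the handler's exit code */

            switch (code) {
            case 3: /* peer is inconsistent */
            case 4: /* peer is outdated (resource fencing) */
            case 5: /* peer was down / unreachable */
            case 6: /* peer is primary */
            case 7: /* peer got stonithed (node fencing) */
                    return code;
            default:
                    return -1;              /* handler misbehaved */
            }
    }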

  Let us assume that we have two boxes (N1 and N2) and that these
  two boxes are connected by two networks (net and cnet [ clients'-net ]).

  Net is used by DRBD, while heartbeat uses both net and cnet.

  I know that you are talking about fencing by STONITH, but DRBD is
  not limited to that. Here comes my understanding of how resource fencing
  should work with DRBD v8:

   N1  net   N2
   P/S ---  S/P     everything up and running.
   P/? - -  S/?     network breaks ; N1 freezes IO
   P/? - -  S/?     N1 fences N2:
                    In the STONITH case: turn off N2.
                    In the resource fencing case:
                    N1 asks N2 to fence itself from the storage via cnet.
                    HB calls "drbdadm outdate r0" on N2.
                    N2 replies to N1 that fencing is done via cnet.
                    The outdate-peer script on N1 returns success to DRBD.
   P/D - -  S/?     N1 thaws IO

  N2 got the "Outdated" flag set in its meta-data by the outdate
  command.

  Setting "fencing" to "resource-only" enables this behaviour. In the
  resource-only case the outdate-peer handler should return
  3, 4, 5 or 6, but must not return 7.

  In case "fencing" is set to "resource-and-stonith", all IO operations
  get immediately frozen (even all currently outstanding IO operations
  will not finish) upon loss of connection.

  Then the "outdate-peer" handler is started. In this configuration
  the outdate-peer handler might return any of the documented return
  values.

  When the outdate-peer handler returns, IO is resumed.

  Notes:
  * Why do we need to freeze IO in the "resource-and-stonith" case:
      Stonith protects you when all communication paths fail. In
      that case both (isolated) nodes try to stonith each other.
      If the current primary continued to allow IO, it could
      accept transactions, but could then get stonithed by the
      currently secondary node.
      -> Therefore others could see committed transactions that
         would be gone after the successful stonith operation.

  * The outdate-peer handler also gets called if an unconnected
    secondary wants to become primary.
    In other words, it may only become primary when it knows that
    the peer is outdated/inconsistent.

  * We need to store the fact that the peer is outdated/inconsistent
    in the meta-data, to allow a stand-alone primary to be rebooted.

  * The outdate-peer program gets two environment variables:
    DRBD_RESOURCE, the name of the DRBD resource, and DRBD_PEER,
    the host name of the peer.

  99% DONE

8 New command drbdmeta

  We move read_gc.pl/write_gc.pl to the user directory and
  merge them into one C program: drbdmeta
   -> in the future the module never creates the meta-data
      block. One can use drbdmeta to create, read and
      modify the meta-data block. drbdmeta refuses to write
      to it as long as the module is loaded (configured).

  drbdsetup gets the ability to read the gc values while DRBD
  is set up via an ioctl() call. -- drbdmeta refuses to run
  if DRBD is configured.

  drbdadm is the nice front end. It always uses the right
  back end (drbdmeta or drbdsetup)...

  drbdadm set-gi 1:2:3:4:5:6 r0
  drbdadm get-gi r0
  drbdadm md-create r0

  md-create would ask nasty questions about whether you are really
  sure and so on, and do some plausibility checks first.
  set-gi would be undocumented and for wizards only.
  80% DONE

9 Support shared disk semantics  ( for GFS, OCFS etc... )

    All the thoughts in this area imply that the cluster deals
    with split-brain situations as discussed in item 7.

  In order to offer a shared disk mode for GFS, we allow both
  nodes to become primary. (This needs to be enabled with the
  config statement net { allow-two-primaries; } )

 Read after write dependencies

  The shared mode is available to clusters using protocol C.
  It is not usable with protocol A or B.

 Global write order

  [ Description of GFS-mode-arbitration2.pdf ]

  1. Basic mirroring with protocol C.
    The file system on N2 issues a write request towards DRBD,
    which is written to the local disk and sent to N1. There
    the data block is written to the local disk and an
    acknowledgement packet is sent back. As soon as both the
    write to the local disk and the ACK from N1 reach N2,
    DRBD signals the completion of IO to the file system.

    The major pitfall is the handling of concurrent writes to the
    same block. (Concurrent writes to the same blocks should not
    happen, but we have to assume that it is possible that the
    synchronisation methods of our upper layer [e.g. OpenGFS]
    may fail.)

    There are many cases in which such concurrent writes would
    lead to different data on our two copies of the block.

  *** FIXME ***
  description of algorithm here is out of date,
  we handle things slightly differently now in the code.

  2. Concurrent writes, network latency is lower than disk latency
    As we can see on the left side of figure two, this could lead
    to N1 having the blue version (= data from the FS on N2) while N2
    ends up with the green version (= data from the FS on N1).
    The solution is to flag one node (in the example N2 has the
    discard-concurrent-writes-flag).
    As we can see on the right side, both nodes now end up with
    the blue data.

  3. Concurrent writes, high latency for data packets.
    The problem now is that N2 cannot detect that this was
    a concurrent write, since it got the ACK before the conflicting
    data packet came in.
    This can happen since in DRBD, data packets and ACK packets are
    transmitted via two independent TCP connections; therefore an
    ACK packet can overtake a data packet.
    The solution is to send, along with the ACK packet, a discard-info
    packet, which identifies the data packet by its sequence number.
    N2 keeps this discard info as long as it has not yet seen higher
    sequence numbers.
    With this, both nodes end up with the blue data.

  4. Concurrent writes, high latency for data packets.
    This is the inverse of case 3 and is already handled by the means
    introduced with item 1.

  5. New write while processing a write from the peer.
    Without further measures this would lead to an inconsistency in
    our mirror, as the figure on the left side shows.
    If we are currently writing a conflicting block from the peer,
    we simply discard the write request from our FS and signal IO
    completion immediately.

  6. High disk latency on N2.
    Through IO reordering in the layers below us, this could lead to
    having the blue data on N2 and the green data on N1.
    The solution in this case is to delay the peer's write to the local
    disk on N2 until the local write is done. This is different from
    case 2, since we already got the write ACK for the conflicting
    block.

  7. A data packet overtakes an ACK packet on the network.
    Although this case is quite unlikely, we have to take it into
    account. From N2's point of view this looks a lot like case 4,
    but N2 must not discard the data packet now!

 Proposed solution

  We arbitrarily select one node (e.g. the node that did the first
  accept() in the drbd_connect() function) and mark it with the
  discard-concurrent-writes-flag.

  Each data packet and each ACK packet gets a sequence
  number, which is increased with every packet sent.
  (Data and ACK packets share a common sequence number space.)

  The algorithm performed upon reception of a
  data packet [drbd_receiver]:

  *  If the sequence number of the data packet is higher than
     last_seq+1, sleep until last_seq+1 == seq_num(data packet).
     [needed to satisfy example case 7]

  1. If the packet's sequence number is on the discard list,
     simply drop it.
     [ ex.c. 3]
  2. Do we have a concurrent request? (i.e. Do I have a request
     to the same block in my transfer log.) If not -> write now.
     [ default ]
  3. Have I already got an ACK packet for the concurrent
     request? (Does the request already have the RQ_DRBD_SENT bit set?)
     If yes -> write the data from the data packet afterwards.
     [ ex.c. 6]
  4. Do I have the "discard-concurrent-writes-flag"?
     If yes -> discard the data packet.
     If no -> Write the data from the data packet afterwards and set
              the RQ_DRBD_SENT bit in the request object (since
              we will not get an ACK from our peer). Mark the
              ee so that the ACK packet is prepended with a
              discard-info packet.
     [ ex.c. *]

  The algorithm performed upon reception of an
  ACK packet [drbd_asender]:

  * If we get an ACK, store its sequence number in last_seq.

  The algorithm performed upon reception of a
  discard-info packet [drbd_asender]:

  * If the current last_seq is lower than the sequence number of the
    packet that should be discarded, store that number in the discard list.

  BTW, each time we have a concurrent write access, we print
  a warning to the syslog, since this indicates that the layer
  above us is broken!

  Note: In item 6 we created a hash table over all requests in the
        transfer log, keyed with (sector & ~0x7). This allows us
        to quickly find IO operations starting in the same 4k block
        of data. -> With two lookups in the hash table we can
        find any concurrent access.
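
  A sketch of the receive-side decision tree described above (all
  helper names, flags and fields are assumptions, not the actual
  code):

    static void got_data_packet(struct drbd_conf *mdev, struct p_data *p)
    {
            struct drbd_request *req;

            /* case 7: never process a data packet before the ACKs
             * that were sent ahead of it */
            wait_event(mdev->seq_wait, p->seq_num <= mdev->last_seq + 1);

            if (on_discard_list(mdev, p->seq_num))
                    return;                              /* step 1, ex.c. 3 */

            req = find_concurrent_request(mdev, p->sector);
            if (!req) {
                    write_now(mdev, p);                  /* step 2, default */
            } else if (req->rq_status & RQ_DRBD_SENT) {
                    write_after_local(mdev, p, req);     /* step 3, ex.c. 6 */
            } else if (test_bit(DISCARD_CONCURRENT, &mdev->flags)) {
                    /* step 4, we win: drop the peer's data */
            } else {
                    /* step 4, peer wins: write afterwards, fake the ACK
                     * we will never get, announce the discard to the peer */
                    req->rq_status |= RQ_DRBD_SENT;
                    write_after_local(mdev, p, req);
                    mark_ee_send_discard_info(mdev, p->seq_num);
            }
    }
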
  99% DONE

10 Change Sync-groups to sync-after

  Sync groups turned out to be hard to configure in more
  complex setups, hard to implement right, and last but not least
  not flexible enough to cover all real-world scenarios.

  E.g. two physical disks should be mirrored with DRBD. On one
       of the disks there is only a single partition, while the
       other one is divided into several (e.g. 4 smaller) partitions.
       One would want to sync the big one in parallel with the
       4 small ones, while the resyncs of the 4 small
       ones need to be serialized among themselves.
       -> With the current sync groups you cannot express
          this requirement.

  Remove config options   syncer { group <number>; }
  Introduce config options   syncer { after <resource>; }
  99% DONE
      Finished the implementation. Tested.

11 Take into account that the two systems could have different
  PAGE_SIZE.

  At least we should negotiate the PAGE_SIZE used by the peers,
  and use it. In case the PAGE_SIZE is not the same, inform
  the user about the fact.

  A general high-performance implementation for this issue is
  probably not necessary, since clusters of machines with
  different PAGE_SIZE are of academic interest only.
  100% DONE by item 15

12 Introduce a "common" section in the config file. Option
  sections (like handlers, startup, disk, net and syncer)
  are inherited from the common section if they are not
  defined in a resource section.
  99% DONE

13 Introduce a UUID (universally unique identifier) in the
  meta-data. One purpose is to tag the bitmap with this UUID.
  If the peer's UUID differs from what we expect, we know that
  we have to do a full sync....
  99% DONE
  -> Will go out again, and be replaced by UUIDs for data
     generations. See item 16.

14 Sanitize ioctls to include a standard device information struct
  at the beginning, including the expected API version.
  Consider using DRBD ioctls with some char device similar to
  /dev/mapper/control

  The new interface is now based on netlink (actually connector).
  It is based on the concept of tag lists. The idea is that on the
  interface we pass lists (actually arrays) of tags, where each
  tag identifies the following snippet of data.
  Each tag also states whether it is mandatory.

  In case we have to add a new value to the interface, the
  existing userland tools continue to work with newer kernel
  modules and vice versa. (Only the older of the two will
  warn the user that there was an unknown tag on the
  interface, and that the unknown tag got ignored.)
  But the basic functionality stays intact!
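
  A sketch of the tag-list walk (the on-wire layout shown here -- a
  16-bit tag number with a mandatory bit, a 16-bit length, then the
  data -- is an assumption for illustration, not the actual format):

    #define T_MANDATORY 0x8000   /* assumed flag bit in the tag number */

    struct tag { u16 number; u16 len; /* 'len' bytes of data follow */ };

    static int parse_tag_list(const unsigned char *buf, size_t size)
    {
            while (size >= sizeof(struct tag)) {
                    const struct tag *t = (const void *)buf;
                    u16 nr = t->number & ~T_MANDATORY;

                    if (sizeof(struct tag) + t->len > size)
                            return -1;   /* malformed list */
                    if (known_tag(nr))
                            apply_tag(nr, buf + sizeof(struct tag), t->len);
                    else if (t->number & T_MANDATORY)
                            return -1;   /* must not ignore this one */
                    /* unknown optional tag: warn and skip it */

                    buf  += sizeof(struct tag) + t->len;
                    size -= sizeof(struct tag) + t->len;
            }
            return 0;
    }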

  While implementing this, we also implemented dynamic device allocation.

  drbdsetup is basically call-compatible with its ioctl-based
  ancestor, but has two new options:

    --create-device ___ create the device in case it does not exist yet.
    --set-defaults ____ reset all options not mentioned to their default values.

  Things to do:

  * Locking in userspace, to prevent multiple instances of drbdsetup
  * Think about locking in kernel space ( device_mutex? )

  80% DONE

15 Accept BIOs bigger than one page, probably up to 32k (8 pages)
  currently.
  * Normal requests. -> DONE
  * Make the syncer accumulate adjacent bits into bigger requests. -> DONE
  * Make the bitmap more coarse-grained. -> TODO
  66% DONE

16 Replace the current generation counters with a data-generation-UUID
   concept.
  The current generation counters have various weaknesses:
   * In a split-brained cluster, applying the same events
     to both cluster nodes could lead to equal generation counters
     on both nodes, while the data is certainly not in sync.
   * They are completely unsuitable if a 3rd node is used for
     e.g. weekly snapshots.
   * A graceful takeover while disconnected is not possible.

  We associate each data generation with a unique UUID (= a 64-bit
  random number). A new data generation is created when a primary node
  is disconnected from its secondary and when a degraded secondary
  becomes primary for the first time.

  In the meta-data we store a few generation UUIDs:
   * current
   * bitmap
   * history[2]

  As well as the currently known flags:
   Consistent, WasUpToDate, LastState, ConnectedInd, WantFullSync

  When the cluster is in Connected state, the bitmap gen-UUID
  is set to 0 (since the bitmap is empty). When we create a new current
  gen-UUID while we are disconnected, the (old) current gets backed up
  to the bitmap gen-UUID. (This allows us to identify the base
  of the bitmap later.)

  Special UUID values:
  JustCreated [JC] ___  4

  ALGORITHMS

  Upon Connect:
      self   peer   action
  1.  C=JC   C=JC   No Sync
  2.  C=JC   C!=JC  I am SyncTarget setting BM
  3. C!=JC   C=JC   I am SyncSource setting BM
  4.   C   =   C    Common power [off|failure] (examine the roles at crash time)
  4.1  sec   sec    Common power off, no sync.
  4.2  pri   sec    Common power failure, I am SyncSource using BM
  4.3  sec   pri    Common power failure, I am SyncTarget using BM
  4.4  pri   pri    Common power failure, resync in arbitrary direction.
  5.   C   =   B    I am SyncTarget using BM
  6.   C   = H1|H2  I am SyncTarget setting BM
  7.   B   =   C    I am SyncSource using BM
  8. H1|H2 =   C    I am SyncSource setting BM
  9.   B   =   B    [ and B != 0 ] SplitBrain, try auto recover strategies.
  10 H1|H2 = H1|H2  SplitBrain, disconnect.
  11.               Warn about unrelated Data, disconnect.
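
  A sketch of this decision table in code (names and the return
  convention are assumptions; split brain and unrelated-data handling
  are left out for brevity):

    /* indices into the UUID arrays, matching the list above */
    enum { Current, Bitmap, History1, History2 };

    #define UUID_JUST_CREATED 4ULL   /* special value from above */
    #define ROLE_BITS         1ULL   /* node role embedded in the UUID (assumed) */

    /* >0: we are SyncSource, <0: SyncTarget, 0: no sync needed;
     * |2| means "setting the bitmap", |1| means "using the bitmap" */
    static int drbd_uuid_compare(struct drbd_conf *mdev)
    {
            u64 self = mdev->uuid[Current]   & ~ROLE_BITS;
            u64 peer = mdev->p_uuid[Current] & ~ROLE_BITS;

            if (self == UUID_JUST_CREATED && peer == UUID_JUST_CREATED)
                    return 0;                               /* rule 1 */
            if (self == UUID_JUST_CREATED)
                    return -2;                              /* rule 2 */
            if (peer == UUID_JUST_CREATED)
                    return 2;                               /* rule 3 */
            if (self == peer)
                    return crashed_primary_decision(mdev);  /* rules 4.1-4.4 */
            if (self == (mdev->p_uuid[Bitmap] & ~ROLE_BITS))
                    return -1;                              /* rule 5 */
            if ((mdev->uuid[Bitmap] & ~ROLE_BITS) == peer)
                    return 1;                               /* rule 7 */
            /* rules 6, 8: history hits; 9, 10: split brain; 11: unrelated */
            return history_and_split_brain(mdev);
    }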

  Upon Disconnect:
   Primary:
      Copy the current-UUID over to the bitmap-UUID, create a new
      current-UUID.
   Secondary:
      Nothing to do.

  Upon becoming Primary:
   In case we are disconnected and the bitmap-UUID is empty, copy the
   current-UUID over to the bitmap-UUID and create a new current-UUID.
   Special case: primary with --do-what-I-say; clearing the inconsistent
                 flag causes a new UUID to be generated.

  Upon start of resync:
   Clear the consistent-flag on the SyncTarget. Generate a new UUID for
   the bitmap-UUID of the SyncSource and the current-UUID of the SyncTarget.

  Upon finish of resync:
   Set the bitmap-UUID to 0. The SyncTarget adopts the current-UUID
   of the SyncSource, and sets its consistent-flag.

  When the bitmap-UUID gets cleared, move the previous value to H1.
  In case H1 was already set, copy its previous value to H2, etc.

  For the auto-recovery strategies after split brain (see item 5)
  it is necessary to embed the node's role into the UUIDs.
  This is of course masked out when the UUIDs are compared.

  * Note 1: Discontinue the --human and --timeout options when
            becoming primary.
            NB: If they are needed, I think they can be implemented
                as special UUID values.

  99% DONE. Kernel part is implemented, userland parts are implemented,
            --human and --timeout-expired are removed.
            Everything seems to work so far.

  Known issues: we have to define behaviour for the two-primaries case,
  and for connection loss when Primary with local disk != UpToDate.

17 Something like

   drbdx: WARNING disk sizes more than 10% different

  would be nice at (initial) full sync.

18 Connection-Teardown Packet. Currently the new state checks
  disallow "drbdadm disconnect res" on the primary node of a
  connected cluster.
  The teardown packet causes the secondary node to outdate
  its data and to close the connection in one go.
  99% DONE.

19 Make the updates to the bitmap transactional, especially for resizing.
  Make updates to the superblock transactional as well.

20 There are quite a number of parameters that must be set equal
   (or, for some, reciprocal) on the two nodes. We need to ensure that
   the config is valid from the viewpoint of the whole cluster.
   E.g.
   protocol				equal
   after-sb-0pri / discard-local/remote	equal / reciprocal
   after-sb-1pri			equal
   after-sb-2pri			equal
   want_lose				reciprocal
   two_primaries			equal
  99% DONE

21 Write barriers in the kernel
  In Linux-2.6 write barriers in the block-io layer are represented as
  REQ_SOFTBARRIER, REQ_HARDBARRIER and REQ_NOMERGE flags on requests.
  In the BIO layer this is BIO_RW_BARRIER, which is usually set on
  BIO_RW (=write) requests.

  The REQ_HARDBARRIER bit is currently used to do a cache flush on
  IDE devices. Actually, not all IDE devices can do cache flushes; there
  are some older models out there that can do write-caching but
  cannot perform a cache flush!

  Journaling file systems should use this barrier mechanism for their
  journal writes (actually for the commit block, which is the last write
  in a transactional update to the journal).

  As for DRBD, we should probably ship the REQ_HARDBARRIER flag with
  our wire protocol (or should it be expressed by barrier packets?).

  We will only see such REQ_HARDBARRIER flags if we state to the upper
  layers that we are able to deal with them. We need to announce this with
  blk_queue_ordered(q, QUEUE_ORDERED_FLUSH or QUEUE_ORDERED_TAG).
  The default is QUEUE_ORDERED_NONE, which is why we currently never
  see the REQ_HARDBARRIER flag.

  Another consequence of this is that IDE devices that do _not_ support
  cache flushes but have the write cache enabled are inherently unsafe
  to use with a journaled file system.

  SCSI's tagged queuing (seems to be present in SATA as well)
    [excerpt from http://www.scsimechanic.com/scsi/SCSI2-07.html]

    Tagged queuing allows a target to accept multiple I/O processes from
    the same or different initiators until the logical unit's command queue
    is full.

    If only SIMPLE QUEUE TAG messages are used, the target may execute the
    commands in any order that is deemed desirable within the constraints
    of the queue management algorithm specified in the control mode page
    (see 8.3.3.1).

    If ORDERED QUEUE TAG messages are used, the target shall execute the
    commands in the order received with respect to other commands received
    with ORDERED QUEUE TAG messages. All commands received with a SIMPLE
    QUEUE TAG message prior to a command received with an ORDERED QUEUE
    TAG message, regardless of initiator, shall be executed before that
    command with the ORDERED QUEUE TAG message. All commands received with
    a SIMPLE QUEUE TAG message after a command received with an ORDERED
    QUEUE TAG message, regardless of initiator, shall be executed after
    that command with the ORDERED QUEUE TAG message.

    A command received with a HEAD OF QUEUE TAG message is placed first in
    the queue, to be executed next. A command received with a HEAD OF
    QUEUE TAG message shall be executed prior to any queued I/O
    process. Consecutive commands received with HEAD OF QUEUE TAG messages
    are executed in a last-in-first-out order.

  I think in the context of SCSI the kernel usually issues write requests
  with the SIMPLE QUEUE TAG, and requests with the REQ_HARDBARRIER
  (i.e. bio's with the BIO_RW_BARRIER) with an ORDERED QUEUE TAG.

  What QUEUE_ORDERED_ type should we expose ?

    In order to support capable IDE devices right, we should ship the
    BIO_RW_BARRIER bit with our data packets in case the peer's backing
    storage is of the QUEUE_ORDERED_FLUSH type.

    If both devices are of the QUEUE_ORDERED_TAG type, we should also
    claim to be of that type, and ship the BIO_RW_BARRIER bit as well.

    self   peer      DRBD
    ---------------------
    NONE , NONE  =>  NONE
    NONE , FLUSH =>  NONE
    NONE , TAG   =>  NONE
    FLUSH, NONE  =>  NONE
    FLUSH, FLUSH =>  FLUSH
    FLUSH, TAG   =>  FLUSH
    TAG,   NONE  =>  NONE
    TAG,   FLUSH =>  FLUSH
    TAG,   TAG   =>  TAG
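
  The table is simply the minimum of the two capabilities. A sketch
  (the numeric ordering NONE < FLUSH < TAG is assumed here for the
  comparison, not the kernel's actual constant values):

    static int combined_ordering(int self, int peer)
    {
            /* assumes QUEUE_ORDERED_NONE < QUEUE_ORDERED_FLUSH
             *         < QUEUE_ORDERED_TAG numerically */
            return self < peer ? self : peer;
    }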

  How should we deal with our self-generated barrier packets?

    In case our backing device is of the QUEUE_ORDERED_NONE class, we
    have to stay with the current code.

    In case our backing device only supports QUEUE_ORDERED_FLUSH we
    will use the current code. That means that when we receive a write
    barrier packet, we wait until all of our pending local write
    requests are done. (This potentially causes congestion on the TCP
    socket...)

    In case our backing device's queue properties are set to
    QUEUE_ORDERED_TAG, we offload the complete barrier logic to the
    backing storage device:

    * When we receive a barrier packet
      - If we have no local pending requests, we send the barrier ACK
        immediately. (= current code)
      - If the last_barrier_write member of mdev points to an epoch_entry
        we set bit 31 of bnum.
      - If we have local pending requests, we set a flag that the next
        data packet has to be written with the BIO_RW_BARRIER flag.
        (That flag should be called BARRIER_NEEDED)

    * When receiving data packets we test_and_clear BARRIER_NEEDED,
      and if it was set, we set BIO_RW_BARRIER on the write request.
      We also set the last_barrier_write member of mdev.
      [Normal writes clear the last_barrier_write member of mdev.]

    * When a write completes and it has the bnum set, send the barrier
      ack before sending the ack for the write. In case the highest
      bit of bnum is set as well, also send the barrier ack following
      the write ack of the data packet.

  90% DONE [ Not tested yet. ]

22 Reboot notifier.

23 Externally imposed SyncPause states.
   There are two new commands: 'drbdadm pause-sync res'
                               'drbdadm resume-sync res'
   These may be used to suspend the resynchronisation process while
   e.g. the backing storage's RAID controller does its own resynchronisation.

   While implementing this, I also made sure that in a setup with more
   than two nodes the two peers of a connection will agree on whether
   a resynchronisation is paused, under all conditions you can think of!

   99% DONE

24 Make it possible to hot-add disk drives == Atomic configuration changes.

   99% DONE

25 Add reserved fields to DRBD-meta-data, add a bytes per bit field to
   metadata.

   99% DONE

26 Implement a kind of "dstate" command to make integration with
   Heartbeat-2.0's master/slave-support possible.

   99% DONE

27 Remove all explicit drbd_md_write() calls, and create a mechanism
   that always keeps the on-disk meta-data up to date implicitly.
   Calling drbd_md_write() explicitly is too error-prone.

   99% DONE

28 Implement a kind of 'call home': a single HTTP GET request that
   gets counted in a database. The initiator calculates a simple
   hash over the machine and resource names. Each time a meta-data
   set gets generated, the 'call home' is initiated. The user may
   of course opt out of this.

   99% DONE

29 Give drbdadm a 'hidden-commands' command that also shows
   the hidden sub-commands in the usage output.

   99% DONE

30 The current drbdadm_scanner is 1 MB in source and as a binary.
   Use a _basic_ flex scanner and a hand-written parser for superb
   error reporting.

   99% DONE

31 Resizing by several GB results in ko-count timeouts, maybe because the
   secondary node does the enlargement of the bitmap in the receiver (?)

   DONE, by using the async bitmap IO code.

32 drbdmeta: with internal meta-data, the v07 and v08 meta-data super
   blocks are in different places. -> It is possible to have v07 AND v08
   meta-data on one device.
   => drbdmeta should make sure that it overwrites the other location
      when it creates a meta-data block.

   99% DONE

33 Serialize state changes like secondary -> primary and
   Connected -> SyncSource in the cluster.

      role <- primary
      conn <- StartingSyncT (disk <- inconsistent)
      conn <- StartingSyncS (pdsk <- inconsistent)
      disk <- Diskless (as long as it happens as administrative command)
      pdsk <- Outdated (= a 'disconnect' issued on a primary node)

   * When a state change might sleep ( request_state() ) and it is
     to be cluster-wide atomic ( pre_state_checks() determines this! ):
        1. Acquire the cluster state change lock (bit & wait queue).
        2. Send a request_state packet.

   * When a request_state packet is received

        1. * If we are UNIQUE, we take the cluster lock (potentially
             waiting for it) and try to apply the remote's request
             as soon as we have the lock.
           * When we are not UNIQUE, we try to apply the state change
             immediately (without taking the cluster lock).
        2. We send the ACK / NACK.
           ( Do we actually need an ACK/NACK?
                * On the not-UNIQUE side, we will fail the request as
                  soon as the offending state request comes in.
                * On the UNIQUE side we need the positive ACK to
                  continue.
             ) I guess for the sake of completeness we should
               have both packets, although currently the need for
               the NACK packet is not obvious.

   * When we receive an ACK / NACK, we either successfully finish or
     fail the request_state() call. (Error codes should be passed
     from the peer.)

   * When the connection fails ( = actually a non-cluster-wide state
     change happens while a cluster-wide state change is going on), we
     need to re-evaluate the pre-state-change check. In case the
     pre-state-change check allows the new state we can proceed,
     otherwise we need to fail the request.

   * How to do the synchronisation from the reception of the ACK / NACK
     packet to the termination of the request_state() function?
       * wait queue & bit.

   DATA STRUCTURES:
	* A CLUSTER_STATE_CHANGE bit == the cluster lock bit.
	* A CL_ST_CHG_SUCCESS  bit set by the receiver.
	* A CL_ST_CHG_FAIL     bit set by the receiver.
	* A wait queue.
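
   A sketch of how these pieces could fit together (all names are
   assumptions, not the actual code):

    int cl_wide_state_change(struct drbd_conf *mdev, int new_state)
    {
            int rv;

            /* 1. take the cluster state change lock */
            wait_event(mdev->state_wait,
                       !test_and_set_bit(CLUSTER_STATE_CHANGE, &mdev->flags));

            /* 2. ask the peer */
            send_state_request(mdev, new_state);

            /* 3. sleep until drbd_asender saw the ACK or the NACK */
            wait_event(mdev->state_wait,
                       test_bit(CL_ST_CHG_SUCCESS, &mdev->flags) ||
                       test_bit(CL_ST_CHG_FAIL, &mdev->flags));

            if (test_and_clear_bit(CL_ST_CHG_SUCCESS, &mdev->flags))
                    rv = apply_state_change(mdev, new_state);
            else
                    rv = -EAGAIN;   /* peer refused, or connection lost */
            clear_bit(CL_ST_CHG_FAIL, &mdev->flags);

            /* 4. release the lock and wake possible waiters */
            clear_bit(CLUSTER_STATE_CHANGE, &mdev->flags);
            wake_up(&mdev->state_wait);
            return rv;
    }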

   TODOS:
    Evaluate if it is possible to use it for starting resync. (invalidate)
    Evaluate it for the other cases...

  90% DONE. Changing the role to primary already uses this
       mechanism. Starting resync with invalidate and invalidate_remote
       now also uses this method. Detaching now also uses this mechanism.

34 Improve the initial hand-shake to identify the sockets (and TCP
   links) by an initial message, and not only by the connection timing.

   99% DONE

35 Bigger AL-extents (e.g. 16MB)

36 Increase the number of UUID history slots.

37 In case heartbeat (or someone else) makes us primary, we need to
   check first whether the peer is alive.
   Currently we have a problem when heartbeat's dead time is smaller
   than DRBD's network timeout.

38 Create another on-io-error handler that retries failed read
   operations on the peer, but does not detach from the local disk,
   and marks that block in the bitmap as out of date.

   Simon works on this.

39 Send mirrored write requests out of the worker context.
   99% DONE

40 Do something with FLUSHBUFS ioctl.

41 Fix DRBD's behaviour in case of a common power failure when
   both nodes were in primary role.

   See the algorithm of item 16, sections 4 to 4.4.

   Further we need a resync-roles-conflict ("rr-conflict")
   strategy option with the following values:
     "disconnect" ... No automatic resynchronisation, simply disconnect.
     "violently" .... Sync to the primary node is allowed, violating the
	              assumption that data on a block device is stable
		      for one of the nodes. DANGEROUS, DO NOT USE.
     "call-pri-lost"
                      Call this helper program on one of the machines.
                      This program is expected to halt or reboot the
                      machine.

   An exception of course is a primary disk-less node that gets a disk
   attached. Such a node becomes sync target, but since it does not
   see a violent data change, this state transition is always allowed.

   99% DONE

42 Forward-port the ability to resume the TL after IO was frozen,
   in case the connection is re-established.

43 Fix indexed meta-data.

44 Callbacks to userspace should run asynchronously.

Maybe:

*  Switch to protocol C in case we are running without a local
   disk and are configured to use protocol A or B.

*  Dynamic misc char device instead of IOCTLs for configuration. Evaluate
   if the configuration could be done over a netlink socket as well...

*  A netlink socket to communicate events to userspace.
   - All state changes
   - the need to outdate the peer

*  Write some heartbeat glue to do a graceful switchover in case of
   a local IO failure. (requires the netlink socket thing)

plus-branches:
----------------------

1 Make use-csums use the kernel's crypto API

2 Implement online verification

3 Change the bitmap code to work with unmapped highmem pages, instead
  of using vmalloc()ed memory. This allows users of 32-bit platforms
  to use drbd on big devices (in the ~3 TB range).

4 3-node support. Build and test a 3-node setup (a 2nd DRBD stacked over
  a DRBD pair). Enhance the user-level tools to support the 3-node
  setup.

5 Have protocol version 74 available in drbd-0.8, to allow rolling
  upgrades

