Server Protocol
---------------

  Broadly, a server services three types of protocol:

     1) initial handshake
     2) reconnect handshake
     3) data cycle

  There is no stop protocol.  Stop is by timeout on request handling or
  close of connection, by signal, or by receipt of a new connection from
  the same client IP as before.
  
  Some aspects of control are handled by an auxiliary daemon
  (enbd-sstatd) which communicates via signal.

  The "initial handshake" takes place on a control connection designated
  on the server command line (there is no default, nor any lookup in a
  directory service such as /etc/services).  Standardly, the control
  connection is made via a tcp port, and results in the server
  immediately forking off a child process that handles the initial
  handshake.  The child handles the handshake while the original daemon
  continues listening for new connects.

  The initial handshake :

   a) serves certain data, such as the size of
      the served resource and whether it is readonly or not. This
      data includes:
      i) details of a session connection, standardly a tcp port, on which
         further transactions will take place after the handshake,
      ii) a secret that will be used on subsequent reconnects as an
          authenticator of the server to the client;
   b) receives certain data representing client configuration,
      such as a blocksize for subsequent data transactions;
   c) responds with an agreed value for such items as have to be
      mutually agreed, such as the blocksize.

  The handshake protocol is ascii, and the order of the elements of
  the handshake is determined by the server, not the client. The client
  is prepared to accept all elements but the initial hello element in
  any order.

  On receipt of a valid handshake the server will store the details of
  the client IP address in a permanent database on disk (usually
  /var/state/nbd/*-id.client_ips) which will be used to maintain state
  across reboots via the auxiliary daemons.

  Further details will be given below.

  After the initial handshake is completed, the server spawns a daemon
  (the "session master") listening on the agreed session port.

  The reconnect handshake :
  
  takes place on the session connection at the moment of its first
  establishment and consists of an abbreviated form of the initial
  handshake.  It gives way to the data cycle on that connection on its
  completion.

  The daemon, on receiving a valid reconnect handshake from a client
  (identified by IP address) that it is already serving will close and
  kill its existing slave handlers and start new handlers for the
  replacement connect.

  Further details will be given below.

  The data cycle :

  takes place on the channel established by a reconnect handshake.
  It broadly consists of

     a) receiving a "request header", optionally followed by data;
     b) performing an action;
     c) replying with a "reply header", optionally followed by data.

  In addition, steps b) and c) may be interchanged if the server is
  started with flags specifying "async" operation.

  Data (if present) always follows the header and is not terminated
  with any trailer. The data size is declared in the header.

  The reply is always sent, even if the client is set to discard it.
  The send order of replies is not significant. The reply contains an
  identifier that must match that of the request it replies to.
  Replies may be repeated, and excess replies will be discarded by
  the client.  The client may repeat requests as a result of timeouts.
  The server may maintain a cache of request identifiers in order to be
  able to acknowledge requests already treated once with a consistent
  reply.
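
  A minimal sketch of such a cache, written in Python for brevity rather
  than taken from the ENBD sources.  The class name ReplyCache, the
  capacity bound and the oldest-first eviction are all assumptions made
  for illustration; the protocol only says that a cache may be kept.

```python
from collections import OrderedDict


class ReplyCache:
    """Remember the status already sent for recently treated request handles."""

    def __init__(self, capacity=1024):
        self.capacity = capacity      # assumed bound, not set by the protocol
        self._seen = OrderedDict()    # handle -> error code already replied

    def lookup(self, handle):
        """Return the error code previously sent for this handle, or None."""
        return self._seen.get(handle)

    def record(self, handle, error):
        """Store the reply for a handle, evicting the oldest entry when full."""
        self._seen[handle] = error
        self._seen.move_to_end(handle)
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)
```

  A handler would call lookup() on each incoming handle and, on a hit,
  resend the recorded status instead of repeating the action.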

  The server must attempt to maintain the sequence order of requests on
  disk, as determined by the embedded sequence number received,
  according to policy.  The policy may be null.  The default is to
  maintain the order of write requests, unless that implies holding up
  another request for more than 1s. The client will number write
  requests sequentially.  Read requests will be numbered with the
  sequence number of the preceding write request. A read with sequence
  number n cannot be serviced until the write with sequence number n
  has been treated.
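
  The default policy can be sketched as a single predicate.  This is one
  reading of the rule above, not the ENBD implementation: the function
  name, the "held_for" parameter, and the assumption that sequence
  numbers increase by one per write and never wrap are all illustrative.

```python
READ, WRITE = 0, 1           # request type codes
MAX_HOLD_SECONDS = 1.0       # a request is not held up longer than this


def can_service(req_type, seqno, last_write_done, held_for=0.0):
    """Decide whether a request may be serviced now.

    last_write_done is the sequence number of the most recent write that
    has been treated; held_for is how long this request has been waiting.
    """
    if held_for > MAX_HOLD_SECONDS:
        return True                           # ordering gives way after 1s
    if req_type == WRITE:
        return seqno == last_write_done + 1   # writes go to disk in order
    if req_type == READ:
        return seqno <= last_write_done       # write n must precede read n
    return True                               # other types: no constraint assumed
```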

  If a request/reply cycle is not completed within 30s, the server
  thread handling it must close its connection and die.  A new thread
  will replace it if and when the client reconnects.
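
  One way to enforce such a limit is to give every receive the time
  remaining until a fixed deadline.  The sketch below is an illustration
  in Python; the helper name recv_exact and the use of a monotonic clock
  are assumptions, and only the 30s figure comes from the protocol.

```python
import socket
import time

CYCLE_TIMEOUT = 30.0   # seconds allowed for one request/reply cycle


def recv_exact(sock, nbytes, deadline):
    """Read exactly nbytes from sock, or raise OSError at the deadline.

    deadline is an absolute time.monotonic() value, e.g.
    time.monotonic() + CYCLE_TIMEOUT taken when the cycle starts.
    """
    chunks = []
    remaining = nbytes
    while remaining:
        left = deadline - time.monotonic()
        if left <= 0:
            raise TimeoutError("request/reply cycle exceeded its deadline")
        sock.settimeout(left)
        chunk = sock.recv(remaining)     # raises a timeout error if 'left' expires
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

  A handler that catches OSError around the whole cycle can then close
  its connection and exit, as required above.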



Details of the server data cycle
--------------------------------

  Request and reply headers consist of close-packed data items
  individually in network byte order, the whole being padded to a
  multiple of 64 bytes length in total by trailing zeros.  Each data
  item is a multiple of 4 bytes in length.

  Request header:

  size    sign  name   function
  ------  ----  ----   --------
  4 bytes    u  magic  fixed identifier for ENBD requests:  0x25609513
  4 bytes    u  type   code for the kind of request, READ, WRITE, etc.
  4 bytes    u  handle unique identifier for the request
  8 bytes    u  from   offset "ondisk" to which the request refers
  4 bytes    u  len    size of following data or area ondisk
  4 bytes    u  flags  extra controls
  8 bytes    u  time   microseconds since the epoch when request issued
  8 bytes    s  zone   client timezone indicator (only low bytes used)
  4 bytes    u  seqno  sequence number for the request on the client
  16 bytes   u  data   (ordered as a 4x4bytes array) for a (md5) checksum
                       or other special in-request data.

  that is, 72 bytes rounded up to 128 (although the sizes listed above
  sum to only 64 bytes).


  Reply header:

  size    sign  name   function
  ------  ----  ----   --------
  4 bytes    u  magic  fixed identifier for ENBD replies:   0x67446698
  4 bytes    s  error  return status from action executed as requested
  4 bytes    u  handle unique identifier for the request serviced
  4 bytes    u  flags  code for kind of reply and other controls
  8 bytes    u  time   microseconds since the epoch when reply issued
  8 bytes    s  zone   server timezone indicator (only low bytes used)
  16 bytes   u  data   (ordered as a 4x4bytes array) for a (md5) checksum
                       or other special in-reply data.

  that is, 48 bytes rounded up to 64 bytes.
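
  The reply header layout can be expressed with Python's struct module,
  as a sketch for readers who want to check sizes and byte order.  The
  function names are hypothetical, and treating the "data" field as four
  unsigned 32-bit words is an assumption based on the "4x4bytes array"
  description.

```python
import struct

REPLY_MAGIC = 0x67446698
REPLY_FMT = "!IiIIQq4I"      # "!" = network byte order, no implicit padding
REPLY_WIRE_SIZE = 64         # 48 bytes of fields, zero-padded to 64


def pack_reply(error, handle, flags, time_us, zone, data=(0, 0, 0, 0)):
    """Build the 64-byte wire form of a reply header."""
    body = struct.pack(REPLY_FMT, REPLY_MAGIC, error, handle, flags,
                       time_us, zone, *data)
    return body.ljust(REPLY_WIRE_SIZE, b"\0")


def unpack_reply(buf):
    """Parse a 64-byte reply header into a dict, checking the magic."""
    if len(buf) != REPLY_WIRE_SIZE:
        raise ValueError("reply header must be exactly 64 bytes")
    fields = struct.unpack(REPLY_FMT, buf[:struct.calcsize(REPLY_FMT)])
    magic, error, handle, flags, time_us, zone, *data = fields
    if magic != REPLY_MAGIC:
        raise ValueError("bad reply magic")
    return {"error": error, "handle": handle, "flags": flags,
            "time": time_us, "zone": zone, "data": tuple(data)}
```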

  There are 5 different kinds of request, determined by the "type"
  field in the request, as follows:

  code   name    function
  ----   ----    -------
     0   READ    request to read data from the server
     1   WRITE   request to write data to the server
     2   IOCTL   request to execute an ioctl on the server
     3   CKSUM   request for the checksum of an area on disk
     4   SPECIAL has a special interpretation

  Of these, only READ and WRITE requests are usually received and must
  be dealt with by the server.
  
  That includes READ requests for 0 bytes, which have a special
  interpretation as a "server check medium" command.  These will
  normally be emitted every second or so as keep alives and deadman
  probes by the client.  The server must check its resource (normally at
  the indicated offset) and respond appropriately.
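
  For illustration, the type codes and the zero-length READ test can be
  written down as follows.  The names RequestType and is_medium_check
  are hypothetical; only the numeric codes come from the table above.

```python
from enum import IntEnum


class RequestType(IntEnum):
    """Values of the "type" field in a request header."""
    READ = 0
    WRITE = 1
    IOCTL = 2
    CKSUM = 3
    SPECIAL = 4


def is_medium_check(req_type, length):
    """True for a 0-byte READ, i.e. a "server check medium" probe."""
    return req_type == RequestType.READ and length == 0
```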

  READ requests :

  The normal data cycle on the server is ..

      0) receive request header
      1) read indicated data area on disk
      2) send reply header
      3) send data

   If the server is in async mode, it may interchange steps 1) and
   2).  In that case, if an error occurs it must be signalled in the
   reply to the next request (this part of the protocol needs
   elaboration).

   Steps 1) and 3) may then also be intermixed, i.e. carried out
   interleaved.
   
   The error field in the reply will be 0 iff the request was satisfied
   correctly (currently nothing more detailed is implemented).

   A request to read 0 bytes must be respected. The server should
   manoeuvre to the correct spot on disk in order to check it.
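
   The synchronous READ cycle, steps 1) to 3), might look like the
   sketch below.  It is illustrative only: the function names, the use
   of 1 as the nonzero error code, the zeroed flags and zone fields, and
   the choice to send no data after a failed read are assumptions, not
   taken from the ENBD sources.

```python
import struct
import time

REPLY_MAGIC = 0x67446698


def reply_header(error, handle):
    """64-byte reply header; flags, zone and data are left as zeros here."""
    now_us = int(time.time() * 1_000_000)
    body = struct.pack("!IiIIQq4I", REPLY_MAGIC, error, handle, 0,
                       now_us, 0, 0, 0, 0, 0)
    return body.ljust(64, b"\0")


def serve_read(disk, wire, handle, offset, length):
    """Read the area from disk, then send the reply header and the data."""
    try:
        disk.seek(offset)          # done even for a 0-byte "check medium" read
        data = disk.read(length)
        if len(data) != length:
            raise OSError("short read")
        error = 0
    except OSError:
        data, error = b"", 1       # assumed: any nonzero value signals failure
    wire.write(reply_header(error, handle))
    wire.write(data)
    return error
```

   Here "disk" and "wire" are any seekable and writable binary file
   objects, for example an open block device and a socket file.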

   WRITE requests :

   The data cycle is ...

      0) receive request header
      1) receive following data
      2) write received data to disk
      3) send reply header

   If the server is in async mode, it may interchange steps 2) and 3).
   In that case, if an error occurs it must be signalled in the reply to
   the next request (this part of the protocol needs elaboration).

   Steps 1) and 2) may also be intermixed, i.e.  carried out
   interleaved.

   The error field in the reply will be 0 iff the request was satisfied
   correctly (currently nothing more detailed is implemented).

   IOCTL requests :

   The data cycle is ...

      0) receive request header
      1) receive any following data
      2) interpret and carry out ioctl
      3) send reply header
      4) send any following data

   ioctls have a complex organisation and treatment. The data for the
   ioctl is scattered around the incoming request:

   upper 4 bytes of request from field = ioctl type designator
   lower 4 bytes of request from field = ioctl argument parameter

   There is no other convenient place to store these, since all 32 bits
   of these ioctl fields carry information.

   In particular, the ioctl type designator has an interpretation
   defined in the kernel's ioctl.h file. The top two bits are obtained
   via the _IOC_DIR macro, and classify ioctls into four kinds, each of
   which has implications for data treatment and expectations.

   value kind        explanation
   ----  ----        ------------
    0   _IOC_NONE    argument is ignored
    1   _IOC_WRITE   argument is explicit data to be written
    2   _IOC_READ    argument is an indirection to data to be read
    3   (union)      argument is an indirection to data to be written and read

    The next 14 bits of the ioctl type designator are a size indicator
    for the indirected zone. The lowest 16 bits are the numerical
    identifier for the ioctl.
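
    The bit layout just described can be unpicked as in the following
    sketch.  The function names are hypothetical; the shifts and masks
    follow directly from the 2 + 14 + 16 bit split given above and from
    the placement of the designator and argument in the "from" field.

```python
def split_from_field(from_field):
    """Split the 8-byte "from" field into (ioctl type designator, argument)."""
    return (from_field >> 32) & 0xFFFFFFFF, from_field & 0xFFFFFFFF


def decode_ioctl_type(designator):
    """Return (direction, size, identifier) from a 32-bit ioctl designator."""
    direction = (designator >> 30) & 0x3     # top two bits, as _IOC_DIR gives
    size = (designator >> 16) & 0x3FFF       # next 14 bits: indirected size
    ident = designator & 0xFFFF              # low 16 bits: ioctl identifier
    return direction, size, ident
```

    For instance, the designator 0x80081272 decodes to direction 2
    (_IOC_READ), size 8 and identifier 0x1272.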

    When _IOC_READ is set, the server must provide a buffer to which or
    from which the data in the request is copied, and the address of
    which will be the argument in the local ioctl it constructs. The
    buffer must be of the size indicated.

    Because the size indicator in the ioctl type designator itself is
    frequently inaccurate (programmer error or misunderstanding or
    unawareness of the significance, or else it is simply not statically
    computable), the client repairs the value before passing it to the
    server, and the server must transform it back before executing it
    locally.  The transforms are detailed in the ENBD code for known
    ioctls.

    In any case, the client will provide a value in the "len" field of
    the request header which is definitively the size of any data on the
    wire following either the request or the reply header.  The value
    will be a multiple of 512 bytes.  This field should be trusted from
    the point of view of achieving a correct wire protocol.  If that
    field is zero, then there is no data following the request header or
    expected following the reply header.

    If the "len" field is nonzero and the _IOC_READ bit is set, then
    there is data expected to be returned following the reply header.
    If the _IOC_WRITE bit is also set, then there is data following the
    request header also.
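
    These rules determine how many bytes travel in each direction.  The
    sketch below is a literal reading of them, with a hypothetical
    function name; treating a write-only ioctl as carrying no wire data
    (its argument being explicit) is an interpretation of the table
    above rather than something stated outright.

```python
IOC_WRITE, IOC_READ = 1, 2   # direction bits of the ioctl type designator


def ioctl_wire_data(designator, length):
    """Return (bytes after the request header, bytes after the reply header)."""
    if length % 512:
        raise ValueError("len must be a multiple of 512 bytes")
    if length == 0:
        return 0, 0                          # nothing follows either header
    direction = (designator >> 30) & 0x3
    if not direction & IOC_READ:
        return 0, 0                          # no indirected data on the wire
    after_request = length if direction & IOC_WRITE else 0
    return after_request, length
```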


   CKSUM requests :

   The data cycle is ...

      0) receive request header
      1) form checksum of indicated data area
      2) send reply header with embedded checksum

   The error field in the reply will be 0 iff the request was satisfied
   correctly (currently nothing more detailed is implemented).
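
   As an illustration of step 1), an md5 digest is 16 bytes long and so
   fits the 16-byte "data" field of the reply header exactly.  In this
   sketch the function name is hypothetical, and splitting the digest
   into four big-endian 32-bit words is an assumption about how the
   4x4bytes array is filled.

```python
import hashlib
import struct


def area_checksum(disk, offset, length):
    """Return the md5 of an area on disk as four unsigned 32-bit words."""
    disk.seek(offset)
    digest = hashlib.md5(disk.read(length)).digest()   # always 16 bytes
    return struct.unpack("!4I", digest)
```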


   SPECIAL requests :

   The data cycle is ...

      0) receive request header
      1) do special action
      2) send reply header

   The error field in the reply will be 0 iff the request was satisfied
   correctly (currently nothing more detailed is implemented).

   The interpretation of the action is up to the server.  Normally
   "specials" serve at least as request barriers, and their reception
   implies that no following request can be serviced until all previous
   requests have been serviced.


Details of the initial handshake
--------------------------------

   The handshake is conducted in ascii.  Each element of the interchange
   is terminated by a newline character.  Each element sent is initiated
   by a keyword (possibly preceded by whitespace) followed by some
   whitespace and then an ascii data field (possibly followed by
   whitespace).

   There is a timeout of 30s for the handshake (and each of its
   elements). 

   The server initiates each element of the handshake, in the specified
   order, after connection.  The client merely responds.  If a response
   is needed from the client within a particular element, the client
   responds with the data item with no keyword, then a newline.  The
   client response may contain leading and trailing whitespace.
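
   The element format lends itself to a very small parser.  The sketch
   below is illustrative: the function names are hypothetical, and
   whether string data is quoted on the wire is not specified here, so
   the data field is returned as received, stripped of whitespace.

```python
def parse_element(line):
    """Split one newline-terminated element into (keyword, data)."""
    if not line.endswith(b"\n"):
        raise ValueError("element is not terminated by a newline")
    parts = line.decode("ascii").split(None, 1)   # skips leading whitespace
    if not parts:
        raise ValueError("empty element")
    keyword = parts[0]
    data = parts[1].strip() if len(parts) > 1 else ""
    return keyword, data


def format_response(value):
    """Client side: a bare data item with no keyword, then a newline."""
    return f"{value}\n".encode("ascii")
```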

   The order of the elements is as follows:

         keyword data         direction   purpose
         ----    ----         ---------   -------
     1)  helo    "nbd-client"  sent       general identification
     2)  pass    ...           sent       ignored
     3)  mgck    8 byte int    sent       identifier 0x00420281861253LL
     4)  rdev    4 byte int    recvd      identify client's nbd device
     5)  size    8 byte int    sent       size of resource in bytes
     6)  sign    <128 chars    sent       ascii signature identifying server
     7)  ro      "yes"/"no"    sent       boolean for resource readonly
     8)  bksz    4 byte int    both       agreed blocksize (max offer taken)
     9)  rqto    4 byte int    both       keep alive intvl (max taken)
     10) nprt    4 byte int    recvd      number of channels requested
     11) port    4 byte int    sent       session port assigned to client

   The server is responsible for choosing the session port.

   Sending the session port terminates the handshake. Further
   communication goes via the session port. The server will launch a
   new thread listening on the session port to handle the session
   data cycle. It will launch one thread for each channel agreed in the
   initial handshake.

Details of the reconnection handshake
-------------------------------------

   This takes place on the session port. Its order is:

         keyword data         direction   purpose
         ----    ----         ---------   -------
     1)  helo    "old-friend"  sent       general identification
     2)  pass    ...           sent       ignored
     3)  mgck    8 byte int    sent       identifier 0x00420281861253LL
     4)  sign    <128 chars    sent       ascii signature identifying server

   After the handshake is sent, the data cycle begins on the
   connection.

   The handshake has 30s in which to complete. If it does not, the
   thread must die.

Reconnections
-------------

  The threads handling the data cycle may die at any time. When one does,
  the thread which did the initial handshake (the "session master")
  must start a new thread to replace it. The new thread will listen for
  a reconnection handshake from the client on the agreed session port.

Session server duty cycle
-------------------------

  While the slave threads are handling the data cycle, the session
  thread (the one which did the initial handshake) must perform certain 
  housekeeping duties. One of the duties is to collect SIGCHLD and
  launch a replacement. Another duty is to periodically open and close
  the resource in order to let the server's kernel detect media changes.
  It may also choose to run fsync() on the resource, in order to flush
  buffers in VM.

Signals
--------

  SIGTERM causes the server to shut down all connections and die.

