Rationale

The code in this daemon process was developed and separated from libsmoco sources at a critical time in the development of smocod.

It had recently been discovered that during high-volume message traffic between smocod and the controllers on the SMOCO CAN bus, the SMOCO system's Ethernet-CAN gateway device (HMS "IXXAT CAN@Net II") was silently dropping messages when its transmit buffer overflowed. Various measures were taken to try to prevent this. HMS suggested sending requests to the IXXAT device to check the status of its transmit buffer, but these query messages were sent to the IXXAT along with all other SMOCO CAN traffic and took significant time to process; by the time the responses arrived, the values were too stale to be usable. Artificial rate-limiting and other measures were also attempted, but these created new timing-related problems or exacerbated existing ones.

At about the same time, the earlier product was discontinued, making an upgrade to the newer "CAN@Net NT 200" device necessary for both development and deployment. Unfortunately, the new device suffered from the same problem. Critically, however, it was equipped with two CAN transceivers, individually addressable and configurable through the new extended ASCII Ethernet-side protocol.

By physically wiring the second CAN transceiver in parallel to the first on the same CAN bus and setting its filter to ignore messages from SMOCO and only listen to messages from smocod, a means became available to implement a 100% reliable flow control mechanism. Messages from the second channel would be interpreted as "echoes" of commands sent to SMOCO nodes through the IXXAT, thus ensuring that they had been sent successfully.

The code to perform this flow control was at first attempted within libsmoco, by various means. At the time, smocod (SCS and SMOCO interaction) and libsmoco (low-level SMOCO communication code) were suffering from a number of deep functional problems, due partly to bugs in libsmoco and partly to severe timing-dependent SMOCO firmware bugs that would not be discovered until much later. There was much uncertainty about the source of these problems, notably the apparent inconsistency of message transmission latency through the new flow control code, and its possible interference with other libsmoco code execution as a result of thread locking. It was therefore decided to isolate the flow-control problem from the smocod process, and to create a stable foundation from this lowest-level code once its performance had been proven essentially perfect. This initiative resulted in the development of smocand, a separate daemon that provides a guaranteed-reliable communication interface to SMOCO hardware with a very low and consistent transmit latency. As an additional potential benefit, the relocated code could in theory be run from anywhere on the network, including a computer separate from the one running SCS and smocod.

Code was initially copied from libsmoco, but essentially none of this early code remains. smocand represents some of the earliest refactoring of smocod-related code to use message queueing, in particular ZeroMQ and its philosophy of concurrent code development (put simply, "don't share data between threads"). By avoiding the use of shared state, locking, semaphores and other traditional IPC mechanisms, smocand could be written in isolated functional layers, with each layer proved functional in turn. From this established basis, the remaining problems in SMOCO and smocod could be found and resolved.

Architecture and Layers

The work performed by smocand is divided into three layers of operation, each comprising one or more threads which communicate with each other via message queues.

Layer 1: "CAN-IXXATNT"

The lowest "CAN-IXXATNT" layer handles direct communication with SMOCO nodes through the IXXAT-NT gateway device. This layer acts as a protocol translator, communicating with the two CAN transceivers on the IXXAT via its ASCII-based protocol. It separates echoes from normal incoming responses from SMOCO and relays them to separate message queues. Three ZMQ message queue sockets are provided by this layer:

-> Incoming messages to transmit from an application to SMOCO (TX)
<- Outgoing messages received from SMOCO and relayed to an application (RX) (exposed to applications)
<- Outgoing messages received from the SMOCO bus via second channel, which were sent by the application (ECHO)
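
The separation of echoes from normal responses can be sketched as a routing decision on the gateway channel a frame arrived on. This is only an illustration: the frame structure, the channel numbering (1 for the primary transceiver, 2 for the echo transceiver) and all names below are assumptions, not taken from the smocand sources or from the IXXAT protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame layout; the real structure is defined in
 * smocand_protocol.h and may differ. */
typedef struct {
    uint8_t  channel;   /* gateway channel the frame arrived on (assumed 1 or 2) */
    uint32_t can_id;    /* CAN identifier */
    uint8_t  len;       /* number of valid data bytes, 0..8 */
    uint8_t  data[8];
} can_frame_t;

/* Assumed channel numbering, for illustration only. */
#define CHANNEL_PRIMARY 1   /* carries responses from SMOCO nodes */
#define CHANNEL_ECHO    2   /* filtered to see only frames sent by smocod */

typedef enum { ROUTE_RX, ROUTE_ECHO, ROUTE_DROP } route_t;

/* Decide which message queue a frame received from the gateway belongs to. */
route_t classify_frame(const can_frame_t *f)
{
    assert(f != NULL);
    switch (f->channel) {
    case CHANNEL_PRIMARY: return ROUTE_RX;    /* relay to applications */
    case CHANNEL_ECHO:    return ROUTE_ECHO;  /* relay to the echo queue */
    default:              return ROUTE_DROP;  /* unknown channel: ignore */
    }
}
```

Because the second transceiver's filter already excludes traffic from SMOCO nodes, the channel alone is enough to tell an echo from a response in this sketch.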

Layer 2: "Echo-Reply"

The "Echo-Reply" layer presents three message queue sockets:

<-> Incoming messages to send to SMOCO, and outgoing echoes routed only back to the sender (TXECHO)
<- Outgoing messages being sent to SMOCO (connected to TX socket of CAN-IXXATNT layer)
-> Incoming echoes (connected to ECHO socket of CAN-IXXATNT layer)

The Echo-Reply layer does the work of associating incoming echoes with the previously transmitted message that generated them. It does this by keeping a table of messages that have been sent, and comparing against that table when echoes arrive. Because messages carry no unique identifier, the match is based only on the message content and the node to which the message is addressed. The SMOCO CAN protocol mandates that only one message can be sent to a node at a time, so this is sufficient in practice. Multiple senders can submit messages to this layer via the TXECHO socket; the Echo-Reply layer ensures that the echo is sent only to the sender of the corresponding message.
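
The table lookup described above might look roughly like the following. This is a minimal sketch, not the smocand implementation: the table size, the numeric sender identifier and every name here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING  64   /* hypothetical table capacity */
#define CAN_MAX_DATA 8    /* classic CAN payload limit */

/* One transmitted message that is still waiting for its echo. */
typedef struct {
    int      in_use;
    uint32_t sender;               /* hypothetical identifier of the submitting client */
    uint8_t  node;                 /* destination SMOCO node */
    uint8_t  len;
    uint8_t  data[CAN_MAX_DATA];
} pending_t;

typedef struct { pending_t slot[MAX_PENDING]; } pending_table_t;

void pending_init(pending_table_t *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
}

/* Record a transmitted message. Returns 0 on success, -1 if the table is full. */
int pending_add(pending_table_t *t, uint32_t sender, uint8_t node,
                const uint8_t *data, uint8_t len)
{
    assert(t != NULL && data != NULL && len <= CAN_MAX_DATA);
    for (int i = 0; i < MAX_PENDING; i++) {
        pending_t *p = &t->slot[i];
        if (!p->in_use) {
            p->in_use = 1;
            p->sender = sender;
            p->node = node;
            p->len = len;
            memcpy(p->data, data, len);
            return 0;
        }
    }
    return -1;
}

/* Match an echo by node and content only. On a match, the entry is removed,
 * *sender_out receives the original sender, and 1 is returned; otherwise 0. */
int pending_match(pending_table_t *t, uint8_t node,
                  const uint8_t *data, uint8_t len, uint32_t *sender_out)
{
    assert(t != NULL && data != NULL && sender_out != NULL);
    for (int i = 0; i < MAX_PENDING; i++) {
        pending_t *p = &t->slot[i];
        if (p->in_use && p->node == node && p->len == len &&
            memcmp(p->data, data, len) == 0) {
            *sender_out = p->sender;
            p->in_use = 0;
            return 1;
        }
    }
    return 0;
}
```

A linear scan is used here for simplicity; with at most one outstanding message per node the table stays small, so a more elaborate index is unlikely to be needed.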

Layer 3: "CAN-Send"

The third (and at the moment, highest) functional layer is the "CAN-Send" layer, which provides the bulk of the fault-tolerance logic, resending messages as needed until an echo is received. It has one main message queue socket:

<-> Incoming messages to be transmitted to SMOCO, with acknowledgement replies (exposed to applications)

The incoming messages and acknowledgements (indicating success or failure of transmission) are handled by a single thread in this layer. A fixed pool of additional threads (10 by default) allows multiple messages to be in flight while waiting for echoes. Outgoing messages are distributed round-robin to these threads, each of which presents a message queue socket:

<-> Outgoing messages to send to SMOCO, and incoming echoes (connected to TXECHO socket of Echo-Reply layer)
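
The round-robin distribution to worker threads can be illustrated with a small index-picking helper. The names below are hypothetical, and the sketch shows only the selection logic, not the threads or sockets themselves.

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_WORKERS 10   /* matches the default thread count stated above */

/* Hypothetical dispatcher state: how many workers exist and whose turn is next. */
typedef struct {
    unsigned n_workers;
    unsigned next;
} dispatcher_t;

void dispatcher_init(dispatcher_t *d, unsigned n_workers)
{
    assert(d != NULL && n_workers > 0);
    d->n_workers = n_workers;
    d->next = 0;
}

/* Return the index of the worker that should take the next outgoing message,
 * then advance so that successive calls cycle through all workers in order. */
unsigned dispatcher_pick(dispatcher_t *d)
{
    assert(d != NULL);
    unsigned chosen = d->next;
    d->next = (d->next + 1) % d->n_workers;
    return chosen;
}
```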

This layer implements a timeout and retry mechanism, resending messages for which no echo is found on the bus. After a number of retries and no echo, a fault is reported back to the calling application, indicating a serious communication error.
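
The timeout-and-retry behaviour can be sketched as a loop around two callbacks, one that transmits and one that waits for the echo. Everything here is an assumption made for illustration: the callback signatures, the result codes, and the fake bus used to exercise the loop do not come from the smocand sources, and the actual retry count and timeout are not stated in this document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks: transmit hands one message to the lower layer;
 * wait_echo returns 1 if the echo arrived within timeout_ms, else 0. */
typedef int (*transmit_fn)(void *ctx);
typedef int (*wait_echo_fn)(void *ctx, int timeout_ms);

typedef enum { SEND_ACK = 0, SEND_FAULT = 1 } send_result_t;

/* Transmit up to max_attempts times, stopping as soon as an echo is seen.
 * SEND_FAULT means no echo was observed after the final attempt. */
send_result_t send_with_retry(transmit_fn tx, wait_echo_fn wait_echo, void *ctx,
                              int max_attempts, int timeout_ms)
{
    assert(tx != NULL && wait_echo != NULL && max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        tx(ctx);
        if (wait_echo(ctx, timeout_ms))
            return SEND_ACK;
    }
    return SEND_FAULT;
}

/* Test double standing in for the gateway: silently drops the first
 * drops_remaining transmissions, then echoes every later one. */
typedef struct {
    int drops_remaining;
    int transmissions;
    int echo_pending;
} fake_bus_t;

int fake_transmit(void *ctx)
{
    fake_bus_t *bus = ctx;
    bus->transmissions++;
    if (bus->drops_remaining > 0) {
        bus->drops_remaining--;
        bus->echo_pending = 0;   /* message lost: no echo will appear */
    } else {
        bus->echo_pending = 1;
    }
    return 0;
}

int fake_wait_echo(void *ctx, int timeout_ms)
{
    fake_bus_t *bus = ctx;
    (void)timeout_ms;            /* the fake bus answers immediately */
    int seen = bus->echo_pending;
    bus->echo_pending = 0;
    return seen;
}
```

With a fake bus that drops the first two transmissions, send_with_retry succeeds on the third attempt; with a bus that drops everything, it returns SEND_FAULT once max_attempts is exhausted, which corresponds to the fault reported back to the calling application.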

Application Interface

The resulting application interface is a simple pair of sockets for communication with SMOCO, plus an optional control socket:

Outgoing messages to SMOCO and acknowledgements: a ROUTER-type socket that accepts REQ-type client connections. An acknowledgement or fault response is guaranteed for each transmitted message. By default, this socket is located at TCP port 5000.
Incoming messages from SMOCO: a PUB-type socket that accepts SUB client connections. By default, this socket is located at TCP port 5001.
A control socket providing an optional means to control (e.g. stop) the smocand process, by default at TCP port 5002.

The details of this interface are made available to C programs with the header file smocand_protocol.h, which contains these defaults, and the structure type for the CAN-emulating protocol used by smocand.
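
As an illustration of how a client might form the ZeroMQ endpoint strings for these sockets, the sketch below builds "tcp://host:port" addresses from the default ports listed above. The macro and function names are hypothetical; the actual names defined in smocand_protocol.h may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Default ports as stated above; the macro names are hypothetical. */
#define SMOCAND_DEFAULT_SEND_PORT    5000  /* ROUTER socket: messages to SMOCO, with acks */
#define SMOCAND_DEFAULT_RECV_PORT    5001  /* PUB socket: messages from SMOCO */
#define SMOCAND_DEFAULT_CONTROL_PORT 5002  /* optional control socket */

/* Write a ZeroMQ TCP endpoint ("tcp://host:port") into buf.
 * Returns 0 on success, -1 if host is empty or the result does not fit. */
int smocand_endpoint(char *buf, size_t size, const char *host, int port)
{
    assert(buf != NULL && host != NULL);
    if (strlen(host) == 0)
        return -1;
    int n = snprintf(buf, size, "tcp://%s:%d", host, port);
    if (n < 0 || (size_t)n >= size)
        return -1;   /* encoding error or truncated output */
    return 0;
}
```

A REQ client would pass the first endpoint to zmq_connect(), and a SUB client the second; since smocand can run on a separate machine, the host is left as a parameter rather than fixed to localhost.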