Baoying Wang edited this page Nov 23, 2017 · 1 revision

QFJ HA

You have to consider QFJ-specific limitations when you design your HA solution.

A FIX connection is a persistent TCP connection. It is NOT HTTP; it is plain TCP.

Single FIX session limitation

You should NOT set up two identical FIX sessions (same SenderCompID(49)/TargetCompID(56)) in parallel.

For example, suppose you are a client connecting to a platform (e.g. an FX platform or an exchange). Normally the server side provides a listening address (host and port), and you connect to it from your local QFJ application with the specified CompIDs. You should NOT also connect to the server with the same CompIDs from a second local QFJ application. The server may provide a secondary address for failover. Check with the server support team when you are allowed to use it (e.g. any time the primary connection is broken, or only after formal notification from the server side, such as an email).

Alternatively, suppose you are a server that accepts connection requests from multiple clients. If you run several active QFJ server applications, you have to decide which one should handle a given client's Logon request. You need some third-party software (e.g. a TCP load balancer or dispatcher such as F5) to perform that dispatch.
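To make the "one live connection per FIX session" rule concrete, here is a minimal sketch of a guard that refuses a second live connection for the same SenderCompID(49)/TargetCompID(56) pair. The class and method names (`SessionGuard`, `tryAcquire`, `release`) are hypothetical and are not part of the QFJ API. The sketch also assumes all instances consult the same owner map; here it is a single in-process `ConcurrentHashMap`, whereas a real deployment would need that state somewhere every instance can see, which is the hard part of HA.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical guard: at most one live owner per FIX session (49/56 pair).
// Assumption: every instance consults the same map; here it is in-process only.
final class SessionGuard {
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();

    private static String key(String senderCompId, String targetCompId) {
        return senderCompId + "->" + targetCompId;
    }

    /** Returns true if instanceId now owns the session, false if another instance already does. */
    boolean tryAcquire(String senderCompId, String targetCompId, String instanceId) {
        String previous = owners.putIfAbsent(key(senderCompId, targetCompId), instanceId);
        return previous == null || previous.equals(instanceId);
    }

    /** Releases ownership, but only if instanceId is the current owner. */
    void release(String senderCompId, String targetCompId, String instanceId) {
        owners.remove(key(senderCompId, targetCompId), instanceId);
    }

    public static void main(String[] args) {
        SessionGuard guard = new SessionGuard();
        System.out.println(guard.tryAcquire("CLIENT", "PLATFORM", "qfj-a")); // true
        System.out.println(guard.tryAcquire("CLIENT", "PLATFORM", "qfj-b")); // false: already live on qfj-a
        guard.release("CLIENT", "PLATFORM", "qfj-a");
        System.out.println(guard.tryAcquire("CLIENT", "PLATFORM", "qfj-b")); // true
    }
}
```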

HA model

| Initiator or Acceptor | HA Model | Note |
|---|---|---|
| Initiator | Live/Standby | Only a single instance may hold the TCP connection to the remote FIX acceptor. The standby instance should NOT attempt the TCP connection until you detect that the live one is broken, e.g. via Linux Heartbeat or Veritas Cluster. |
| Initiator | Live/Live/Live/... | NOT SUPPORTED, since two live TCP connections cannot be set up for the same FIX session. |
| Acceptor | Live/Standby | Straightforward: set up two nodes and fail over when required, e.g. via Linux Heartbeat or Veritas Cluster. |
| Acceptor | Live/Live/Live/... | You need a dispatcher for the TCP connection requests from FIX initiators, e.g. F5 (which can be expensive) or similar products. |
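For the Acceptor Live/Live/Live/... model, the dispatcher's key property is stickiness: every connection for a given CompID pair must land on the same acceptor instance. The sketch below shows one possible routing rule (hash of the CompID pair modulo a fixed instance list). It is illustrative only: `StickyDispatcher` and the instance names are hypothetical, it assumes the dispatcher can read 49/56 from the client's Logon (a plain TCP balancer cannot do that without inspecting the first message), and a product such as F5 would apply its own persistence rules.

```java
import java.util.List;

// Illustrative sticky routing for an Acceptor Live/Live setup.
// Assumptions: 49/56 have already been extracted from the client's Logon,
// and the instance list is fixed (changing it would remap sessions).
final class StickyDispatcher {
    private final List<String> instances;

    StickyDispatcher(List<String> instances) {
        if (instances.isEmpty()) throw new IllegalArgumentException("no acceptor instances");
        this.instances = List.copyOf(instances);
    }

    /** The same SenderCompID/TargetCompID always maps to the same acceptor instance. */
    String route(String senderCompId, String targetCompId) {
        String sessionKey = senderCompId + "->" + targetCompId;
        int index = Math.floorMod(sessionKey.hashCode(), instances.size());
        return instances.get(index);
    }

    public static void main(String[] args) {
        StickyDispatcher d = new StickyDispatcher(List.of("acceptor-1", "acceptor-2", "acceptor-3"));
        System.out.println(d.route("CLIENT_A", "SERVER"));
        System.out.println(d.route("CLIENT_A", "SERVER")); // same instance as the line above
    }
}
```

Hash-modulo routing remaps sessions whenever the instance list changes. If sessions must keep their sequence state across such a change, the instances need a shared message store, or the dispatcher needs an explicit session-to-instance table.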

Note: for any HA model, you have to consider whether the FIX messages must be shared among the instances. Sharing is required if the sequence numbers are expected to continue across an intra-day reconnection so that gap messages can be recovered. See the section below on whether you need the sharing.

Message sharing across instances? (AKA: continue the sequence numbers on intra-day reconnection to recover gap messages?)

Intra-day reconnection concept

A FIX session normally has a session close window, e.g. 10 minutes daily. When the session starts, its sequence numbers start at 1. Any reconnection outside the close window is an intra-day reconnection. In that case you can either continue the sequence numbers or reset them, depending on the business requirement. More about sequence numbers and the FIX session: https://www.fixtrading.org/standards/fixt and https://www.onixs.biz/fix-dictionary/FIXT1.1/section_session_protocol.html
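The classification above can be expressed with `java.time`: a reconnection is intra-day when it falls outside the daily close window. A minimal sketch, assuming a single daily window given as local start/end times, with the start inclusive and the end exclusive, and allowing the window to cross midnight. `CloseWindow` and the example windows are hypothetical.

```java
import java.time.LocalTime;

// Hypothetical helper: is a reconnection at a given local time intra-day?
// Assumption: one daily close window [closeStart, closeEnd), possibly crossing midnight.
final class CloseWindow {
    private final LocalTime closeStart;
    private final LocalTime closeEnd;

    CloseWindow(LocalTime closeStart, LocalTime closeEnd) {
        this.closeStart = closeStart;
        this.closeEnd = closeEnd;
    }

    /** True if the time falls inside the daily close window. */
    boolean isClosed(LocalTime time) {
        if (!closeStart.isAfter(closeEnd)) {
            // Window within one day, e.g. 17:00-17:10.
            return !time.isBefore(closeStart) && time.isBefore(closeEnd);
        }
        // Window crossing midnight, e.g. 23:55-00:05.
        return !time.isBefore(closeStart) || time.isBefore(closeEnd);
    }

    /** A reconnection outside the close window is an intra-day reconnection. */
    boolean isIntraDayReconnect(LocalTime reconnectTime) {
        return !isClosed(reconnectTime);
    }

    public static void main(String[] args) {
        CloseWindow window = new CloseWindow(LocalTime.of(17, 0), LocalTime.of(17, 10));
        System.out.println(window.isIntraDayReconnect(LocalTime.of(9, 30)));  // true
        System.out.println(window.isIntraDayReconnect(LocalTime.of(17, 5)));  // false: inside the window
    }
}
```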

Is message sharing (QFJ storage) across instances needed for a FIX initiator?

  • No, if message loss is acceptable, or if there is application-level guaranteed delivery (via acks). In that case you should always reset the sequence numbers (141=Y in the Logon) on intra-day reconnection.
  • Yes, if you want sequence-gap application messages to be recovered by the FIX session layer.
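What "141=Y in the Logon" looks like on the wire can be shown by encoding a Logon (35=A) by hand. QFJ builds and sends the Logon for you, so this sketch only illustrates which fields are involved. It assumes FIX.4.4 and uses made-up CompIDs and SendingTime; it computes BodyLength(9) over the bytes after the 9= field up to and including the SOH before 10=, and CheckSum(10) as the byte sum of everything before 10=, modulo 256, as three digits. `LogonEncoder` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

// Illustrative raw FIX encoding of a Logon (35=A) that requests a sequence reset (141=Y).
// Assumptions: FIX.4.4, example CompIDs and SendingTime; QFJ normally builds this itself.
final class LogonEncoder {
    static final char SOH = (char) 0x01;

    static String field(int tag, String value) {
        return tag + "=" + value + SOH;
    }

    static String logonWithReset(String sender, String target, String sendingTime, int heartBtInt) {
        String body = field(35, "A")
                + field(49, sender)
                + field(56, target)
                + field(34, "1")                              // MsgSeqNum restarts at 1 on reset
                + field(52, sendingTime)
                + field(98, "0")                              // EncryptMethod: none
                + field(108, Integer.toString(heartBtInt))    // HeartBtInt in seconds
                + field(141, "Y");                            // ResetSeqNumFlag
        int bodyLength = body.getBytes(StandardCharsets.US_ASCII).length;
        String withoutChecksum = field(8, "FIX.4.4") + field(9, Integer.toString(bodyLength)) + body;
        int sum = 0;
        for (byte b : withoutChecksum.getBytes(StandardCharsets.US_ASCII)) {
            sum += b & 0xFF;
        }
        return withoutChecksum + field(10, String.format("%03d", sum % 256));
    }

    public static void main(String[] args) {
        String msg = logonWithReset("CLIENT", "PLATFORM", "20240101-09:00:00.000", 30);
        System.out.println(msg.replace(SOH, '|'));  // SOH shown as '|' for readability
    }
}
```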

Note: what is a FIX sequence gap? See https://www.fixtrading.org/standards/fixt and https://www.onixs.biz/fix-dictionary/FIXT1.1/section_session_protocol.html
Note: you should confirm this with your FIX acceptor.
Note: a session reset simplifies the handshake; it is always recommended where possible.

Is message sharing (QFJ storage) across instances needed for a FIX acceptor?

  • No, if application-level guaranteed delivery is defined. You should have your initiators always reset the sequence numbers; otherwise the session may run into problems when the acceptor side fails over or the initiator reconnects.
  • No, if message loss is acceptable. Again, have your initiators always reset the sequence numbers, for the same reason.
  • Yes, if you want sequence-gap application messages to be recovered by the FIX session layer.

Note: why may the session run into problems when a client reconnects without a session reset? Because the client may connect to a different acceptor instance that has NO sequence information for that session. The sequence numbers then cannot continue, and the session may end up in a wrong state because of the sequence number mismatch.
Note: you have to define the sequence number reset strategy based on the business requirement.
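The mismatch described in this note can be simulated in a few lines. Suppose the client has already received messages 1..80 from acceptor instance A, and A fails. If the client reconnects to instance B without a reset and B has its own empty state, B's Logon reply carries MsgSeqNum(34)=1 while the client expects 81. The sketch applies simplified receiver rules (equal: accept; higher: ResendRequest for the gap; lower: Logout) and ignores PossDupFlag and other details of the real protocol. `SeqMismatchDemo` and its outcome strings are hypothetical.

```java
// Simulates why intra-day reconnection to a different acceptor instance breaks
// without a shared store or a sequence reset. Names and rules are simplified.
final class SeqMismatchDemo {
    /** Sequence state an acceptor instance holds for one session. */
    static final class SeqState {
        int nextOutgoing = 1;  // MsgSeqNum(34) this instance will send next
    }

    /** How a receiver reacts to an incoming MsgSeqNum (simplified, no PossDupFlag handling). */
    static String check(int expected, int received) {
        if (received == expected) return "OK";
        if (received > expected) return "GAP: ResendRequest from " + expected;
        return "TOO LOW: Logout";
    }

    public static void main(String[] args) {
        int clientExpectsNext = 81;  // client already received 1..80 from instance A

        // Separate stores: instance B has never seen this session and starts at 1.
        SeqState instanceB = new SeqState();
        System.out.println("separate store: " + check(clientExpectsNext, instanceB.nextOutgoing));

        // Shared store: B loads A's state at logon and continues at 81.
        SeqState shared = new SeqState();
        shared.nextOutgoing = 81;
        System.out.println("shared store:   " + check(clientExpectsNext, shared.nextOutgoing));

        // Reset (141=Y): both sides restart at 1, so any instance can accept the session.
        System.out.println("with reset:     " + check(1, 1));
    }
}
```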

How to share the messages in a shared storage

A shared disk (e.g. NAS, SAN, DAS) or a database (via JDBC) can be used. In addition to setting up the storage itself (as described in the sections below), you have to set RefreshMessageStoreAtLogon=Y, which makes the session refresh its state from the store when a client logs on. For this parameter, see http://www.quickfixj.org/quickfixj/usermanual/1.6.4/usage/acceptor_failover.html
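The idea behind refreshing at logon is that whichever instance accepts the Logon re-reads the sequence state from the shared store instead of trusting anything cached in memory. The sketch below models that with one small file per session on a shared path. The file name and its "nextSender nextTarget" format are invented for illustration and are NOT QFJ's FileStore layout; a production store would also need atomic writes and must persist every message, not just the counters.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative shared sequence store: one small file per session on a shared path.
// Assumption: the "<nextSenderSeqNum> <nextTargetSeqNum>" format is invented here.
final class SharedSeqFile {
    private final Path file;

    SharedSeqFile(Path sharedDir, String sessionKey) {
        this.file = sharedDir.resolve(sessionKey + ".seqnums");
    }

    /** Persist both sequence numbers (write-through, not atomic in this sketch). */
    void save(int nextSender, int nextTarget) throws IOException {
        Files.writeString(file, nextSender + " " + nextTarget, StandardCharsets.US_ASCII);
    }

    /** Called at Logon: always re-read from the shared file; default to 1/1 if absent. */
    int[] refreshAtLogon() throws IOException {
        if (!Files.exists(file)) return new int[] {1, 1};
        String[] parts = Files.readString(file, StandardCharsets.US_ASCII).trim().split(" ");
        return new int[] {Integer.parseInt(parts[0]), Integer.parseInt(parts[1])};
    }

    public static void main(String[] args) throws IOException {
        Path shared = Files.createTempDirectory("shared-store");  // stands in for the NAS/SAN mount
        SharedSeqFile onNodeA = new SharedSeqFile(shared, "SERVER-CLIENT");
        onNodeA.save(81, 101);                                   // node A's state before it fails
        SharedSeqFile onNodeB = new SharedSeqFile(shared, "SERVER-CLIENT");
        int[] seq = onNodeB.refreshAtLogon();                    // node B picks up where A stopped
        System.out.println(seq[0] + " " + seq[1]);               // prints "81 101"
    }
}
```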

By shared disk, e.g. NAS, SAN, DAS.

This option also introduces complexity, and it appears to be more expensive than an open-source database server such as MySQL or PostgreSQL. During failover:

  • stop the failed node,
  • mount the disk on the standby node,
  • start the standby node as the live one.
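The order of these steps matters: if the standby mounts the disk while the failed node is still running, both nodes may write to the store, or both may hold a live FIX session. A minimal sketch that enforces the order as a state machine; `SharedDiskFailover` and its step names are hypothetical, and the actual stop/mount/start actions (performed by your cluster software) are left out.

```java
// Illustrative failover sequence for the shared-disk setup: each step is only
// allowed after the previous one has completed. Real actions are omitted.
final class SharedDiskFailover {
    enum Step { FAILED_NODE_RUNNING, FAILED_NODE_STOPPED, DISK_MOUNTED_ON_STANDBY, STANDBY_LIVE }

    private Step current = Step.FAILED_NODE_RUNNING;

    Step current() { return current; }

    void stopFailedNode()   { advance(Step.FAILED_NODE_RUNNING, Step.FAILED_NODE_STOPPED); }
    void mountOnStandby()   { advance(Step.FAILED_NODE_STOPPED, Step.DISK_MOUNTED_ON_STANDBY); }
    void startStandbyLive() { advance(Step.DISK_MOUNTED_ON_STANDBY, Step.STANDBY_LIVE); }

    /** Moves to `next` only from `required`; otherwise refuses the step. */
    private void advance(Step required, Step next) {
        if (current != required) {
            throw new IllegalStateException("cannot move to " + next + " from " + current);
        }
        current = next;
    }

    public static void main(String[] args) {
        SharedDiskFailover f = new SharedDiskFailover();
        try {
            f.mountOnStandby();  // refused: the failed node has not been stopped yet
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        f.stopFailedNode();
        f.mountOnStandby();
        f.startStandbyLive();
        System.out.println(f.current());  // STANDBY_LIVE
    }
}
```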

The relevant QFJ configuration setting is the store path, e.g. FileStorePath=./fileStore/FirstQFJServer, which must resolve to a location on the shared disk.

By database-jdbc

Follow [QFJ Storage - JDBC as message store] to set up the JDBC store factory in source code, and the related database.

Solution for failover

  1. Linux Heartbeat: http://www.linux-ha.org/wiki/Heartbeat "Heartbeat is a daemon that provides cluster infrastructure (communication and membership) services to its clients. This allows clients to know about the presence (or disappearance!) of peer processes on other machines and to easily exchange messages with them." It can be used for both initiators and acceptors, but it appears to require a lot of configuration; you will need a Linux operations expert.

  2. For a FIX acceptor: F5 or a similar product to distribute the TCP connection requests.

  3. Develop your own HA solution.
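If you develop your own HA solution, the core rule of the Live/Standby model still applies: only one instance may hold the FIX connection at a time, and the standby must not connect until the live one is known to be gone. One common way to enforce this is a time-limited lease that the live instance keeps renewing; a standby takes over only after the lease has expired. The sketch below is in-memory and single-JVM, with hypothetical names and timestamps passed in explicitly. A real implementation would keep the lease in shared storage (e.g. the same database used for the message store) with an atomic compare-and-set, and would have to tolerate clock skew between nodes.

```java
// Hypothetical lease for custom Live/Standby HA: an instance may open the FIX
// connection only while it holds an unexpired lease. In-memory sketch only.
final class LeaderLease {
    private final long leaseMillis;
    private String holder;   // instance currently allowed to be live, or null
    private long expiresAt;  // time (millis) when the holder's lease runs out

    LeaderLease(long leaseMillis) { this.leaseMillis = leaseMillis; }

    /** Acquire or renew; succeeds if the lease is free, expired, or already ours. */
    synchronized boolean tryAcquire(String instanceId, long nowMillis) {
        if (holder == null || nowMillis >= expiresAt || holder.equals(instanceId)) {
            holder = instanceId;
            expiresAt = nowMillis + leaseMillis;
            return true;
        }
        return false;
    }

    /** The live instance must stop sending and disconnect once this returns false. */
    synchronized boolean isLive(String instanceId, long nowMillis) {
        return instanceId.equals(holder) && nowMillis < expiresAt;
    }

    public static void main(String[] args) {
        LeaderLease lease = new LeaderLease(10_000);                // 10 s lease
        System.out.println(lease.tryAcquire("node-a", 0));       // true: node-a goes live
        System.out.println(lease.tryAcquire("node-b", 5_000));   // false: node-a still holds it
        System.out.println(lease.tryAcquire("node-b", 12_000));  // true: node-a stopped renewing
        System.out.println(lease.isLive("node-a", 12_000));      // false: node-a must not reconnect
    }
}
```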
