Skip to content

Nature of error not considered when deciding whether to use a fallback host for realtime connections #444

@lawrence-forooghian

Description

@lawrence-forooghian

RTN17f:

Errors that necessitate use of an alternative host include any of the failure conditions specified in RSC15l, and additionally also (…)

But — and perhaps I've just not read things properly — I can't see anywhere where we take into account the error when deciding whether to use a fallback host:

def should_use_fallback_hosts?
if client.fallback_hosts && !client.fallback_hosts.empty?
if connecting? && previous_state && !disconnected_from_connected_state?
use_fallback_if_disconnected? || use_fallback_if_suspended?
end
end
end
def disconnected_from_connected_state?
most_recent_state_changes = state_history.last(3).first(2) # Ignore current state
# A valid connection was disconnected
most_recent_state_changes.last.fetch(:state) == Connection::STATE.Disconnected &&
most_recent_state_changes.first.fetch(:state) == Connection::STATE.Connected
end
def use_fallback_if_disconnected?
second_reconnect_attempt_for(:disconnected, 1)
end
def use_fallback_if_suspended?
second_reconnect_attempt_for(:suspended, 2) # on first suspended state use default Ably host again
end
def second_reconnect_attempt_for(state, first_attempt_count)
previous_state == state && manager.retry_count_for_state(state) >= first_attempt_count
end

Amongst other things, this means that a realtime connection tries using a fallback host when it fails to establish a connection due to a token error.

I am a bit hesitant about trying to change this code in a rush because I don't fully understand its logic for deciding when to use a fallback host; there is some complicated-looking logic that considers the history of state changes that the connection has been through, which doesn't (at least, obviously) match up to any behaviour described in the spec.

(I've done a partial fix for this in #443, to explicitly prevent token errors from triggering usage of fallback hosts, but it needs fixing properly to instead only use a fallback host when faced with one of the errors described in RTN17f. To do that would require a deeper understanding of how these errors manifest inside our library.)

Note that the REST code does seem to care about the nature of the error:

rescue *([Faraday::TimeoutError, Ably::Exceptions::ServerError] + FARADAY_CLIENT_OR_SERVER_ERRORS) => error
retry_sequence_id ||= SecureRandom.urlsafe_base64(4)
time_passed = Time.now - requested_at
if can_fallback_to_alternate_ably_host? && (retry_count < max_retry_count) && (time_passed <= max_retry_duration)
retry_count += 1
retry_log_severity = log_retries_as_info ? :info : :warn
logger.public_send(retry_log_severity) { "Ably::Rest::Client - Retry #{retry_count} for #{method} #{path} #{params} as initial attempt failed (seq ##{retry_sequence_id}): #{error}" }
retry
end
retry_log_severity = log_retries_as_info ? :info : :error
logger.public_send(retry_log_severity) do
"Ably::Rest::Client - Request FAILED after #{retry_count} #{retry_count > 1 ? 'retries' : 'retry' } for" \
" #{method} #{path} #{params} (seq ##{retry_sequence_id}, time elapsed #{(Time.now.to_f - requested_at.to_f).round(2)}s)"
end
case error
when Faraday::TimeoutError
raise Ably::Exceptions::ConnectionTimeout.new(error.message, nil, Ably::Exceptions::Codes::CONNECTION_TIMED_OUT, error, { request_id: request_id })
when *FARADAY_CLIENT_OR_SERVER_ERRORS
# request_id is also available in the request context
raise Ably::Exceptions::ConnectionError.new(error.message, nil, Ably::Exceptions::Codes::CONNECTION_FAILED, error, { request_id: request_id })
else
raise error
end
end
end

┆Issue is synchronized with this Jira Task by Unito

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working. It's clear that this does need to be fixed.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions