
Conversation

@dzhang0305 dzhang0305 commented Apr 16, 2025


New Features


  1. dripline/core/entity.py
    The scheduled_log functionality is copied over from the develop branch; it is not present in the main branch.
  2. dripline/core/interface.py
    Adds back the old authentication_obj way of setting the Authentication.
  3. dripline/core/service.py
    Schedules a message that checks that a specific routing key (rk_aliveness) is alive every heartbeat_broker_s seconds. This fixes the ghost queue and connection that appear after a long idle time. It requires two additional inputs in the service YAML file (a conceptual sketch of the check follows the example below), e.g.:

name: channel-1-switch
module: EthernetSCPIService
socket_info: "('10.95.101.122',23)"
socket_timeout: 5
cmd_at_reconnect:
  -
  - "MN?"
command_terminator: "\n"
response_terminator: "\r\n"
reconnect_test: "MN=RC-4SPDT-A18"
heartbeat_broker_s: 100
rk_aliveness: ch1_receiver_switch_state
endpoints:
  - name: ch1_receiver_switch_state
    module: FormatEntity
    get_str: "SWPORT?"
    set_str: "SETP={}"
    calibration: "{}"
    log_on_set: false
    get_on_set: false
    set_value_map:
      "sweep_ch1" : 1 #  0b0001 # 1
      "bypass_ch1" : 3 #0b0011 # 3
      "SAG" : 9 # 0b1001 # 9
      "sag" : 9 # 0b1001 # 9
      "jpa_pump" : 7 #  0b0111 # 7
      "take_data" : 8 #    0b1000 # 8

Fixes

  1. dripline/core/calibrate.py
    Converts value_raw to a string before looking it up in a calibration dictionary, because a dictionary coming from the config is by default interpreted as string-to-string (at least for bools); see the sketch below.
  2. dripline/implementations/postgres_interface.py
    Changes this_select = sqlalchemy.select(return_cols) to this_select = sqlalchemy.select(*return_cols).
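
For reference, the two fixes amount to something like the sketch below (simplified, not the actual dripline code):

# Fix 1 (calibrate.py, simplified): a calibration dictionary coming from the
# config has string keys, so look the raw value up by its string form.
def apply_calibration(value_raw, calibration):
    if isinstance(calibration, dict):
        key = str(value_raw)  # e.g. True -> "True", 9 -> "9"
        if key in calibration:
            return calibration[key]
        raise ValueError(f"no calibration entry for {value_raw!r}")
    return value_raw  # non-dict calibrations are handled elsewhere

# Fix 2 (postgres_interface.py): SQLAlchemy 1.4+/2.0 takes column objects as
# positional arguments rather than a single list, so unpack return_cols:
# this_select = sqlalchemy.select(return_cols)    # old form
# this_select = sqlalchemy.select(*return_cols)   # works with SQLAlchemy 2.0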

Prior to merging for releases:

  • update the project's version in the top-level CMakeLists.txt file
  • update the appVersion to be the new container image tag version in chart/Chart.yaml

@wcpettus
Contributor

wcpettus commented May 7, 2025

Sorry this is taking so long. I suggest splitting the review:

New Features:

  1. me
    • Since this is merging into develop (where these changes already exist) and not main, this is just creating a confusing conflict history. The only diff is in one of the debug lines, so I suggest keeping what's already in develop and excluding this.
  2. @nsoblath (will be much faster parsing changes to authentication)
  3. @nsoblath (probably much faster parsing changes to heartbeat)

Fixes

  1. me
    • Can you say more about this error? In dl3, does scarab's parsing of the yaml force all dictionary keys to be strings? This didn't use to be the case, but clearly you've found an error that needs fixing.
  2. me

@dzhang0305
Author

New Features:
3. @wcpettus @nsoblath I have recently updated this to use MsgRequest directly, but haven't uploaded it yet. I would like suggestions for speeding it up.

Fixes

  1. Here is an example of such an endpoint. The calibration keys (True and False) were interpreted as bools when scarab parsed the YAML config in DL2, but are taken as strings in DL3.
- name: Oxygen_Alarm
  module: plc_bool
  register: 12445
  bit: 5
  calibration:
    True: nominal
    False: alarm
  log_interval: 30
  max_interval: 600
  max_fractional_change: 0.1
  2. I tested your suggestion and it worked; I would like to adopt it.

@nsoblath
Member

nsoblath commented May 8, 2025

Regarding Fix 1: this is definitely a result of switching the underlying dl implementation to dl-cpp, even though calibrations are solely implemented in dl-py. Everything specified in a configuration file is first processed by the C++ code, and the specifications for how that's interpreted are taken, essentially, from the JSON specs. In a JSON object, the keys are always strings.

I haven't given much thought yet to a solution, but I think we want to bound it by (a) satisfying the needs of arbitrary type mappings for a calibration, matching the dl2 behavior, and (b) not making the config file notation more (or "too"?) complicated.
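
For illustration only (PyYAML stands in here for the YAML 1.1 key typing DL2 effectively relied on; dl-cpp uses its own parser), the key-type difference looks like this:

import json
import yaml  # PyYAML

# A YAML 1.1 parser resolves bare True/False keys to booleans.
yaml_doc = yaml.safe_load("calibration:\n  True: nominal\n  False: alarm\n")
print(yaml_doc["calibration"])  # {True: 'nominal', False: 'alarm'}

# A JSON object can only have string keys, so a JSON-spec-based interpretation
# of the same mapping keeps "True"/"False" as strings.
json_doc = json.loads('{"calibration": {"True": "nominal", "False": "alarm"}}')
print(json_doc["calibration"])  # {'True': 'nominal', 'False': 'alarm'}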

@nsoblath
Member

@dzhang0305 Regarding new feature 2 (adding the authentication object to the Interface init parameters), can you please describe a little about the use case for this?

@nsoblath
Member

@dzhang0305 Regarding new feature 3 (new heartbeat), let me see if I understand the issue and fix correctly, and please correct anything that's wrong:

  • This is to address the problem you reported via ADMX channels a while ago where queues would stick around beyond the life of the service that created them, making it impossible to restart a service quickly, e.g. after a crash.
  • You've figured out that these "ghost" queues seem to occur when a queue has not received a message for some extended period of time.
  • Therefore you implemented a way to avoid ghost queues by periodically sending a message to a RK that's bound to the service's own queue.

Assuming that's more-or-less correct, I have some thoughts and questions:

  • Do you have a sense for how long it takes a ghost queue to occur? In other words, how long does a queue need to be inactive for it to have this problem when the service shuts down?
  • The underlying problem seems to be either with the broker, with the rabbitmq library we use, or in how we're using the rabbitmq library, so it doesn't seem like this should be a change to the dripline standard.
  • However, I would anticipate that this problem is not isolated to dripline-python, and therefore I think it would be better to implement a solution like this in dripline-cpp instead of here.
  • Did you try having the service send a request to the service instead of one of its endpoints? Then it could be implemented without needing to separately specify the rk_aliveness parameter.

@dzhang0305
Author

@dzhang0305 Regarding new feature 2 (adding the authentication object to the Interface init parameters), can you please describe a little about the use case for this?

I was thinking dl-serve could run with args like "--auth-file /root/authentications.json" instead of relying only on the environment variables DRIPLINE_USER and DRIPLINE_PASSWORD. I don't think it's very crucial.

@dzhang0305
Author

dzhang0305 commented May 12, 2025

@dzhang0305 Regarding new feature 3 (new heartbeat), let me see if I understand the issue and fix correctly, and please correct anything that's wrong:

  • This is to address the problem you reported via ADMX channels a while ago where queues would stick around beyond the life of the service that created them, making it impossible to restart a service quickly, e.g. after a crash.
  • You've figured out that these "ghost" queues seem to occur when a queue has not received a message for some extended period of time.
  • Therefore you implemented a way to avoid ghost queues by periodically sending a message to a RK that's bound to the service's own queue.

Assuming that's more-or-less correct, I have some thoughts and questions:

  • Do you have a sense for how long it takes a ghost queue to occur? In other words, how long does a queue need to be inactive for it to have this problem when the service shuts down?
  • The underlying problem seems to be either with the broker, with the rabbitmq library we use, or in how we're using the rabbitmq library, so it doesn't seem like this should be a change to the dripline standard.
  • However, I would anticipate that this problem is not isolated to dripline-python, and therefore I think it would be better to implement a solution like this in dripline-cpp instead of here.
  • Did you try having the service send a request to the service instead of one of its endpoints? Then it could be implemented without needing to separately specify the rk_aliveness parameter.

Yes, this is to solve the 'ghost queue and connection' problem. I don't have solid proof, but it feels like the problem shows up when the rabbitmq log related to this service gets too long and the main connection of the queue is buried in the history (there are many connections created and destroyed within a couple of ms). It becomes problematic when it takes more than 10 s (the timeout setting) to find the connection of the queue.

I haven't tested how long it takes the queue/connection to fail; I will do it this week. My impression was that 2 hours without talking to the service is 100% problematic, while talking to it every 5 min is totally fine.

I haven't tried talking only to the service. fast_daq, for example, has its own heartbeat message, but the 'ghost queue problem' still showed up. The full round of "sending a message and processing the reply" is what keeps the connection healthy. I would like to solve it in a more elegant way, but for now I implement it as an 'on_get' of an endpoint.

I agree that implementing it in dripline-cpp would be more universal. Or we could implement the aliveness check in k8s or docker swarm instead of dripline. For fast_daq, which is based on dripline-cpp only, I implemented a healthcheck in the docker compose file with 'docker exec <> dl-agent get'. The workaround in docker swarm serves the purpose as well.

@nsoblath
Member

@dzhang0305 Regarding new feature 2 (adding the authentication object to the Interface init parameters), can you please describe a little about the use case for this?

I was thinking dl-serve could run with args like "--auth-file /root/authentications.json" instead of relying only on the environment variables DRIPLINE_USER and DRIPLINE_PASSWORD. I don't think it's very crucial.

This is already possible. For all CL options, including those related to authentication, you can see the docs or use dl-serve -h on the command line.
