
Conversation

@dzhang0305 dzhang0305 commented Apr 16, 2025


New Features


  1. dripline/core/entity.py
    The scheduled_log functionality is copied over from the develop branch; it is not present in the main branch.
  2. dripline/core/interface.py
    Adds back the old authentication_obj way of setting the Authentication.
  3. dripline/core/service.py
    Schedules a message that checks that a specific routing key (rk_aliveness) is alive every heartbeat_broker_s seconds. This fixes the ghost queue and connection that appear after a long idle time. It requires two additional inputs in the service YAML file (a conceptual sketch of the check follows the example below), e.g.:

name: channel-1-switch
module: EthernetSCPIService
socket_info: "('10.95.101.122',23)"
socket_timeout: 5
cmd_at_reconnect:
  -
  - "MN?"
command_terminator: "\n"
response_terminator: "\r\n"
reconnect_test: "MN=RC-4SPDT-A18"
heartbeat_broker_s: 100
rk_aliveness: ch1_receiver_switch_state
endpoints:
  - name: ch1_receiver_switch_state
    module: FormatEntity
    get_str: "SWPORT?"
    set_str: "SETP={}"
    calibration: "{}"
    log_on_set: false
    get_on_set: false
    set_value_map:
      "sweep_ch1" : 1 #  0b0001 # 1
      "bypass_ch1" : 3 #0b0011 # 3
      "SAG" : 9 # 0b1001 # 9
      "sag" : 9 # 0b1001 # 9
      "jpa_pump" : 7 #  0b0111 # 7
      "take_data" : 8 #    0b1000 # 8

Fixes

  1. dripline/core/calibrate.py
    Converts value_raw to a string before looking it up in a calibration dictionary, because a dictionary coming from the config is by default interpreted as string-to-string (at least for bools); see the sketch below.
  2. dripline/implementations/postgres_interface.py
    Changes this_select = sqlalchemy.select(return_cols) to this_select = sqlalchemy.select(*return_cols).
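
For reference, the two fixes amount to something like the sketch below (simplified, not the actual dripline code):

# Fix 1 (calibrate.py, simplified): a calibration dictionary coming from the
# config has string keys, so look the raw value up by its string form.
def apply_calibration(value_raw, calibration):
    if isinstance(calibration, dict):
        key = str(value_raw)  # e.g. True -> "True", 9 -> "9"
        if key in calibration:
            return calibration[key]
        raise ValueError(f"no calibration entry for {value_raw!r}")
    return value_raw  # non-dict calibrations are handled elsewhere

# Fix 2 (postgres_interface.py): SQLAlchemy 1.4+/2.0 takes column objects as
# positional arguments rather than a single list, so unpack return_cols:
# this_select = sqlalchemy.select(return_cols)    # old form
# this_select = sqlalchemy.select(*return_cols)   # works with SQLAlchemy 2.0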

Prior to merging for releases:

  • update the project's version in the top-level CMakeLists.txt file
  • update the appVersion to be the new container image tag version in chart/Chart.yaml

@wcpettus
Contributor

wcpettus commented May 7, 2025

Sorry this is taking so long. I suggest splitting the review:

New Features:

  1. me
    • Since this is merging into develop (where these changes already exist) and not main, this is just creating a confusing conflict history. The only diff is in one of the debug lines, so I suggest keeping what's already in develop and excluding this.
  2. @nsoblath (will be much faster parsing changes to authentication)
  3. @nsoblath (probably much faster parsing changes to heartbeat)

Fixes

  1. me
    • Can you say more about this error? In dl3, does scarab's parsing of the yaml force all dictionary keys to be strings? This didn't use to be the case, but clearly you've found an error that needs fixing.
  2. me

@dzhang0305
Author

New Features:
3. @wcpettus @nsoblath I have recently updated this to use MsgRequest directly, but haven't uploaded it yet. I would like suggestions for speeding it up.

Fixes

  1. Here is an example of such an endpoint. The calibration keys (True and False) were interpreted as bools when scarab parsed the YAML config in DL2, but are taken as strings in DL3.
- name: Oxygen_Alarm
  module: plc_bool
  register: 12445
  bit: 5
  calibration:
    True: nominal
    False: alarm
  log_interval: 30
  max_interval: 600
  max_fractional_change: 0.1
  2. I tested your suggestion and it worked; I would like to adopt it.

@nsoblath
Member

nsoblath commented May 8, 2025

Regarding Fix 1: this is definitely a result of switching the underlying dl implementation to dl-cpp, even though calibrations are solely implemented in dl-py. Everything specified in a configuration file is first processed by the C++ code, and the specifications for how that's interpreted are taken, essentially, from the JSON specs. In a JSON object, the keys are always strings.

I haven't given much thought yet to a solution, but I think we want to bound it by (a) satisfying the needs of arbitrary type mappings for a calibration, matching the dl2 behavior, and (b) not making the config file notation more (or "too"?) complicated.
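
For illustration only (PyYAML stands in here for the YAML 1.1 key typing DL2 effectively relied on; dl-cpp uses its own parser), the key-type difference looks like this:

import json
import yaml  # PyYAML

# A YAML 1.1 parser resolves bare True/False keys to booleans.
yaml_doc = yaml.safe_load("calibration:\n  True: nominal\n  False: alarm\n")
print(yaml_doc["calibration"])  # {True: 'nominal', False: 'alarm'}

# A JSON object can only have string keys, so a JSON-spec-based interpretation
# of the same mapping keeps "True"/"False" as strings.
json_doc = json.loads('{"calibration": {"True": "nominal", "False": "alarm"}}')
print(json_doc["calibration"])  # {'True': 'nominal', 'False': 'alarm'}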

@nsoblath
Member

@dzhang0305 Regarding new feature 2 (adding the authentication object to the Interface init parameters), can you please describe a little about the use case for this?

@nsoblath
Member

@dzhang0305 Regarding new feature 3 (new heartbeat), let me see if I understand the issue and fix correctly, and please correct anything that's wrong:

  • This is to address the problem you reported via ADMX channels a while ago where queues would stick around beyond the life of the service that created them, making it impossible to restart a service quickly, e.g. after a crash.
  • You've figured out that these "ghost" queues seem to occur when a queue has not received a message for some extended period of time.
  • Therefore you implemented a way to avoid ghost queues by periodically sending a message to a RK that's bound to the service's own queue.

Assuming that's more-or-less correct, I have some thoughts and questions:

  • Do you have a sense for how long it takes a ghost queue to occur? In other words, how long does a queue need to be inactive for it to have this problem when the service shuts down?
  • The underlying problem seems to be either with the broker, with the rabbitmq library we use, or in how we're using the rabbitmq library, so it doesn't seem like this should be a change to the dripline standard.
  • However, I would anticipate that this problem is not isolated to dripline-python, and therefore I think it would be better to implement a solution like this in dripline-cpp instead of here.
  • Did you try having the service send a request to the service instead of one of its endpoints? Then it could be implemented without needing to separately specify the rk_aliveness parameter.

@dzhang0305
Author

@dzhang0305 Regarding new feature 2 (adding the authentication object to the Interface init parameters), can you please describe a little about the use case for this?

I was thinking dl-serve could run with args like "--auth-file /root/authentications.json" instead of relying only on the environment variables DRIPLINE_USER and DRIPLINE_PASSWORD. I don't think it's very crucial.

@dzhang0305
Author

dzhang0305 commented May 12, 2025

@dzhang0305 Regarding new feature 3 (new heartbeat), let me see if I understand the issue and fix correctly, and please correct anything that's wrong:

  • This is to address the problem you reported via ADMX channels a while ago where queues would stick around beyond the life of the service that created them, making it impossible to restart a service quickly, e.g. after a crash.
  • You've figured out that these "ghost" queues seem to occur when a queue has not received a message for some extended period of time.
  • Therefore you implemented a way to avoid ghost queues by periodically sending a message to a RK that's bound to the service's own queue.

Assuming that's more-or-less correct, I have some thoughts and questions:

  • Do you have a sense for how long it takes a ghost queue to occur? In other words, how long does a queue need to be inactive for it to have this problem when the service shuts down?
  • The underlying problem seems to be either with the broker, with the rabbitmq library we use, or in how we're using the rabbitmq library, so it doesn't seem like this should be a change to the dripline standard.
  • However, I would anticipate that this problem is not isolated to dripline-python, and therefore I think it would be better to implement a solution like this in dripline-cpp instead of here.
  • Did you try having the service send a request to the service instead of one of its endpoints? Then it could be implemented without needing to separately specify the rk_aliveness parameter.

Yes, this is to solve the 'ghost queue and connection' problem. I don't have solid proof, but it feels like the problem shows up when the rabbitmq log related to this service gets too long and the main connection of the queue is buried in the history (there are many connections created and destroyed within a couple of ms). It becomes problematic when it takes more than 10 s (the timeout setting) to find the connection of the queue.

I haven't tested how long it takes the queue/connection to fail; I will do it this week. My impression was that 2 hours without talking to the service is 100% problematic, while talking to it every 5 min is totally fine.

I haven't tried talking only to the service. fast_daq, for example, has its own heartbeat message, but the 'ghost queue problem' still showed up. The full round of "sending a message and processing the reply" is what keeps the connection healthy. I would like to solve it in a more elegant way, but for now I implement it as an 'on_get' of an endpoint.

I agree that implementing it in dripline-cpp would be more universal. Or we could implement the aliveness check in k8s or docker swarm instead of dripline. For fast_daq, which is based on dripline-cpp only, I implemented a healthcheck in the docker compose file with 'docker exec <> dl-agent get'. The workaround in docker swarm serves the purpose as well.

@nsoblath
Member

@dzhang0305 Regarding new feature 2 (adding the authentication object to the Interface init parameters), can you please describe a little about the use case for this?

I was thinking dl-serve could run with args like "--auth-file /root/authentications.json" instead of relying only on the environment variables DRIPLINE_USER and DRIPLINE_PASSWORD. I don't think it's very crucial.

This is already possible. For all CL options, including those related to authentication, you can see the docs or use dl-serve -h on the command line.
