[New Feature]: Consider moving evaluator logic to Airflow #7

@stirlingalgermissen

Description

The Unity Initiator, as defined in the README, is largely managed by AWS infrastructure. At first glance, there are a number of challenges with this design:

  • Evaluator queues and DLQs are outside of the Airflow framework

    • Evaluator jobs are difficult for operators to track, and the Airflow UI offers no visibility into this system
    • Simple operator actions such as retrying an evaluation are difficult at best
    • Tracking evaluator jobs within AWS at scale is not practical. Logs are not available within Airflow's interface. CloudWatch costs are higher than they need to be compared with using Airflow's S3 logging capability
    • DLQs are not a robust mechanism for tracking evaluator job failures. Queued jobs expire and are lost for good within 14 days, and the same goes for the eval queues. Operators can't easily inspect either in the AWS console
    • As mentioned in the README, any evaluator that takes more than 15 minutes would not work out of the box with this design and would require ECS resources. It would be cleaner to run and manage these tasks with k8s, just like all other tasks within Airflow
    • Airflow provides an excellent library of tools for running jobs, none of which is available with this design
    • Running this on-prem is not possible without updates
  • Unity initiator and a corresponding Airflow deployment are tightly coupled

    • If Airflow is down, this system will fail and is a mess to clean up. Continuous deployment is significantly harder with this design than with what Airflow/k8s provides. K8s manifests or Helm charts, for example, cannot easily be used here for CD
    • dag1 -> dag2 triggering, where an initiator sits between DAGs, is more complicated than it likely needs to be, and it is unclear exactly how it would work with this design. It seems you would need to move initiators to Airflow anyway to support that case
    • With enough messages or Lambdas, this design could easily overwhelm an Airflow deployment. Back-pressure management is much harder
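
To make the back-pressure point concrete, here is a minimal sketch of slot-based admission, the kind of bounded concurrency that Airflow pools provide. `SlotPool`, its methods, and the `eval-N` job ids are hypothetical names for illustration only. They are not Airflow or Unity Initiator APIs.

```python
from collections import deque


class SlotPool:
    """Toy stand-in for a fixed-size pool of worker slots (hypothetical, not an Airflow API)."""

    def __init__(self, slots):
        self.slots = slots
        self.running = set()
        self.waiting = deque()

    def submit(self, job_id):
        """Start the job if a slot is free; otherwise hold it in a FIFO queue."""
        if len(self.running) < self.slots:
            self.running.add(job_id)
            return "running"
        self.waiting.append(job_id)
        return "queued"

    def finish(self, job_id):
        """Release the job's slot and promote the oldest waiting job, if any."""
        self.running.remove(job_id)
        if self.waiting:
            promoted = self.waiting.popleft()
            self.running.add(promoted)
            return promoted
        return None


# Five evaluator jobs arrive at once, but only two slots exist.
pool = SlotPool(slots=2)
states = [pool.submit(f"eval-{i}") for i in range(5)]
promoted = pool.finish("eval-0")
```

With two slots, the first two jobs start and the other three wait until `finish` frees a slot. A burst of messages therefore queues up instead of hitting the scheduler all at once.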

Please let me know if I am misunderstanding anything. Perhaps we can discuss further for SRL? It seems it would be relatively straightforward to move what has been developed into first-class Airflow DAGs
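
A rough sketch of what that could look like, assuming an evaluator can be reduced to a plain Python callable that an Airflow task would invoke. The `evaluate` function, the `payload_url` field, and the example bucket are all hypothetical and not taken from the existing implementation.

```python
from typing import Optional

REQUIRED_KEYS = ("payload_url",)  # hypothetical field name, for illustration only


def evaluate(message: dict) -> Optional[dict]:
    """Return the conf for a downstream DAG run, or None if the message should be skipped."""
    # Skip messages that are missing any required field (or carry an empty value).
    if any(not message.get(key) for key in REQUIRED_KEYS):
        return None
    # Pass along only the fields the downstream DAG needs.
    return {key: message[key] for key in REQUIRED_KEYS}


accepted = evaluate({"payload_url": "s3://example-bucket/file.dat", "extra": 1})
skipped = evaluate({})
```

Run as an Airflow task, a function shaped like this would inherit Airflow's retries, task logs, and UI visibility, rather than depending on queues and DLQs outside the framework.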


Labels: enhancement (New feature or request)
