
Enhance Airflow and Celery integration with improved logging and Docker configuration #34

Merged
LuisJG8 merged 1 commit into MyProjects from from_json_to_pqt on Feb 17, 2026

Conversation

@LuisJG8 (Owner) commented Feb 17, 2026

No description provided.

@LuisJG8 (Owner, Author) commented Feb 17, 2026

/gemini review

@gemini-code-assist (Contributor)

Summary of Changes

Hello @LuisJG8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request focuses on enhancing the robustness and maintainability of the Airflow and Celery integration. It streamlines the Airflow DAG structure, introduces a standardized and configurable logging system for Celery workers, and improves Docker deployment by configuring log rotation. These changes collectively lead to more reliable task execution and better observability of the distributed system.

Highlights

  • Airflow DAG Refactoring: The main Airflow DAG function run_github_data_queue was renamed to run_queue, and an internal task run_queue was renamed to send_task_to_celery_worker. This task now explicitly passes start_in_repo_num and batch_size as keyword arguments to the Celery task.
  • Standardized Celery Logging: A new file celery_logging_config.py was introduced to centralize and standardize Celery worker logging. This configuration sets the log level via an environment variable and defines a consistent log format.
  • Celery Worker Logging Integration: The worker.py file now imports and utilizes the new celery_logging_config to apply the standardized logging. All print statements previously used for logging have been replaced with logger.info or logging.exception for proper log handling.
  • Docker Compose Logging Configuration: The docker-compose.yml file was updated to include json-file logging driver configuration for the celery_worker service, enabling log rotation with max-size and max-file options.
  • GitHub API Initialization in Celery Task: GitHub API token and instance initialization were moved inside the get_github_data Celery task in worker.py, ensuring that API instances are created per task execution rather than globally. The github_instance parameter was also removed from the task signature.
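The contents of the new celery_logging_config.py are not included in this PR page, so the following is only a minimal sketch of what such a module might look like. The environment variable name CELERY_LOG_LEVEL, the format string, and the configure_logging helper are all hypothetical:

```python
import logging
import os

# Hypothetical sketch: the actual celery_logging_config.py is not shown in this PR.
LOG_LEVEL = os.getenv("CELERY_LOG_LEVEL", "INFO").upper()
LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"


def configure_logging() -> None:
    """Apply the env-driven level and a consistent format to the root logger."""
    logging.basicConfig(
        level=getattr(logging, LOG_LEVEL, logging.INFO),
        format=LOG_FORMAT,
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
```

A worker module would then simply import and call `configure_logging()` once at startup, before any tasks run.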

Changelog
  • airflow/dags/run_queue.py
    • Renamed the main DAG function run_github_data_queue to run_queue.
    • Renamed the internal task run_queue to send_task_to_celery_worker.
    • Modified the send_task_to_celery_worker task to pass start_in_repo_num and batch_size as kwargs to the Celery task.
    • Updated task dependencies to reflect the new task name.
  • celery_logging_config.py
    • Added a new file to define a standardized logging configuration for Celery, including log level from environment variables and a specific format.
  • docker-compose.yml
    • Configured json-file logging for the celery_worker service, including max-size and max-file options for log rotation.
  • rb_queue/rabbitmq.py
    • Corrected a typo in a code comment from 'becayse' to 'because'.
  • worker.py
    • Imported logging and celery_logging_config for standardized logging.
    • Replaced print statements with logger.info and logging.exception for improved logging practices.
    • Moved GitHub API token and instance initialization inside the get_github_data task.
    • Removed the github_instance parameter from the get_github_data task signature.
    • Added a break condition for counter >= 5 within the repository processing loop.
    • Removed commented-out old code related to task distribution.

@LuisJG8 LuisJG8 merged commit 9a3c7f6 into MyProjects Feb 17, 2026
1 check failed
@gemini-code-assist (bot) left a comment

Code Review

This pull request enhances the Airflow and Celery integration by improving logging and Docker configuration. The changes include adding a centralized Celery logging configuration, setting up log rotation in Docker, and refactoring the Airflow DAG and Celery worker. My review has identified a few issues: a critical piece of debug code left in the worker that limits processing, a change that breaks the data flow in a Celery chord, inefficient initialization of GitHub clients, and some inconsistencies in logging. I've also pointed out an unused import and hardcoded values that could be improved.

Comment on lines +130 to +131
if counter >= 5:
break

critical

This block unconditionally breaks the loop after 5 iterations. This appears to be leftover debugging code and will prevent the task from processing the full batch_size. It should be removed.

# bind = True allows to get task data, like task id
@app.task(bind=True)
- def get_github_data(self, start_in_repo_num: int = 0, batch_size: int = 500, github_instance: Github = gh):
+ def get_github_data(self, start_in_repo_num: int = 0, batch_size: int = 500):

high

The return mylist statement was removed from this function. This breaks the Celery chord (build_repo_chord) which relies on this task's return value to pass to aggregate_results. Without it, aggregate_results will receive None and the final result will be incorrect. Please add return mylist at the end of the function.
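The data flow the reviewer describes can be illustrated without a broker: in a Celery chord, each header task's return value is collected into the list handed to the body task, so a task that falls through without `return` contributes `None`. A plain-Python sketch (function and variable names are hypothetical stand-ins, not the project's actual code):

```python
# Plain-Python simulation of the data flow in a Celery chord (no broker needed).
# In a real chord, each header task's RETURN VALUE is collected into the list
# passed to the body task; here ordinary calls stand in for the header tasks.

def get_github_data_ok(batch):
    mylist = [repo.upper() for repo in batch]
    return mylist  # the value the chord collects


def get_github_data_broken(batch):
    mylist = [repo.upper() for repo in batch]
    # missing `return mylist`: Python implicitly returns None


def aggregate_results(results):
    # The chord body: flattens the per-task lists, skipping None entries.
    return [item for sub in results if sub for item in sub]


batches = [["a", "b"], ["c"]]
print(aggregate_results([get_github_data_ok(b) for b in batches]))      # ['A', 'B', 'C']
print(aggregate_results([get_github_data_broken(b) for b in batches]))  # []
```

With the `return` in place the body sees the repository data; without it, the aggregation collapses to nothing, which is exactly the failure mode flagged here.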

Comment on lines +52 to +54
api_token, api_token_two = os.getenv("GITHUB_API_TOKEN"), os.getenv("GITHUB_API_TOKEN_SECOND_ACCOUNT")
auth, auth_two = Auth.Token(api_token), Auth.Token(api_token_two)
gh, gh_two = Github(auth=auth), Github(auth=auth_two)

high

The GitHub client instances (gh, gh_two) are initialized inside the get_github_data task. This is inefficient as it will happen on every task execution. These clients should be initialized once at the module level to be reused across task runs, as it was in the previous version of the code.
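One common way to get per-process reuse without a bare module-level global is a cached factory. The sketch below uses a stub class in place of PyGithub's `Github` so it stays self-contained; the class, function, and env-var default are illustrative assumptions, not the project's actual code:

```python
import os
from functools import lru_cache


class StubGithubClient:
    """Stand-in for github.Github; real code would use Github(auth=Auth.Token(token))."""

    def __init__(self, token: str):
        self.token = token


@lru_cache(maxsize=None)
def get_client() -> StubGithubClient:
    # Constructed once per worker process and reused by every task run,
    # instead of re-authenticating inside each task execution.
    return StubGithubClient(os.getenv("GITHUB_API_TOKEN", "dummy-token"))
```

Tasks call `get_client()` at the top of their body; repeated calls return the same instance, so the authentication cost is paid once per process rather than once per task.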



- app.send_task("worker.get_github_data")
+ app.send_task("worker.get_github_data", kwargs={"start_in_repo_num": 1000, "batch_size": 500})

medium

The values for start_in_repo_num and batch_size are hardcoded. This makes the task less flexible and harder to configure. Consider defining these as constants at the top of the file or loading them from a configuration source like Airflow variables.
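One lightweight way to lift these values out of the call site is environment variables with defaults (Airflow Variables or the DAG-run conf are equally valid alternatives). The variable names below are hypothetical, not taken from the project:

```python
import os

# Hypothetical env-var names; Airflow Variables or dag_run.conf are alternatives.
START_IN_REPO_NUM = int(os.getenv("START_IN_REPO_NUM", "0"))
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "500"))

task_kwargs = {"start_in_repo_num": START_IN_REPO_NUM, "batch_size": BATCH_SIZE}
# app.send_task("worker.get_github_data", kwargs=task_kwargs)  # unchanged call site
```

The send_task call itself stays the same; only the source of the numbers changes, so different environments can tune the batch without a code edit.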

@@ -0,0 +1,23 @@
import logging
import os
from math import log

medium

The log function from the math module is imported but never used in this file. It should be removed to keep the code clean.

except GithubException as ge:
if ge.status == 403:
- print(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
+ logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")

medium

For consistency with the rest of the file and to leverage Celery's task-specific logging, please use the task logger logger.exception instead of logging.exception.

Suggested change
- logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
+ logger.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
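The distinction behind this suggestion is that Celery's `get_task_logger` returns a named logger, so records carry the task module's context, whereas module-level `logging.exception` always goes through the root logger. The standard library alone can illustrate the difference (the logger name "worker" is assumed to match the project's module):

```python
import logging

# get_task_logger("worker") in Celery is, at its core, a named logger like this,
# so logger.exception records under the task's logger name, with the traceback.
logger = logging.getLogger("worker")


def fetch_repo() -> str:
    try:
        raise RuntimeError("GitHub API error 403")  # simulated API failure
    except RuntimeError:
        # Logs at ERROR level with the traceback, attributed to "worker",
        # unlike bare logging.exception, which records against the root logger.
        logger.exception("GitHub API error, skipping...")
        return logger.name
```

Keeping every call on the task logger means a single handler and format rule applies to all worker output, which is the consistency the review is asking for.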

logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
else:
- print(f"GitHub API error {ge.status} for repo, skipping...")
+ logging.exception(f"GitHub API error {ge.status} for repo, skipping...")

medium

For consistency, please use logger.exception instead of logging.exception.

Suggested change
- logging.exception(f"GitHub API error {ge.status} for repo, skipping...")
+ logger.exception(f"GitHub API error {ge.status} for repo, skipping...")

continue
except Exception as e:
- print(f"Error accessing repo data: {e}, skipping...")
+ logging.exception(f"Error accessing repo data: {e}, skipping...")

medium

For consistency, please use logger.exception instead of logging.exception.

Suggested change
- logging.exception(f"Error accessing repo data: {e}, skipping...")
+ logger.exception(f"Error accessing repo data: {e}, skipping...")

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces several enhancements to the Airflow and Celery integration, primarily focusing on improved logging and Docker configuration. The changes to configure Celery logging and Docker log rotation are excellent improvements. However, I've identified a few issues in the implementation, including a critical bug in worker.py where a change breaks the data aggregation logic. There are also some leftover debugging statements and inconsistencies in logging that should be addressed. My review provides specific suggestions to fix these issues and improve the overall code quality.



# s3_url = save_to_s3(data=repo_collection, file_directory="github_repos/test.json")
logger.info(f"Processed {counter} repositories")

critical

The get_github_data task no longer returns the mylist of collected repositories because return mylist was removed. This breaks the chord defined in build_repo_chord, as the aggregate_results task will receive a list of None values instead of the repository data. Please restore return mylist at the end of this function to fix the aggregation logic.

Comment on lines +130 to +131
if counter >= 5:
break

high

This condition if counter >= 5: break appears to be a debugging remnant. It will cause the loop to terminate after processing only 5 repositories, making the batch_size parameter ineffective. This should be removed to ensure the full batch is processed.

Comment on lines 45 to 46
rate_limit = context["ti"].xcom_pull(task_ids="check_rate_limit", key="remaining")
max_total_api_calls = context["ti"].xcom_pull(task_ids="check_rate_limit", key="total")

medium

These variables are fetched from XCom but are never used within the send_task_to_celery_worker task. They should be removed to improve code clarity and avoid confusion.



- app.send_task("worker.get_github_data")
+ app.send_task("worker.get_github_data", kwargs={"start_in_repo_num": 1000, "batch_size": 500})

medium

The arguments start_in_repo_num and batch_size are hardcoded. This reduces the flexibility of the DAG. Consider making these values configurable, for instance, by using Airflow variables or passing them as part of the DAG run configuration. The hardcoded start_in_repo_num: 1000 is particularly concerning as it might not be the intended starting point for all runs.

@@ -0,0 +1,23 @@
import logging
import os
from math import log

medium

The log function from the math module is imported but is not used. Unused imports should be removed to keep the code clean.

Comment on lines 99 to 107
except GithubException as ge:
if ge.status == 403:
print(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
else:
print(f"GitHub API error {ge.status} for repo, skipping...")
logging.exception(f"GitHub API error {ge.status} for repo, skipping...")
continue
except Exception as e:
print(f"Error accessing repo data: {e}, skipping...")
logging.exception(f"Error accessing repo data: {e}, skipping...")
continue

medium

You're using logging.exception here, but a task-specific logger logger was obtained via get_task_logger. For consistency and to ensure logs have the proper Celery task context, you should use logger.exception instead. This is already done correctly in other parts of the file (e.g., line 156).

Suggested change
except GithubException as ge:
if ge.status == 403:
print(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
else:
print(f"GitHub API error {ge.status} for repo, skipping...")
logging.exception(f"GitHub API error {ge.status} for repo, skipping...")
continue
except Exception as e:
print(f"Error accessing repo data: {e}, skipping...")
logging.exception(f"Error accessing repo data: {e}, skipping...")
continue
except GithubException as ge:
if ge.status == 403:
logger.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
else:
logger.exception(f"GitHub API error {ge.status} for repo, skipping...")
continue
except Exception as e:
logger.exception(f"Error accessing repo data: {e}, skipping...")
continue

Comment on lines 122 to 125
except Exception as validation_error:
print(f"Validation error for repo {github_data_points.get('full_name')}: {validation_error}")
print("Skipping this repo and continuing")
continue

medium

These print statements should be converted to logger calls (e.g., logger.error or logger.warning) to be consistent with the logging improvements made elsewhere in this file. This ensures all output is handled by the configured logging system.

Suggested change
except Exception as validation_error:
print(f"Validation error for repo {github_data_points.get('full_name')}: {validation_error}")
print("Skipping this repo and continuing")
continue
except Exception as validation_error:
logger.error(f"Validation error for repo {github_data_points.get('full_name')}: {validation_error}")
logger.warning("Skipping this repo and continuing")
continue
