Enhance Airflow and Celery integration with improved logging and Docker configuration by LuisJG8 · Pull Request #34 · LuisJG8/github_etl

LuisJG8 · 2026-02-17T16:06:36Z

No description provided.

…er configuration

LuisJG8 · 2026-02-17T16:06:44Z

/gemini review

gemini-code-assist · 2026-02-17T16:06:52Z

Summary of Changes

Hello @LuisJG8, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the robustness and maintainability of the Airflow and Celery integration. It streamlines the Airflow DAG structure, introduces a standardized and configurable logging system for Celery workers, and improves Docker deployment by configuring log rotation. These changes collectively lead to more reliable task execution and better observability of the distributed system.

Highlights

Airflow DAG Refactoring: The main Airflow DAG function run_github_data_queue was renamed to run_queue, and an internal task run_queue was renamed to send_task_to_celery_worker. This task now explicitly passes start_in_repo_num and batch_size as keyword arguments to the Celery task.
Standardized Celery Logging: A new file celery_logging_config.py was introduced to centralize and standardize Celery worker logging. This configuration sets the log level via an environment variable and defines a consistent log format.
Celery Worker Logging Integration: The worker.py file now imports and utilizes the new celery_logging_config to apply the standardized logging. All print statements previously used for logging have been replaced with logger.info or logging.exception for proper log handling.
Docker Compose Logging Configuration: The docker-compose.yml file was updated to include json-file logging driver configuration for the celery_worker service, enabling log rotation with max-size and max-file options.
GitHub API Initialization in Celery Task: GitHub API token and instance initialization were moved inside the get_github_data Celery task in worker.py, ensuring that API instances are created per task execution rather than globally. The github_instance parameter was also removed from the task signature.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

airflow/dags/run_queue.py
- Renamed the main DAG function run_github_data_queue to run_queue.
- Renamed the internal task run_queue to send_task_to_celery_worker.
- Modified the send_task_to_celery_worker task to pass start_in_repo_num and batch_size as kwargs to the Celery task.
- Updated task dependencies to reflect the new task name.
celery_logging_config.py
- Added a new file to define a standardized logging configuration for Celery, including log level from environment variables and a specific format.
docker-compose.yml
- Configured json-file logging for the celery_worker service, including max-size and max-file options for log rotation.
rb_queue/rabbitmq.py
- Corrected a typo in a code comment from 'becayse' to 'because'.
worker.py
- Imported logging and celery_logging_config for standardized logging.
- Replaced print statements with logger.info and logging.exception for improved logging practices.
- Moved GitHub API token and instance initialization inside the get_github_data task.
- Removed the github_instance parameter from the get_github_data task signature.
- Added a break condition for counter >= 5 within the repository processing loop.
- Removed commented-out old code related to task distribution.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request enhances the Airflow and Celery integration by improving logging and Docker configuration. The changes include adding a centralized Celery logging configuration, setting up log rotation in Docker, and refactoring the Airflow DAG and Celery worker. My review has identified a few issues: a critical piece of debug code left in the worker that limits processing, a change that breaks the data flow in a Celery chord, inefficient initialization of GitHub clients, and some inconsistencies in logging. I've also pointed out an unused import and hardcoded values that could be improved.

gemini-code-assist · 2026-02-17T16:13:05Z

worker.py

+            if counter >= 5:
+                break


This block unconditionally breaks the loop after 5 iterations. This appears to be leftover debugging code and will prevent the task from processing the full batch_size. It should be removed.

gemini-code-assist · 2026-02-17T16:13:05Z

worker.py

 # bind = True allows to get task data, like task id
 @app.task(bind=True)
-def get_github_data(self, start_in_repo_num: int = 0, batch_size: int = 500, github_instance: Github = gh):
+def get_github_data(self, start_in_repo_num: int = 0, batch_size: int = 500):


The return mylist statement was removed from this function. This breaks the Celery chord (build_repo_chord) which relies on this task's return value to pass to aggregate_results. Without it, aggregate_results will receive None and the final result will be incorrect. Please add return mylist at the end of the function.

gemini-code-assist · 2026-02-17T16:13:05Z

worker.py

+    api_token, api_token_two = os.getenv("GITHUB_API_TOKEN"), os.getenv("GITHUB_API_TOKEN_SECOND_ACCOUNT")
+    auth, auth_two = Auth.Token(api_token), Auth.Token(api_token_two)
+    gh, gh_two = Github(auth=auth), Github(auth=auth_two)


The GitHub client instances (gh, gh_two) are initialized inside the get_github_data task. This is inefficient as it will happen on every task execution. These clients should be initialized once at the module level to be reused across task runs, as it was in the previous version of the code.

gemini-code-assist · 2026-02-17T16:13:05Z

airflow/dags/run_queue.py


-
-        app.send_task("worker.get_github_data")
+        app.send_task("worker.get_github_data", kwargs={"start_in_repo_num": 1000, "batch_size": 500})


The values for start_in_repo_num and batch_size are hardcoded. This makes the task less flexible and harder to configure. Consider defining these as constants at the top of the file or loading them from a configuration source like Airflow variables.

gemini-code-assist · 2026-02-17T16:13:05Z

celery_logging_config.py

@@ -0,0 +1,23 @@
+import logging
+import os
+from math import log 


The log function from the math module is imported but never used in this file. It should be removed to keep the code clean.

gemini-code-assist · 2026-02-17T16:13:05Z

worker.py

            except GithubException as ge:
                if ge.status == 403:
-                    print(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
+                    logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")


For consistency with the rest of the file and to leverage Celery's task-specific logging, please use the task logger logger.exception instead of logging.exception.

Suggested change

logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")

logger.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")

gemini-code-assist · 2026-02-17T16:13:05Z

worker.py

+                    logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
                else:
-                    print(f"GitHub API error {ge.status} for repo, skipping...")
+                    logging.exception(f"GitHub API error {ge.status} for repo, skipping...")


For consistency, please use logger.exception instead of logging.exception.

Suggested change

logging.exception(f"GitHub API error {ge.status} for repo, skipping...")

logger.exception(f"GitHub API error {ge.status} for repo, skipping...")

gemini-code-assist · 2026-02-17T16:13:05Z

worker.py

                continue
            except Exception as e:
-                print(f"Error accessing repo data: {e}, skipping...")
+                logging.exception(f"Error accessing repo data: {e}, skipping...")


For consistency, please use logger.exception instead of logging.exception.

Suggested change

logging.exception(f"Error accessing repo data: {e}, skipping...")

logger.exception(f"Error accessing repo data: {e}, skipping...")

gemini-code-assist

Code Review

This pull request introduces several enhancements to the Airflow and Celery integration, primarily focusing on improved logging and Docker configuration. The changes to configure Celery logging and Docker log rotation are excellent improvements. However, I've identified a few issues in the implementation, including a critical bug in worker.py where a change breaks the data aggregation logic. There are also some leftover debugging statements and inconsistencies in logging that should be addressed. My review provides specific suggestions to fix these issues and improve the overall code quality.

gemini-code-assist · 2026-02-17T16:20:26Z

worker.py



    # s3_url = save_to_s3(data=repo_collection, file_directory="github_repos/test.json")
    logger.info(f"Processed {counter} repositories")


The get_github_data task no longer returns the mylist of collected repositories because return mylist was removed. This breaks the chord defined in build_repo_chord, as the aggregate_results task will receive a list of None values instead of the repository data. Please restore return mylist at the end of this function to fix the aggregation logic.

gemini-code-assist · 2026-02-17T16:20:26Z

worker.py

+            if counter >= 5:
+                break


This condition if counter >= 5: break appears to be a debugging remnant. It will cause the loop to terminate after processing only 5 repositories, making the batch_size parameter ineffective. This should be removed to ensure the full batch is processed.

gemini-code-assist · 2026-02-17T16:20:26Z

airflow/dags/run_queue.py

        rate_limit = context["ti"].xcom_pull(task_ids="check_rate_limit", key="remaining")
        max_total_api_calls = context["ti"].xcom_pull(task_ids="check_rate_limit", key="total")


These variables are fetched from XCom but are never used within the send_task_to_celery_worker task. They should be removed to improve code clarity and avoid confusion.

gemini-code-assist · 2026-02-17T16:20:26Z

airflow/dags/run_queue.py


-
-        app.send_task("worker.get_github_data")
+        app.send_task("worker.get_github_data", kwargs={"start_in_repo_num": 1000, "batch_size": 500})


The arguments start_in_repo_num and batch_size are hardcoded. This reduces the flexibility of the DAG. Consider making these values configurable, for instance, by using Airflow variables or passing them as part of the DAG run configuration. The hardcoded start_in_repo_num: 1000 is particularly concerning as it might not be the intended starting point for all runs.

gemini-code-assist · 2026-02-17T16:20:26Z

celery_logging_config.py

@@ -0,0 +1,23 @@
+import logging
+import os
+from math import log 


The log function from the math module is imported but is not used. Unused imports should be removed to keep the code clean.

gemini-code-assist · 2026-02-17T16:20:26Z

worker.py

            except GithubException as ge:
                if ge.status == 403:
-                    print(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
+                    logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
                else:
-                    print(f"GitHub API error {ge.status} for repo, skipping...")
+                    logging.exception(f"GitHub API error {ge.status} for repo, skipping...")
                continue
            except Exception as e:
-                print(f"Error accessing repo data: {e}, skipping...")
+                logging.exception(f"Error accessing repo data: {e}, skipping...")
                continue


You're using logging.exception here, but a task-specific logger logger was obtained via get_task_logger. For consistency and to ensure logs have the proper Celery task context, you should use logger.exception instead. This is already done correctly in other parts of the file (e.g., line 156).

Suggested change

except GithubException as ge:

if ge.status == 403:

print(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")

logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")

else:

print(f"GitHub API error {ge.status} for repo, skipping...")

logging.exception(f"GitHub API error {ge.status} for repo, skipping...")

continue

except Exception as e:

print(f"Error accessing repo data: {e}, skipping...")

logging.exception(f"Error accessing repo data: {e}, skipping...")

continue

except GithubException as ge:

if ge.status == 403:

logger.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")

else:

logger.exception(f"GitHub API error {ge.status} for repo, skipping...")

continue

except Exception as e:

logger.exception(f"Error accessing repo data: {e}, skipping...")

continue

gemini-code-assist · 2026-02-17T16:20:26Z

worker.py

            except Exception as validation_error:
                print(f"Validation error for repo {github_data_points.get('full_name')}: {validation_error}")
                print("Skipping this repo and continuing")
                continue


These print statements should be converted to logger calls (e.g., logger.error or logger.warning) to be consistent with the logging improvements made elsewhere in this file. This ensures all output is handled by the configured logging system.

Suggested change

except Exception as validation_error:

print(f"Validation error for repo {github_data_points.get('full_name')}: {validation_error}")

print("Skipping this repo and continuing")

continue

except Exception as validation_error:

logger.error(f"Validation error for repo {github_data_points.get('full_name')}: {validation_error}")

logger.warning("Skipping this repo and continuing")

continue

Enhance Airflow and Celery integration with improved logging and Dock…

9a354f0

…er configuration

LuisJG8 merged commit 9a3c7f6 into MyProjects Feb 17, 2026
1 check failed

gemini-code-assist bot reviewed Feb 17, 2026

View reviewed changes



		app.send_task("worker.get_github_data")
		app.send_task("worker.get_github_data", kwargs={"start_in_repo_num": 1000, "batch_size": 500})

	logging.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")
	logger.exception(f"Skipping blocked repo (403): {repo.full_name if hasattr(repo, 'full_name') else 'unknown'}")

	logging.exception(f"GitHub API error {ge.status} for repo, skipping...")
	logger.exception(f"GitHub API error {ge.status} for repo, skipping...")

	logging.exception(f"Error accessing repo data: {e}, skipping...")
	logger.exception(f"Error accessing repo data: {e}, skipping...")



		# s3_url = save_to_s3(data=repo_collection, file_directory="github_repos/test.json")
		logger.info(f"Processed {counter} repositories")

		rate_limit = context["ti"].xcom_pull(task_ids="check_rate_limit", key="remaining")
		max_total_api_calls = context["ti"].xcom_pull(task_ids="check_rate_limit", key="total")

Conversation

LuisJG8 commented Feb 17, 2026

Uh oh!

LuisJG8 commented Feb 17, 2026

Uh oh!

gemini-code-assist bot commented Feb 17, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant