
feat: add TTL support for environment instance auto-cleanup #67

Merged
JacksonMei merged 12 commits into main from sky/api-service-ttl on Feb 27, 2026

@lanmaoxinqing
Collaborator

Overview

This PR adds TTL (time-to-live) support for environment instances so that expired instances are cleaned up automatically.

Key Changes

Core Features

  • TTL Support: Added TTL configuration for environment instances with automatic cleanup based on creation time
  • Unified Cleanup Manager: Implemented centralized auto-cleanup manager in API service for managing environment instance lifecycle
  • Configurable Cleanup Interval: Added cleanupInterval configuration with improved time parsing functionality

Client Integration

  • asandbox Client: Extended asandbox client with TTL support
  • Simplified Response: Streamlined response structure for asandbox operations

Code Quality

  • Refactoring: Removed AEnvCleaner interface for cleaner architecture
  • Unit Tests: Added comprehensive unit tests for cleanup service functionality
  • Bug Fixes: Fixed timestamp handling for asandbox instance creation

Configuration Changes

  • TTL field type maintained as string for better API flexibility
  • Removed invalid label selectors for improved reliability

Testing

  • Unit tests added for cleanup service (cleanup_service_test.go)
  • Manual testing verified for TTL-based environment cleanup

Breaking Changes

None

- Add GetTTL() method to Env model to read TTL value from DeployConfig
- Update CreateInstanceByFunction to accept TTL parameter from request
- Refactor InitializeFunction API to use request body for parameters
- Add FunctionInitializeOptions struct for initialization options

This enables TTL-based instance lifecycle management through FaaS backend.

- Change EnvInstance.TTL from string to int64 (seconds)
- Update Instance.TTL in faas_model to int64
- Update PodListResponseData.TTL to int64
- Refactor cleanup_service.isExpired() to use int64 TTL:
  - Check TTL <= 0 instead of empty string
  - Use time.Duration(instance.TTL) * time.Second directly
  - Remove time.ParseDuration parsing
- Update logging format from %s to %d for TTL
- Update test files to use int64 seconds instead of duration strings

This simplifies TTL handling by using seconds as the standard unit
instead of duration strings.

- Add --cleanup-interval argument to api-service deployment
- Add cleanupInterval value to api-service/values.yaml (default: 5m)
- Improve time parsing in cleanup_service.isExpired():
  - Use time.DateTime constant instead of hardcoded format string
  - Add fallback to time.RFC3339 for compatibility
- Add comprehensive unit tests for time parsing:
  - TestIsExpiredWithDateTimeFormat: tests DateTime format parsing
  - TestIsExpiredWithRFC3339Format: tests RFC3339 fallback parsing
  - TestIsExpiredWithInvalidTimeFormat: tests invalid time format handling
  - TestIsExpiredWithInvalidTTLFormat: tests invalid TTL format handling
  - TestIsExpiredWithEmptyTTL: tests empty TTL handling
  - TestIsExpiredWithVariousTTLDurations: tests various TTL durations

All tests pass successfully.
@gemini-code-assist
Contributor

Summary of Changes

Hello @lanmaoxinqing, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the environment management system by introducing robust Time-To-Live (TTL) functionality for environment instances. It centralizes the automatic cleanup process within the API service, allowing instances to be automatically terminated after a specified duration. This change improves resource utilization, reduces manual overhead, and provides better control over environment lifecycles. The update also includes comprehensive testing, improved timestamp handling, and refined API interactions with the FaaS backend, ensuring a more reliable and maintainable system.

Highlights

  • TTL-based Auto-Cleanup: Implemented Time-To-Live (TTL) support for environment instances, enabling automatic cleanup based on their creation time and configured TTL.
  • Centralized Cleanup Manager: Introduced a unified AEnvCleanManager in the API service to orchestrate the lifecycle management of environment instances.
  • Configurable Cleanup Interval: Added a cleanupInterval configuration option, allowing administrators to define how frequently the cleanup service runs.
  • Cleanup Metrics: Integrated Prometheus metrics to track the success and failure rates of auto-cleanup operations.
  • Refactored Cleanup Service: Simplified the cleanup service architecture by removing the AEnvCleaner interface and directly integrating EnvInstanceService.
  • Enhanced Unit Testing: Added comprehensive unit tests for the AEnvCleanManager and its TTL expiration logic, utilizing mock services for isolation.
  • Improved Timestamp Handling: Corrected and standardized timestamp parsing and formatting across the FaaS client and schedule client to use time.RFC3339 and time.UnixMilli.
  • FaaS Client Integration: Extended the FaaS client to pass TTL information during instance creation and retrieve it when listing or getting instances.
  • Simplified FaaS API Responses: Refactored FaaS client to use dedicated structs for instance list and get responses, improving type safety and simplifying parsing.
  • Helm Chart Configuration: Updated Helm charts to expose the cleanupInterval as a configurable parameter for the API service.
  • Label Selector Correction: Removed an invalid label selector from the Redis Helm chart helper template.

Changelog
  • api-service/main.go
    • Updated AEnvCleanManager initialization to include Prometheus metrics for cleanup success and failure.
  • api-service/middleware/metrics.go
    • Added auto_cleanup_success_total and auto_cleanup_failure_total Prometheus counters.
    • Introduced IncrementCleanupSuccess and IncrementCleanupFailure functions to update these counters.
  • api-service/service/cleanup_service.go
    • Removed the AEnvCleaner interface, simplifying the cleanup manager's dependencies.
    • Modified AEnvCleanManager to directly depend on EnvInstanceService for instance operations.
    • Added WithMetrics method to AEnvCleanManager for optional metric integration.
    • Implemented performCleanup method to iterate through environment instances, check for TTL expiration, and trigger deletion.
    • Introduced isExpired helper function to determine if an instance's TTL has passed, supporting time.DateTime and time.RFC3339 formats for CreatedAt.
  • api-service/service/cleanup_service_test.go
    • Replaced the original TestNewCleanupService with a skipped test.
    • Introduced MockEnvInstanceService to facilitate isolated testing of cleanup logic.
    • Added unit tests for performCleanup covering scenarios like no instances, expired instances, terminated instances, delete errors, and list errors.
    • Added unit tests for isExpired covering various time and TTL formats, including invalid and empty cases.
  • api-service/service/faas_client.go
    • Modified CreateInstanceByFunction to accept a ttl parameter.
    • Updated InitializeFunction to use a new FunctionInitializeOptions struct, allowing TTL to be passed to the FaaS backend.
    • Corrected CreatedAt and UpdatedAt formatting in GetEnvInstance and ListEnvInstances to use time.UnixMilli and time.RFC3339.
    • Populated the TTL field in models.EnvInstance when retrieving instances from the FaaS backend.
    • Refactored ListInstances and GetInstance to use APIInstanceListResponse and APIInstanceResponse structs, respectively, for direct JSON unmarshalling.
  • api-service/service/faas_model/function.go
    • Added a TTL field to the Instance struct.
    • Defined APIInstanceResponse and APIInstanceListResponse structs for structured FaaS API responses.
    • Created FunctionInitializeOptions struct to pass initialization parameters, including DynamicRuntimeName and TTL.
  • api-service/service/schedule_client.go
    • Updated CreatedAt and UpdatedAt formatting in ListEnvInstances to use time.RFC3339.
  • deploy/api-service/templates/deployment.yaml
    • Added --cleanup-interval as a command-line argument to the API service container.
  • deploy/api-service/values.yaml
    • Introduced cleanupInterval configuration with a default value of "5m".
  • deploy/redis/templates/_helpers.tpl
    • Removed the app.kubernetes.io/instance label from redis.selectorLabels.
  • envhub/models/env.go
    • Added GetTTL method to the Env struct to safely extract TTL from DeployConfig, handling various data types.
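
To make "handling various data types" concrete, the following is a hedged sketch of extracting a TTL from a loosely typed deploy config. The `ttl` key, the `ttlSeconds` helper, the (int64, bool) return shape, and the duration-string fallback are all assumptions; the actual `Env.GetTTL` signature is not shown in this conversation.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// ttlSeconds extracts a TTL in seconds from a loosely typed config map.
// The second return value is false when the key is missing or unusable.
func ttlSeconds(deployConfig map[string]interface{}) (int64, bool) {
	raw, ok := deployConfig["ttl"] // hypothetical key name
	if !ok {
		return 0, false
	}
	switch v := raw.(type) {
	case int:
		return int64(v), true
	case int64:
		return v, true
	case float64: // encoding/json decodes numbers into float64
		return int64(v), true
	case string:
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			return n, true
		}
		// Fallback: accept duration strings such as "1h" or "30m".
		if d, err := time.ParseDuration(v); err == nil {
			return int64(d / time.Second), true
		}
	}
	return 0, false
}

func main() {
	fmt.Println(ttlSeconds(map[string]interface{}{"ttl": float64(3600)}))
	fmt.Println(ttlSeconds(map[string]interface{}{"ttl": "1h"}))
	fmt.Println(ttlSeconds(map[string]interface{}{}))
}
```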
Activity
  • Implemented TTL support for environment instances.
  • Created a centralized auto-cleanup manager.
  • Added Prometheus metrics for cleanup operations.
  • Refactored cleanup service architecture.
  • Developed comprehensive unit tests for the cleanup service.
  • Performed manual testing to verify TTL-based environment cleanup.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces TTL support for automatic environment instance cleanup, which is a great feature for managing resource lifecycle. However, a critical security concern has been identified: user-controlled identifiers (like environment names and instance IDs) are used to construct internal API URIs without proper sanitization, posing a risk of path traversal and SSRF against internal services. It is recommended to validate and sanitize all user-supplied identifiers before using them in URI construction. Additionally, please address a potential performance regression in the cleanup logic and a bug in the new unit tests.

Comment on lines +91 to +96
expiredInstance := &models.EnvInstance{
    ID:        "test-instance-1",
    Status:    "Running",
    CreatedAt: "2025-01-01 10:00:00",
    TTL:       "1h",
}

critical

The test for expired instances uses a future date (2025-01-01 10:00:00) for CreatedAt. The isExpired function checks if time.Now() is after the expiration time. Since time.Now() will be before 2025, this instance will be considered active, not expired, and the test will fail. To fix this, please use a date in the past.

This issue also affects other tests that check for expired instances, such as TestPerformCleanupWithDeleteError, TestIsExpiredWithDateTimeFormat, TestIsExpiredWithRFC3339Format, and TestIsExpiredWithVariousTTLDurations.

Suggested change
- expiredInstance := &models.EnvInstance{
-     ID:        "test-instance-1",
-     Status:    "Running",
-     CreatedAt: "2025-01-01 10:00:00",
-     TTL:       "1h",
- }
+ expiredInstance := &models.EnvInstance{
+     ID:        "test-instance-1",
+     Status:    "Running",
+     CreatedAt: "2020-01-01 10:00:00",
+     TTL:       "1h",
+ }

  //}
  // Synchronously call the function
- instanceId, err := c.CreateInstanceByFunction(functionName, dynamicRuntimeName)
+ instanceId, err := c.CreateInstanceByFunction(functionName, dynamicRuntimeName, req.GetTTL())

security-high high

The functionName parameter, which is derived from user-controlled req.Name and req.Version, is passed to CreateInstanceByFunction and eventually used to construct internal API URIs (e.g., in GetFunction and InitializeFunction) without proper sanitization. An attacker could provide a malicious name containing path traversal characters (e.g., ../../) to manipulate the internal API request path and potentially access or trigger actions on unintended internal endpoints.
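
One possible mitigation, sketched under assumptions: validate identifiers against an allow-list pattern before they reach URI construction, and escape them when building the path. The pattern, the `/functions/` path, and the names `validateIdent` and `functionURL` are hypothetical; the real FaaS endpoints and naming rules are not shown in this PR.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// validIdent allows letters, digits, '.', '_' and '-', starting with an
// alphanumeric character. Slashes are rejected, so "../" cannot appear.
var validIdent = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9._-]{0,127}$`)

// validateIdent rejects identifiers that could alter the request path.
func validateIdent(s string) error {
	if !validIdent.MatchString(s) {
		return fmt.Errorf("invalid identifier %q", s)
	}
	return nil
}

// functionURL builds an internal URL for a function name after validation.
// url.PathEscape is a second line of defence should the pattern be loosened.
func functionURL(base, name string) (string, error) {
	if err := validateIdent(name); err != nil {
		return "", err
	}
	return strings.TrimRight(base, "/") + "/functions/" + url.PathEscape(name), nil
}

func main() {
	fmt.Println(functionURL("http://faas.internal/", "my-env-1.0.0"))
	fmt.Println(functionURL("http://faas.internal/", "../../admin"))
}
```

The same check could apply to instance IDs taken from URL paths before they are passed to calls such as GetInstance.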

  ID: instance.InstanceID,
  IP: instance.IP,
- TTL: "", // No TTL field source available yet, can be added later
+ TTL: instance.TTL, // No TTL field source available yet, can be added later

security-high high

The id parameter, which comes from a user-controlled URL path, is used in GetInstance(id) to construct an internal API URI. Lack of sanitization allows an attacker to perform a path traversal attack by providing a crafted ID (e.g., ../../other-endpoint), potentially leading to unauthorized access to internal service functionality.

Comment on lines +88 to +92
envInstances, err := cm.envInstanceService.ListEnvInstances("")
if err != nil {
    log.Printf("Failed to list environment instances: %v", err)
    return
}

high

The current implementation of performCleanup fetches all environment instances and then filters for expired ones on the client side. This could lead to performance issues if the number of instances is large, as it puts a load on both the network and the api-service memory. The previous implementation delegated filtering to the backend (.../pods?filter=expired), which is a more scalable approach. Consider re-introducing server-side filtering to avoid fetching all instances into memory.

@JunJunBot JunJunBot left a comment

lgtm

Collaborator

@JacksonMei JacksonMei left a comment

lgtm

@JacksonMei JacksonMei merged commit 4c90418 into main Feb 27, 2026
2 checks passed
@JacksonMei JacksonMei deleted the sky/api-service-ttl branch February 27, 2026 07:15

3 participants