Harden telemetry delivery and refine recipe metadata#2443
Open
Harden telemetry delivery and refine recipe metadata#2443
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR extends Olive's telemetry pipeline so workflow runs emit a new OliveRecipe signal with redacted recipe metadata, while also tightening cache handling and documenting the updated telemetry behavior. It fits into the workflow, CLI, and telemetry layers that capture how Olive runs recipes across local and CI environments.
Changes:
- Adds recipe-result telemetry from workflow execution and CLI entry points, including recipe hashes, model/source metadata, host/target metadata, and redacted config/package override snapshots.
- Reworks telemetry delivery details, including cache-path overrides, JSON-line cache storage, Windows file locking, unreadable flush preservation, and CI recipe-only filtering.
- Adds focused workflow and telemetry tests, plus privacy-documentation updates.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
test/workflows/test_workflow_run.py |
Adds coverage for recipe-result metadata emitted by workflow runs. |
test/test_telemetry.py |
Adds tests for cache-path overrides, CI filtering, unreadable flush handling, and Windows file locking. |
olive/workflows/run/run.py |
Builds and emits recipe telemetry metadata during workflow execution. |
olive/telemetry/utils.py |
Reworks cross-platform exclusive file locking for telemetry cache access. |
olive/telemetry/telemetry.py |
Adds OliveRecipe, CI filtering, and cache/flush behavior changes. |
olive/telemetry/telemetry_extensions.py |
Adds a helper for logging recipe results. |
olive/telemetry/library/telemetry_logger.py |
Adds locking around singleton logger setup and configurable service name. |
olive/telemetry/library/options.py |
Extends exporter options with service_name. |
olive/telemetry/constants.py |
Updates the encoded telemetry connection string. |
olive/cli/run.py |
Passes workflow-run recipe metadata into telemetry. |
olive/cli/base.py |
Passes generated-CLI recipe metadata from higher-level commands. |
docs/Privacy.md |
Updates the privacy text for the new recipe telemetry behavior. |
Contributor
Author
|
/AzurePipelines help |
Supported commands
See additional documentation. |
Contributor
Author
|
/AzurePipelines run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Replace separate _lock and _callback_condition with a single _condition (threading.Condition) that protects all shared state: _shutdown, _is_flushing, _callbacks_item_count, _events_logged. The two-lock design had a lock order inversion: on_payload_transmitted acquired _lock then _callback_condition, while wait_for_callbacks acquired _callback_condition then _lock (via is_flushing). This could deadlock under concurrent flush + callback scenarios. Using one lock eliminates the ordering issue entirely and simplifies the code — the on_payload_transmitted callback no longer needs nested lock acquisition. Files changed: - olive/telemetry/telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wait_for_callbacks: Hold _condition lock continuously from condition check through wait() to prevent missing notifications. Previously the lock was released between checking and waiting, allowing notify_all() to fire in the gap. TelemetryLogger: Add RLock to __new__ and get_default_logger to prevent race conditions when multiple threads create the singleton simultaneously. Uses RLock because get_default_logger calls __new__ which both need the same lock. Files changed: - olive/telemetry/telemetry.py - olive/telemetry/library/telemetry_logger.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Remove base64 encoding from cache: write raw JSON lines instead. The base64 layer added 33% size overhead and prevented human inspection during debugging with no security benefit (cache dir is user-owned). Cache file is now directly readable. 2. Simplify cache flush: remove the .flush file rename dance (atomic rename, restore on failure, stale file cleanup). For ~5 events/session, simple lock-read-send-delete is sufficient. On failure the cache file persists for next retry. 3. Simplify Telemetry singleton: remove double-checked locking in __new__. The lock is cheap and called once at startup; the outer check saved a lock acquisition but added complexity. Net: -83 lines. Files changed: - olive/telemetry/telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix 1: Remove duplicate callback count increment in on_payload_transmitted when _is_flushing is true. The count was incremented both in the early-return path and in the finally block, causing wait_for_callbacks to think events completed before they actually did. Now only the finally block increments (which always runs, even on return). Fix 2: Replace flush_path.replace(cache_path) with new _restore_flush_file() method that appends flush entries into the cache file instead of overwriting it. This prevents losing events written by another process while a flush was in progress. Files changed: - olive/telemetry/telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Emit a single OliveRecipe event per workflow so recipe usage and failures can be measured without double-counting nested action failures. Keep CI in recipe-only mode, make Olive app naming explicit, update the telemetry ingestion key, and harden the cache/locking path that the new telemetry depends on. Files changed: - docs/Privacy.md - olive/cli/base.py - olive/cli/run.py - olive/telemetry/constants.py - olive/telemetry/library/options.py - olive/telemetry/library/telemetry_logger.py - olive/telemetry/telemetry.py - olive/telemetry/telemetry_extensions.py - olive/telemetry/utils.py - olive/workflows/run/run.py - test/test_telemetry.py - test/workflows/test_workflow_run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace optional runtime imports with importlib-based lookups so the recent telemetry changes stay lint-clean without adding new noqa markers. Keep the focused telemetry tests import-sorted and ready for CI. Files changed: - olive/cli/base.py - olive/workflows/run/run.py - test/test_telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep exporter options focused on transport/export concerns and move service.name defaults into the logger/resource layer where they belong. This keeps Olive's explicit app name override separate from the shared logger fallback and removes unnecessary plumbing. Files changed: - olive/telemetry/library/options.py - olive/telemetry/library/telemetry_logger.py - test/test_telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep Olive''s explicit service-name override in the logger path, but restore the previous shared-library fallback and compatibility behavior so this cleanup does not broaden unrelated API or default changes. Files changed: - olive/telemetry/library/options.py - olive/telemetry/library/telemetry_logger.py - test/test_telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Trim the branch back to telemetry behavior and the minimal plumbing needed to support it. Restore the original local-import patterns in CLI and workflow code, keep only targeted lint suppressions where those restored lines are required, and simplify the telemetry logger app-name plumbing without changing the feature behavior. Files changed: - olive/cli/base.py - olive/cli/run.py - olive/telemetry/library/telemetry_logger.py - olive/workflows/run/run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix the correctness and security issues raised on PR #2441 by handling empty telemetry cache-dir overrides safely, preserving non-empty unreadable flush files instead of deleting them, restoring the legacy .json.flush naming pattern, handling non-pathlike recipe config inputs without masking the original error, and cleaning up the Windows ctypes import pattern for CodeQL. Files changed: - olive/telemetry/telemetry.py - olive/telemetry/utils.py - olive/workflows/run/run.py - test/test_telemetry.py - test/workflows/test_workflow_run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Drop the explicit service.name wiring test from test/test_telemetry.py since it is mostly implementation-detail coverage and does not protect the higher-value telemetry behavior changes on this branch. Files changed: - test/test_telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve the remaining github-advanced-security findings by removing unused PLC0415 noqa markers, keeping the optional-import behavior via minimal import cleanup, and updating the telemetry tests to satisfy the protected-access and consider-using-with lint comments without changing the tested behavior. Files changed: - olive/cli/base.py - olive/cli/run.py - olive/workflows/run/run.py - test/test_telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove dead base64 cache helpers from olive/telemetry/utils.py and move exception formatting next to the telemetry extension code that actually uses it. Keep the Win32 locking and cache-dir logic intact while reducing unrelated utility clutter. Files changed: - olive/telemetry/utils.py - olive/telemetry/telemetry_extensions.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Compute the CI environment flag once during Telemetry initialization and reuse it for recipe-only gating and heartbeat suppression instead of calling the check twice back-to-back. Files changed: - olive/telemetry/telemetry.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep target telemetry fields only for explicitly configured targets, add separate host fields, remove the ambiguous input model name hash, and add redacted config-overrides plus package-config hash metadata so recipe telemetry can show which overrides users actually provide without folding environment-specific package config into recipe_hash. Files changed: - docs/Privacy.md - olive/telemetry/telemetry.py - olive/workflows/run/run.py - test/workflows/test_workflow_run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Log redacted package-config overrides instead of an opaque hash so Olive telemetry captures the specific package settings users changed, while still excluding package_config from recipe_hash and avoiding raw module-path leakage. Files changed: - docs/Privacy.md - olive/telemetry/telemetry.py - olive/workflows/run/run.py - test/workflows/test_workflow_run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Skip Hugging Face and Docker logins when PR secrets are unavailable or unresolved so Azure unit-test jobs do not fail before tests start on fork-style runs, while still preserving normal login behavior when valid credentials are present. Files changed: - .azure_pipelines/job_templates/build-docker-image-template.yaml - .azure_pipelines/job_templates/huggingface-login-template.yaml - .azure_pipelines/job_templates/olive-test-linux-gpu-template.yaml - .azure_pipelines/scripts/run_test.sh Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Revert the Azure pipeline login guard changes because the pipeline behavior should match main and those changes are not necessary for the telemetry PR. Files changed: - .azure_pipelines/job_templates/build-docker-image-template.yaml - .azure_pipelines/job_templates/huggingface-login-template.yaml - .azure_pipelines/job_templates/olive-test-linux-gpu-template.yaml - .azure_pipelines/scripts/run_test.sh Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Avoid duplicate OliveRecipe telemetry from Docker workflows by suppressing inner workflow recipe events and forwarding CI detection into workflow containers. Keep CI telemetry ephemeral by skipping cache setup, and make recipe metadata stable by avoiding filesystem-sensitive model/resource classification. Files changed: - docs/Privacy.md - olive/systems/docker/docker_system.py - olive/telemetry/constants.py - olive/telemetry/telemetry.py - olive/workflows/run/run.py - test/systems/docker/test_docker_system.py - test/test_telemetry.py - test/workflows/test_workflow_run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep CLI workflow imports patchable so existing CLI tests and command mocks still intercept workflow execution. Update CLI test expectations for recipe telemetry metadata, and make the CI-sensitive workflow telemetry assertion deterministic. Files changed: - olive/cli/base.py - olive/cli/run.py - test/cli/test_cli.py - test/workflows/test_workflow_run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep OliveRecipe focused on recipe outcome metadata and use the existing log_error path for workflow exceptions. This avoids duplicating exception fields on recipe events while preserving detailed formatted exception messages in error telemetry. Files changed: - olive/telemetry/telemetry_extensions.py - olive/telemetry/telemetry.py - olive/workflows/run/run.py - test/workflows/test_workflow_run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
e47e0ce to
befbb08
Compare
Keep optional and workflow-heavy imports lazy so generated CLI commands preserve existing import behavior while still reporting recipe telemetry. Files changed: - olive/cli/base.py - olive/cli/run.py - olive/workflows/run/run.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use an explicit run() parameter for the inner Docker workflow runner instead of an environment variable so only the parent workflow emits the recipe result event. Files changed: - olive/workflows/run/run.py - olive/systems/docker/workflow_runner.py - olive/systems/docker/docker_system.py - olive/telemetry/constants.py - test/workflows/test_workflow_run.py - test/systems/docker/test_docker_system.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore local imports for Docker token lookup and telemetry file locking to avoid unnecessary module-level import churn. Files changed: - olive/systems/docker/docker_system.py - olive/telemetry/utils.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Same-repository replacement for #2441 so Azure CI runs with the normal
microsoft/OlivePR secret context.Summary
OliveRecipeand does not register local cache callbacksOliveRecipeby suppressing the inner container recipe event and forwarding CI detection into workflow containersOliveRecipemetadata so target fields are emitted only for explicit targets, host fields are tracked separately, and recipe/config telemetry captures redacted explicit overridespackage_config_overrides, while keepingpackage_configout ofrecipe_hashWhy
We want telemetry to be reliable under cache contention and offline retry, while keeping CI runs temporal and recipe-only. We also want the recipe signal to be useful for analyzing explicit overrides without mixing host/target semantics, relying on opaque hashes, or depending on local filesystem state.
Validation
lintrunner f -a -m mainpython -m compileall olive\\telemetry\\constants.py olive\\telemetry\\telemetry.py olive\\workflows\\run\\run.py olive\\systems\\docker\\docker_system.py test\\test_telemetry.py test\\workflows\\test_workflow_run.py test\\systems\\docker\\test_docker_system.pypytest test\\test_telemetry.py test\\workflows\\test_workflow_run.py test\\systems\\docker\\test_docker_system.py -q.azure_pipelinesand.githubdo not setOLIVE_DISABLE_TELEMETRY,disable_telemetry, or--disable_telemetry