
Enhance telemetry performance #60

Merged
safaricd merged 3 commits into main from PRI-217 on Feb 13, 2026

Conversation

@safaricd (Collaborator) commented Feb 13, 2026

Change Description

Previous state

  • Telemetry configs were downloaded with cache busting on every request.
  • TTL for caching the telemetry config was set to only 5 minutes.
  • No timeout on downloading telemetry config from the bucket via the CDN.
  • GPU availability was determined by importing and re-initializing torch.
  • Dependency versions were determined by importing them.

Changes

  • Telemetry config download times out after 0.25 seconds and falls back to the defaults, with telemetry enabled by default.
  • The cache for the downloaded config is never busted.
  • NVIDIA GPUs are detected using the low-level nvidia-ml-py (NVML) bindings.
  • torch is only imported when absolutely necessary, as a third step and a fallback (see the sketch below).
  • Assume torch is already available in sys.modules, because tabpfn imports it.
  • Dependency versions are read from package metadata instead of by importing the packages.
  • Fall back to the already-initialized torch from sys.modules, if available.
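
Taken together, the new detection order is roughly the sketch below. The names _get_gpu_type and _get_torch_gpu_type come from the diff; the bodies here are illustrative only, not the merged code, and the CUDA/MPS checks in the torch fallback are assumptions about that path.

```python
import sys
from functools import lru_cache
from typing import Optional


def _get_gpu_type() -> Optional[str]:
    """Detect the GPU type, preferring the lightweight NVML path over torch."""
    # Step 1: NVML via nvidia-ml-py -- fast, no torch import required.
    try:
        import pynvml

        pynvml.nvmlInit()
        try:
            if pynvml.nvmlDeviceGetCount() > 0:
                name = pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(0))
                return name.decode() if isinstance(name, bytes) else name
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        pass  # NVML not installed or no NVIDIA driver; fall through to torch.

    # Steps 2 and 3: torch from sys.modules first, a fresh import only as a last resort.
    return _get_torch_gpu_type()


@lru_cache(maxsize=1)
def _get_torch_gpu_type() -> Optional[str]:
    """Report the GPU type via torch, importing it only if it is not already loaded."""
    torch = sys.modules.get("torch")
    if torch is None:
        try:
            import torch  # last resort: pay the import cost once
        except ImportError:
            return None
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(0)
    mps = getattr(torch.backends, "mps", None)  # Apple Silicon, if this torch build has it
    if mps is not None and mps.is_available():
        return "mps"
    return None
```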

Results
The following table shows the performance difference with and without telemetry.

Phase   | Diff Mean | % Overhead
fit     | 0.0026s   | 0.95%
predict | -0.0007s  | -0.07%

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. Credits must be used to enable repository-wide code reviews.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @safaricd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances telemetry performance by optimizing how configuration files are downloaded and cached, and by streamlining the detection of GPU types and retrieval of dependency versions. These changes aim to reduce overhead and improve the responsiveness of the application, particularly in scenarios involving GPU detection and package information lookup, without compromising the accuracy of the collected data.

Highlights

  • Telemetry Configuration Download: The telemetry configuration download process has been optimized by removing cache busting, increasing the cache TTL to 60 minutes, and adding a short timeout of 0.25 seconds to prevent blocking the main thread. The default telemetry state is now enabled.
  • GPU Detection Method: GPU detection has been improved for performance by prioritizing the use of the nvidia-ml-py library for NVIDIA GPUs, which is faster than importing and initializing torch. A fallback to torch is still available, but it now checks sys.modules first to avoid re-importing.
  • Dependency Version Retrieval: Dependency versions are now primarily retrieved using importlib.metadata.version from package metadata, which avoids the overhead of importing the entire package. A fallback to importing the module and checking __version__ is provided if metadata is not found (sketched below).
  • New Dependency Added: The nvidia-ml-py library has been added as a new dependency to facilitate faster and more efficient NVIDIA GPU detection.
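
A minimal sketch of that metadata-first lookup, assuming a helper shaped like the _get_package_version mentioned in the changelog below (the dist_name parameter and the exact fallback behavior are illustrative assumptions):

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def _get_package_version(module_name: str, dist_name: Optional[str] = None) -> Optional[str]:
    """Read a dependency's version from installed metadata, importing it only as a fallback."""
    # Preferred path: metadata lookup, which never imports the (possibly heavy) package.
    # Note that the distribution name can differ from the module name,
    # e.g. the module "sklearn" is distributed as "scikit-learn".
    try:
        return version(dist_name or module_name)
    except PackageNotFoundError:
        pass

    # Fallback: import the module and read __version__ if metadata is unavailable.
    try:
        return getattr(import_module(module_name), "__version__", None)
    except ImportError:
        return None
```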


Changelog
  • pyproject.toml
    • Added nvidia-ml-py to the project's dependencies.
  • src/tabpfn_common_utils/telemetry/core/config.py
    • Removed datetime and timezone imports as they are no longer needed for cache busting.
    • Increased the ttl_cache duration for download_config from 5 minutes to 60 minutes.
    • Removed the timestamp parameter from the requests.get call to prevent cache busting.
    • Changed the default enabled state in download_config from False to True.
    • Added a timeout of 0.25 seconds to the requests.get call for telemetry configuration download (see the config sketch after this changelog).
  • src/tabpfn_common_utils/telemetry/core/events.py
    • Imported importlib, importlib.metadata, PackageNotFoundError, and pynvml for improved package and GPU detection.
    • Modified _get_sklearn_version to correctly use scikit-learn as the distribution name for version retrieval.
    • Refactored _get_gpu_type to first attempt GPU detection using pynvml for NVIDIA GPUs, then fall back to a new _get_torch_gpu_type function.
    • Introduced _get_torch_gpu_type to handle PyTorch-based GPU detection, checking sys.modules for an existing torch import before attempting a new import.
    • Updated _get_package_version to primarily use importlib.metadata.version for version retrieval, with a fallback to importing the module if metadata is unavailable.
  • uv.lock
    • Added nvidia-ml-py package details, including its version, source, and wheel information.
    • Included nvidia-ml-py in the dependencies and requires-dist sections of the lock file.
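
Pulling the config.py entries together, the download path behaves roughly like the sketch below. Only the 60-minute TTL, the 0.25-second timeout, the removed cache-busting parameter, and the enabled-by-default fallback come from the changelog; the URL, the default dict, and the use of cachetools' ttl_cache decorator are assumptions for illustration.

```python
import requests
from cachetools.func import ttl_cache

# Hypothetical URL and defaults for illustration; the real values live in config.py.
_CONFIG_URL = "https://example-cdn.invalid/telemetry-config.json"
_DEFAULT_CONFIG = {"enabled": True}  # telemetry now defaults to enabled


@ttl_cache(maxsize=1, ttl=60 * 60)  # cache the downloaded config for 60 minutes
def download_config() -> dict:
    """Fetch the telemetry config without ever blocking the caller for long."""
    try:
        # No cache-busting timestamp parameter, and a hard 0.25s timeout so a slow
        # or unreachable CDN can never stall fit/predict.
        response = requests.get(_CONFIG_URL, timeout=0.25)
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError):
        # Any network or parsing problem falls back to the defaults instead of raising.
        return dict(_DEFAULT_CONFIG)
```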

@safaricd (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request introduces several well-motivated performance enhancements to the telemetry system. The changes, such as optimizing dependency version retrieval, improving GPU detection with nvidia-ml-py, and refining the configuration download process, are clear and effective. My review includes a few suggestions to improve exception handling by making it more specific, which will enhance robustness and debuggability. I also noted a minor docstring inconsistency. Overall, these are excellent improvements.

@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request significantly enhances the performance of the telemetry system. The changes are well-thought-out, focusing on reducing overhead by avoiding costly imports and using more efficient methods for data collection. Key improvements include:

  • Caching telemetry configuration for a longer duration and removing cache-busting to improve network performance.
  • Introducing a timeout for fetching the telemetry config to prevent blocking.
  • Using the lightweight nvidia-ml-py library for faster GPU detection, with a fallback to torch.
  • Leveraging importlib.metadata to get package versions without importing the packages themselves.

The code is cleaner and more robust. I have a few suggestions to further improve error handling and documentation consistency.

Quoted diff (the NVML device-name listing):

nvmlDeviceGetName(nvmlDeviceGetHandleByIndex(i)) for i in range(counts)
]

# Because NVML runs very fast, we just return the device name


Still worth caching, since it will run on every event?

@safaricd (Collaborator, Author)

I was thinking the same thing; however, in some rare cases GPUs might be attached to or detached from a VM, so we'd have to cache this information on disk with a TTL. Anyway, given that NVML runs within 20-30 milliseconds, it's not really worth it ATM.



Quoted diff (the cached torch fallback):

@lru_cache(maxsize=1)
def _get_torch_gpu_type() -> Optional[str]:


I wonder whether we could get this information eagerly at import time instead of lazily at event creation. Maybe we could even get the info straight from tabpfn.

@safaricd (Collaborator, Author)

We might - an interesting area to explore in the future.

safaricd merged commit a4d0d02 into main on Feb 13, 2026
9 checks passed
