Implement GitHub Repository Integration for LLM Context

This feature lets users pull content from GitHub repositories directly into their LLM prompts. Giving the LLM specific, up-to-date code, documentation, or other text from a repository should improve its understanding, reduce hallucinations, and make its responses more relevant and accurate, especially for code-related assistance.

### 1. Core Concept

*   **GitHub Node:** A new type of node that users can add to their canvas graph. It represents a specific GitHub repository linked by the user.

### 2. GitHub Node Functionality & Lifecycle

The GitHub Node will manage the connection to and content of a specified GitHub repository:

*   **Node Creation & Repository Input:**
    *   Users can add a new "GitHub Repository" node to their canvas.
    *   The node requires a GitHub repository URL as input. This must support both **public** and **private** repositories.
    *   Accepted URL formats: `HTTPS` (e.g., `https://github.com/user/repo.git`) and `SSH` (e.g., `git@github.com:user/repo.git`).
    *   Initial state: The node will be visually marked as "Unpulled" or "Disconnected".
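
As a minimal sketch of the URL input step, the two accepted formats could be recognized with a pair of regular expressions. The names `RepoRef` and `parse_repo_url` are hypothetical, and restricting the host to `github.com` is an assumption made here so that arbitrary hosts are rejected early.

```python
import re
from typing import NamedTuple, Optional


class RepoRef(NamedTuple):
    """Hypothetical parsed form of a repository URL."""
    protocol: str  # "https" or "ssh"
    owner: str
    name: str


# Assumption: only github.com is accepted; any other host is rejected.
_HTTPS_RE = re.compile(r"https://github\.com/([\w.-]+)/([\w.-]+?)(?:\.git)?/?")
_SSH_RE = re.compile(r"git@github\.com:([\w.-]+)/([\w.-]+?)(?:\.git)?")


def parse_repo_url(url: str) -> Optional[RepoRef]:
    """Return owner and repository name, or None if the URL is not accepted."""
    url = url.strip()
    for protocol, pattern in (("https", _HTTPS_RE), ("ssh", _SSH_RE)):
        match = pattern.fullmatch(url)
        if match:
            return RepoRef(protocol, match.group(1), match.group(2))
    return None


print(parse_repo_url("https://github.com/user/repo.git"))
print(parse_repo_url("git@github.com:user/repo.git"))
print(parse_repo_url("https://example.com/user/repo.git"))
```

A node whose URL parses to `None` could stay in the "Unpulled" state and show a validation message instead of attempting a clone.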

*   **Authentication for Private Repositories:**
    *   **Global Settings:** A new section within the application's global settings will be dedicated to "Repository Credentials" or "GitHub SSH Keys".
    *   Users can securely add one or more **SSH private keys**. The system will store these keys encrypted at rest and use them to authenticate when cloning/pulling private repositories.

*   **Pulling Repository Content:**
    *   **Trigger:** A prominent "Pull" button will be present on the GitHub Node's UI. Clicking this button initiates the cloning/pulling process.
    *   **Backend Process:**
        *   The backend will clone the specified GitHub repository to a local directory.
        *   Each cloned repository will be stored in a unique folder, named with a **UUID**. This UUID will be associated with the user, the repository URL, and the node instance in the database to ensure isolation and traceability.
        *   Authentication for private repos will leverage the SSH keys configured in settings.
        *   **Progress Feedback:** The node's UI should display a loading indicator or progress bar during the pulling process.
    *   **Visual Update:** Upon successful completion, the node's visual state updates (e.g., "Pulled," green indicator, last pulled timestamp).
    *   **Error Handling:** If the pull fails (e.g., invalid URL, authentication error, network issue), the node should visually indicate an error state, and provide specific, user-friendly feedback (e.g., "Authentication Failed," "Repository Not Found").
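
The backend clone step might look roughly like the sketch below, which only builds the `git clone` invocation and the UUID-named target directory without running anything. The function name `build_clone_command` and the `clones_root` layout are hypothetical; using a shallow clone (`--depth 1`) is an assumption that full history is not needed.

```python
import shlex
import uuid
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def build_clone_command(
    repo_url: str,
    clones_root: Path,
    ssh_key_path: Optional[Path] = None,
) -> Tuple[List[str], Dict[str, str], Path]:
    """Build (command, extra environment, target directory) for one clone."""
    # Each clone gets its own UUID-named folder for isolation.
    target = clones_root / str(uuid.uuid4())
    # "--" stops option parsing so a URL can never be read as a git flag.
    cmd = ["git", "clone", "--depth", "1", "--", repo_url, str(target)]
    # Never block the backend waiting for an interactive credential prompt.
    env = {"GIT_TERMINAL_PROMPT": "0"}
    if ssh_key_path is not None:
        # Use only the configured key for this clone.
        key = shlex.quote(str(ssh_key_path))
        env["GIT_SSH_COMMAND"] = f"ssh -i {key} -o IdentitiesOnly=yes"
    return cmd, env, target


cmd, env, target = build_clone_command(
    "git@github.com:user/repo.git", Path("/var/lib/app/clones"), Path("/keys/id_ed25519")
)
print(cmd)
print(env)
```

The result could then be passed to something like `subprocess.run(cmd, env={**os.environ, **env}, timeout=..., check=True)`, with a non-zero exit code mapped to the node's error state.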

*   **Content Filtering & Configuration (Advanced Settings Pop-up):**
    *   **Access:** Once a repository has been successfully pulled, a "Configure Content" or "Filter Files" button will appear on the GitHub Node. Clicking this opens a **new modal/pop-up window**.
    *   **Purpose:** This pop-up allows users to precisely control which files and directories from the cloned repository will be considered for LLM context.
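    *   **Filtering Logic:** Implement a filtering mechanism inspired by `.gitignore` syntax. Users define rules that include or exclude files; the examples below suggest that a plain pattern includes matching files and a `!` prefix excludes them (the reverse of `.gitignore`, where plain patterns ignore). Rules can match on: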
        *   **File Names:** `README.md`
        *   **File Extensions:** `*.py`, `!*.min.js`
        *   **Folder Paths:** `src/components/`, `!node_modules/`
        *   **Specific Paths:** `/config/secrets.json`
        *   **Rule Precedence:** Rules defined later should override earlier rules.
    *   **Default Behavior:** If no custom configuration is provided, the system should default to including common text-based source code and documentation files, while excluding common binary files, large data files, and typical build/dependency directories (e.g., `node_modules`, `dist`, `.git`).
    *   **Preview:** The pop-up could show a real-time preview of which files would be included based on the current filtering rules.
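
To make the rule precedence concrete, here is a small sketch of a last-match-wins evaluator. It assumes that a plain pattern includes, a `!` prefix excludes, a trailing `/` marks a directory rule, and a leading `/` anchors a pattern to the repository root; the issue does not pin down these semantics, and `is_included` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def _matches(pattern: str, path: str) -> bool:
    """Check one pattern against a repository-relative POSIX path."""
    if pattern.endswith("/"):
        # Directory rule: match the directory at the root or at any depth.
        prefix = pattern.lstrip("/")
        if pattern.startswith("/"):
            return path.startswith(prefix)
        return ("/" + prefix) in ("/" + path)
    if pattern.startswith("/"):
        # Anchored rule: match against the full path from the root.
        return fnmatchcase(path, pattern[1:])
    if "/" in pattern:
        return fnmatchcase(path, pattern)
    # Bare name or extension rule: match against the file name only.
    return fnmatchcase(path.rsplit("/", 1)[-1], pattern)


def is_included(path: str, rules: Iterable[str], default: bool = False) -> bool:
    """Apply rules in order; the last matching rule decides."""
    decision = default
    for rule in rules:
        exclude = rule.startswith("!")
        pattern = rule[1:] if exclude else rule
        if _matches(pattern, path):
            decision = not exclude
    return decision


rules = ["*.py", "*.js", "!*.min.js", "!node_modules/", "README.md"]
for candidate in ["src/main.py", "static/app.min.js", "node_modules/pkg/index.js", "README.md"]:
    print(candidate, is_included(candidate, rules))
```

The same function could drive the preview in the pop-up by running it over the file list of the cloned repository each time the rules change.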

### 3. LLM Prompt Integration

*   **Mention Syntax in Chat Input:**
    *   In the AI Chat view's text input area, users can reference specific files from a linked GitHub node or a global repository using a dedicated mention syntax:
        `@git:<repo-alias>:<file-path>`
        *   `<repo-alias>`: A user-defined short name for the attached GitHub node or the global repository (e.g., "my-project", "docs").
        *   `<file-path>`: The full path to the file within the repository (e.g., `src/main.py`, `docs/api.md`).
    *   **Autocompletion:** As the user types `@git:`, the system should provide intelligent autocompletion:
        *   First, suggest available repository aliases (from attached nodes and global repos).
        *   Once a `repo-alias` is selected, suggest file paths within that repository, respecting the configured content filters.
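
A minimal sketch of how the mention syntax might be extracted from a chat message follows. The allowed alias characters, the handling of trailing punctuation, and the lack of support for paths containing spaces are all assumptions; `GitMention` and `parse_mentions` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class GitMention(NamedTuple):
    alias: str
    path: Optional[str]  # None means the alias was mentioned without a file


# Assumption: aliases use letters, digits, "_" and "-"; paths contain no spaces.
_MENTION_RE = re.compile(r"@git:([A-Za-z0-9_-]+)(?::(\S+))?")


def parse_mentions(message: str) -> List[GitMention]:
    """Return every @git: mention in the order it appears."""
    mentions = []
    for match in _MENTION_RE.finditer(message):
        alias, path = match.group(1), match.group(2)
        if path is not None:
            # Drop sentence punctuation that happens to follow the path.
            path = path.rstrip(".,;:!?)") or None
        mentions.append(GitMention(alias, path))
    return mentions


message = "Compare @git:my-project:src/main.py with @git:docs:docs/api.md, then summarize @git:docs."
print(parse_mentions(message))
```

A mention with `path=None` corresponds to referencing the repository alias alone, which the backend could treat as a request for the whole filtered repository.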

*   **Global Repositories (Application Settings):**
    *   A new section in the application's global settings will allow users to define "Global GitHub Repositories."
    *   These repositories are always available for mention in *any* canvas or chat, without needing to attach a specific GitHub Node.
    *   Each global repository will also have its own content filtering configuration (similar to node-specific filtering).
    *   Authentication for global private repos will use the SSH keys configured in the global settings.

*   **Backend Prompt Construction:**
    *   **Trigger:** When a user sends a chat message.
    *   **Context Check:** The backend will perform the following checks:
        *   Is a GitHub Node attached to the current chat's Generation Node via an "attachment" handle?
        *   Does the user's message contain `@git:` mentions (referencing either an attached node's repo or a global repo)?
    *   **Content Aggregation:**
        *   For each successfully mentioned file, its content is retrieved from the locally cloned repository.
        *   If the user intends to include the *entire filtered repository* (e.g., by mentioning the repo alias without a specific file, or if an "include all filtered files" option is set in the filtering config), all files respecting the content filters will be concatenated.
    *   **LLM-Friendly Concatenation:** The retrieved file contents must be concatenated into the LLM prompt in a structured, clear format that helps the LLM understand file boundaries and origins.
        *   **Example Format:**
            ```
            --- Start of file: <repo-alias>/<file-path> ---
            <file content>
            --- End of file: <repo-alias>/<file-path> ---
            ```
            This format aids the LLM in differentiating between multiple files.
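
The example format above could be produced by a helper along these lines. This is only a sketch; `format_file` and `format_files` are hypothetical names, and separating files with a blank line is an assumption.

```python
from typing import Iterable, Tuple


def format_file(alias: str, path: str, content: str) -> str:
    """Wrap one file's content in start and end markers."""
    label = f"{alias}/{path}"
    body = content.rstrip("\n")
    return f"--- Start of file: {label} ---\n{body}\n--- End of file: {label} ---"


def format_files(alias: str, files: Iterable[Tuple[str, str]]) -> str:
    """Concatenate (path, content) pairs, separated by a blank line."""
    return "\n\n".join(format_file(alias, path, content) for path, content in files)


print(format_files("my-project", [("src/main.py", "print('hi')\n"), ("README.md", "# Demo\n")]))
```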

*   **Token Management (CRITICAL):**
    *   **Pre-flight Token Calculation:** Before sending the prompt to the LLM, the system must calculate the total token count, including the user's message, aggregated GitHub content, and relevant chat history.
    *   **Handling Token Overages:**
        *   **Prioritization:** A clear truncation order is needed. The user's explicit prompt takes highest priority and is kept intact. Aggregated GitHub content is truncated first, followed by older chat history.
        *   **Truncation:** If the total token count exceeds the LLM's limit, the aggregated GitHub content should be truncated first. This could involve:
            *   Truncating individual files.
            *   Prioritizing smaller files over larger ones.
        *   **User Feedback:** Inform the user if content was truncated due to token limits (e.g., via a toast notification or a message in the chat).
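
One possible shape for the truncation step is sketched below: files are admitted smallest first, and the first file that does not fit is cut to the remaining budget. Token counts are estimated at roughly four characters per token, which is a crude placeholder assumption and not a real tokenizer; `fit_files` is a hypothetical name, and `budget` is assumed to be the model limit minus the tokens already reserved for the user's message and chat history.

```python
from typing import Dict, List, Tuple

CHARS_PER_TOKEN = 4  # rough heuristic; swap in the target model's tokenizer


def estimate_tokens(text: str) -> int:
    """Ceiling of len(text) / CHARS_PER_TOKEN."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fit_files(files: Dict[str, str], budget: int) -> Tuple[Dict[str, str], List[str]]:
    """Return (files that fit, paths that were truncated or dropped)."""
    kept: Dict[str, str] = {}
    affected: List[str] = []
    remaining = max(budget, 0)
    # Smaller files are prioritized over larger ones.
    for path, content in sorted(files.items(), key=lambda item: len(item[1])):
        cost = estimate_tokens(content)
        if cost <= remaining:
            kept[path] = content
            remaining -= cost
        else:
            if remaining > 0:
                # Keep as much of the file as the remaining budget allows.
                kept[path] = content[: remaining * CHARS_PER_TOKEN]
            affected.append(path)
            remaining = 0
    return kept, affected


kept, affected = fit_files({"big.py": "x" * 400, "small.md": "y" * 40}, budget=50)
print({path: len(text) for path, text in kept.items()}, affected)
```

The `affected` list is what the UI could surface in the toast or chat message that tells the user which files were shortened or left out.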

### 4. Advanced User Settings

*   **"Always Resend GitHub Content" (Per Node/Conversation Setting):**
    *   **Location:** This option should be configurable within the settings of each GitHub Node, or potentially as a setting associated with the specific conversation.
    *   **Behavior:**
        *   **Enabled (True):** If enabled, any GitHub content that was included in the *first* turn of a conversation (due to mentions or node attachment) will be automatically re-included in subsequent turns' prompts when the chat history is reconstructed. This ensures persistent context for the LLM but significantly increases token usage and cost.
        *   **Disabled (False - Recommended Default):** If disabled, GitHub content is *only* included for the specific chat turn where it is explicitly mentioned in the user's message. For subsequent turns, only the standard chat history (without the potentially large GitHub content) is sent, saving tokens.

*   **"Auto Pull" (Per Repository Setting):**
    *   **Location:** Configurable for each GitHub Node and for each Global Repository in settings.
    *   **Purpose:** To keep the locally cloned repositories up-to-date with their remote counterparts.
    *   **Careful Design:** This feature must be implemented with extreme care to avoid overloading the system, especially as the number of linked repositories grows.
    *   **Proposed Triggers (Prioritized):**
        *   **On Canvas Load/Open:** The system checks for updates when a user opens a canvas containing a GitHub Node with "Auto Pull" enabled.
        *   **On Mention:** The system checks for updates when a user starts typing an `@git:` mention in the chat input for a repository with "Auto Pull" enabled.
        *   **Manual "Pull" (Existing):** The user can always manually trigger a pull via the node's button.
        *   **(Lower Priority/Future): Periodic Background Check:** If implemented, this should be very infrequent (e.g., every 6-12 hours) and staggered across all repositories to distribute load. This would require a robust background worker service.
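
If the periodic background check is ever built, staggering could be done by giving each repository a fixed offset inside the check interval, derived from a hash of its identifier. The sketch below assumes integer Unix timestamps and a string repository ID (for example the clone's UUID); both function names are hypothetical.

```python
import hashlib


def stagger_offset_seconds(repo_id: str, interval_seconds: int) -> int:
    """Deterministic offset in [0, interval_seconds) for one repository."""
    digest = hashlib.sha256(repo_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


def next_check_time(repo_id: str, now: int, interval_seconds: int) -> int:
    """Next timestamp strictly after `now` at which this repository is checked."""
    offset = stagger_offset_seconds(repo_id, interval_seconds)
    window_start = now - (now % interval_seconds)
    candidate = window_start + offset
    return candidate if candidate > now else candidate + interval_seconds


SIX_HOURS = 6 * 60 * 60
for repo in ["repo-a", "repo-b", "repo-c"]:
    print(repo, stagger_offset_seconds(repo, SIX_HOURS))
```

Because the offset depends only on the repository ID, checks spread across the interval without any shared scheduling state, and a restart of the worker does not bunch them together.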

### 5. Technical Considerations & Edge Cases

*   **Security:**
    *   **SSH Key Management:** Implement robust security measures for storing and accessing SSH private keys (e.g., encrypted at rest, strict access controls, never exposed to frontend).
    *   **SSRF Prevention:** Validate and sanitize all repository URLs to prevent Server-Side Request Forgery attacks.
    *   **Malicious Content:** Implement safeguards against cloning excessively large repositories, infinite redirects, or potentially malicious content (e.g., file size limits per clone, timeouts).
*   **Performance:**
    *   **Cloning Large Repositories:** Provide clear progress feedback. Consider optimizing for shallow clones if full history is not needed.
    *   **File Reading:** Efficiently read and concatenate potentially many files.
    *   **Token Calculation:** Optimize token counting for large context windows.
*   **Storage:**
    *   Plan for local disk space usage for cloned repositories.
    *   Implement a cleanup strategy for inactive or deleted repository clones.
*   **Error Handling:**
    *   Comprehensive error handling for all stages: invalid repo links, authentication failures, network issues, file not found after mention, parsing errors in content.
    *   Clear error messaging to the user in the UI.
*   **UI/UX:**
    *   Clear visual states for the GitHub Node (unpulled, pulling, pulled, error).
    *   Intuitive and responsive autocompletion for `@git:` mentions.
    *   User-friendly interface for the content filtering pop-up.
    *   Visual feedback for token limits and content truncation.
*   **Scalability:** Consider the impact of many users cloning and updating numerous repositories on backend resources.
