…letion system

Addresses performance issues where dataflow analysis gets stuck in HeldTaskCompletion.completeHeldTasks() for 10+ hours, particularly in large Python codebases with complex interprocedural dependencies.

## Key Changes

### HeldTaskCompletion.scala
- Added detailed logging for fixed-point iteration progress and timing
- Circuit breaker protection: max 1000 iterations to prevent infinite loops
- Performance warnings for slow iterations (>1 minute) and operations (>30 seconds)
- Sample held task information logging (sink types, call depths, source paths)
- Memory usage estimation and collection size monitoring

### deduplicateTableEntries() Enhancement
- Granular timing breakdown for groupBy, sortBy, and tie-breaking operations
- Large collection size warnings (>10,000 entries) with performance impact analysis
- Identification of largest groups causing hash computation bottlenecks
- Reduction ratio tracking and memory usage estimates

### Engine.scala
- Enhanced backwards() method logging with source-sink context
- Task submission ratio tracking (held vs. executed tasks)
- Sample source/sink node information for debugging problematic combinations
- Performance warnings for slow analysis (>2 minutes)

### ExtendedCfgNode.scala
- Query scale warnings for large source×sink combinations (>100,000)
- Sample source/sink logging for identifying problematic node combinations
- Total analysis timing and result count tracking
- Early warning system for potentially expensive queries

## Performance Monitoring Features
- **Multi-level thresholds**: Debug, Info, and Warn levels for different performance characteristics
- **Circuit breakers**: Prevent runaway processes with configurable limits
- **Memory monitoring**: Estimate memory usage for large collections
- **Progress tracking**: Monitor convergence in fixed-point iterations
- **Bottleneck identification**: Pinpoint exact operations causing slowdowns

## Debugging Capabilities
- Identify which specific source-sink combinations cause performance issues
- Track held task processing patterns and result set sizes
- Monitor memory usage patterns during deduplication operations
- Analyze iteration convergence behavior in pathological cases

This logging system provides the visibility needed to diagnose and resolve the exponential performance degradation observed in large-scale dataflow analysis scenarios.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
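The circuit breaker and slow-iteration warnings described for `completeHeldTasks()` follow a common pattern for guarding fixed-point loops. A minimal sketch of that pattern is shown below; `FixedPointGuard`, the `step` callback, and the `println`-based logging are illustrative stand-ins, not the actual Joern implementation, whose loop body and types are far more involved.

```scala
// Hypothetical sketch: a fixed-point loop with an iteration circuit breaker
// and a slow-iteration warning, mirroring the thresholds in the commit
// message (max 1000 iterations, warn when a pass exceeds 1 minute).
object FixedPointGuard {
  private val MaxIterations   = 1000        // circuit breaker limit
  private val SlowIterationMs = 60 * 1000L  // warn threshold per pass

  /** Runs `step` (which returns true while the solution is still changing)
    * until convergence or until the circuit breaker trips.
    * Returns the number of iterations performed. */
  def run(step: Int => Boolean): Int = {
    var iteration = 0
    var changed   = true
    while (changed && iteration < MaxIterations) {
      val start = System.currentTimeMillis()
      changed = step(iteration)
      val elapsed = System.currentTimeMillis() - start
      if (elapsed > SlowIterationMs)
        println(s"WARN: iteration $iteration took ${elapsed}ms")
      iteration += 1
    }
    if (iteration >= MaxIterations)
      println(s"WARN: circuit breaker tripped after $iteration iterations")
    iteration
  }
}
```

The point of returning the iteration count (rather than looping silently) is that convergence behavior itself becomes observable, which is exactly what the progress-tracking logging above is for.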
Value classes cannot have instance fields, so the logger was moved to a companion object. This resolves the compilation error: "Value classes may not define non-parameter field".

Changes:
- Removed the instance logger field from the ExtendedCfgNode value class
- Added a companion object with a static logger
- Updated all logger references to use ExtendedCfgNode.logger

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
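The companion-object workaround looks roughly like the sketch below. A Scala value class (one that extends `AnyVal`) may carry only its single constructor parameter as state, so any shared helper such as a logger must live outside the instance. The class body here is simplified (a `String` stands in for the real CFG node type, and `java.util.logging` stands in for whatever logging framework the project actually uses).

```scala
import java.util.logging.Logger

// Simplified value class: `node` is the single allowed constructor
// parameter; no other instance fields are permitted.
class ExtendedCfgNode(val node: String) extends AnyVal {
  def describe(): String = {
    // References the companion's static logger instead of an instance field
    ExtendedCfgNode.logger.fine(s"describing $node")
    s"node: $node"
  }
}

object ExtendedCfgNode {
  // "Value classes may not define non-parameter field" — so the logger is
  // a companion-object member, shared across all instances.
  val logger: Logger = Logger.getLogger("ExtendedCfgNode")
}
```

Because the logger is effectively static anyway (one per class, not per node), this change costs nothing at runtime; it only relocates where the field is declared.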
… node logging

This commit addresses the 10+ hour performance bottleneck in Python repository scanning by implementing comprehensive parallelization and enhanced debugging capabilities.

Key improvements:
- Remove the circuit breaker from HeldTaskCompletion, as analysis showed the bottleneck was in deduplicateTableEntries(), not the main loop
- Add multi-level parallelization: parallel table deduplication, parallel groupBy operations, and parallel group processing
- Implement hash key caching to avoid repeated expensive calculations during deduplication
- Add configurable performance thresholds via system properties for optimal tuning
- Replace unhelpful class name/ID logging with meaningful source code context (variable names, code snippets, file locations)
- Add comprehensive performance monitoring with timing logs and bottleneck identification
- Enhance memory usage warnings for large collection processing

Performance features:
- Configurable thresholds: joern.dataflow.parallel.table.threshold, joern.dataflow.parallel.dedup.threshold, joern.dataflow.parallel.groups.threshold
- Intelligent processing mode selection (PARALLEL vs. SEQUENTIAL) based on data size
- Enhanced logging shows meaningful context: Identifier'username' [username] @ user.py:42

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
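The system-property thresholds and PARALLEL/SEQUENTIAL mode selection can be sketched as follows. The property name `joern.dataflow.parallel.dedup.threshold` is taken from the commit message; the default value of 10,000 and the `DedupConfig` object are illustrative assumptions, not the project's actual defaults.

```scala
// Hedged sketch: read a tuning threshold from a JVM system property
// (falling back to an assumed default) and pick a processing mode
// based on collection size.
object DedupConfig {
  private def threshold(prop: String, default: Int): Int =
    sys.props.get(prop).flatMap(_.toIntOption).getOrElse(default)

  // Default of 10,000 is an illustrative assumption
  def dedupThreshold: Int =
    threshold("joern.dataflow.parallel.dedup.threshold", 10000)

  /** Selects the processing mode for a collection of the given size. */
  def mode(size: Int): String =
    if (size >= dedupThreshold) "PARALLEL" else "SEQUENTIAL"
}
```

In practice this would be set at launch, e.g. `java -Djoern.dataflow.parallel.dedup.threshold=50000 …`, letting operators raise or lower the parallelization cutover without recompiling.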
Summary by Bito
This pull request enhances the logging mechanisms in the dataflow analysis components, including the `ExtendedCfgNode`, `Engine`, and `HeldTaskCompletion` classes. It introduces detailed logging for task management, sources, sinks, and deduplication, improving observability and facilitating debugging, performance monitoring, and optimization.