[SPARK-56234][CORE] Support disabling log directory scanning by path pattern in SHS#55029
sarutak wants to merge 3 commits into apache:master
Conversation
dongjoon-hyun
left a comment
Hi, @sarutak.
It's possible. However, there was one limitation in the On-Demand Loading feature when I delivered it.
As I mentioned here, the Spark version is missing until the next scan. If this PR disables the scan, the Spark version field will be missing forever.
docs/monitoring.md (Outdated)

    <td>
    Comma-separated list of URI schemes for which periodic log directory scanning is disabled
    (e.g., <code>s3a,gs</code>). Directories with these schemes rely on on-demand loading
    (<a href="https://issues.apache.org/jira/browse/SPARK-52914">SPARK-52914</a>, rolling event
We don't need to mention JIRA ID because it's the official Spark feature now.
Can we generalize this differently instead?
I'm thinking about something like regular expressions to match the path and disable the scan operation. For example, you could also use a regular expression to specify the file system schemes to disable scanning, @sarutak.
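The suggestion above can be sketched as follows. This is a hypothetical illustration, not code from this PR: the object name and patterns are made up, and it only shows that a regex list subsumes a scheme list, since a scheme can be expressed as a pattern like `s3a://.*`.

```scala
// Hypothetical sketch: one regex list covers both "disable by URI scheme"
// and "disable by path fragment", which a comma-separated scheme list cannot.
object SchemeRegexSketch {
  val patterns: Seq[scala.util.matching.Regex] =
    Seq("s3a://.*", ".*long-term.*").map(_.r)

  // A directory is scan-disabled if its full qualified path matches any pattern.
  def scanDisabled(path: String): Boolean =
    patterns.exists(_.pattern.matcher(path).matches())
}
```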
Thanks for the feedback. Sounds good to me.
Thank you for pointing this out, @dongjoon-hyun. I've addressed this by calling `mergeApplicationListing` when an application in a scan-disabled directory is first accessed, so accurate metadata is populated immediately.
    val SCAN_DISABLED_PATH_PATTERNS =
      ConfigBuilder("spark.history.fs.update.scanDisabledPathPatterns")
        .doc("Comma-separated list of regular expressions matched against log directory paths. " +
        .createOptional
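A sketch of how an optional, comma-separated regex config such as `spark.history.fs.update.scanDisabledPathPatterns` could be consumed. The object and method names here are illustrative, not this PR's actual implementation:

```scala
import scala.util.matching.Regex

// Hypothetical helper: split the comma-separated config value and
// pre-compile each entry, so invalid patterns fail fast at startup.
object ScanDisabledConf {
  // createOptional yields Option[String]; an unset config disables nothing.
  def compile(confValue: Option[String]): Seq[Regex] =
    confValue.toSeq
      .flatMap(_.split(","))
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.r)

  def isScanDisabled(path: String, patterns: Seq[Regex]): Boolean =
    patterns.exists(_.pattern.matcher(path).matches())
}
```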
    val SCAN_DISABLED_PATH_PATTERNS =
      ConfigBuilder("spark.history.fs.update.scanDisabledPathPatterns")
Shall we move this behind spark.history.fs.update.batchSize in order to keep the spark.history.fs.update config group together?
    logSourceName, logSourceFullPath))))
    // Resolve the full path to find which directory actually contains this log.
    // logPath is a relative name (e.g., "eventlog_v2_app1"), so we need to scan
    // all directories to find the correct one.
This sounds misleading, because this PR's title is "Support disabling log directory scanning ...". Could you revise the comment?

> so we need to scan all directories to find the correct one.
    try {
      EventLogFileReader(dirFs, dirFs.getFileStatus(fullPath)).foreach { reader =>
        mergeApplicationListing(reader, clock.getTimeMillis(), enableOptimizations = true)
      }
Is this enough to load a single Spark job history correctly? If so, can we remove the else statement logic (lines 489 ~ 498) completely?
I think mergeApplicationListing is sufficient to load accurate metadata. However, I'd like to keep the else branch (dummy metadata path) for scan-enabled directories to preserve the existing SPARK-52914 behavior. Removing it would change the on-demand loading behavior for all directories, not just scan-disabled ones.
dongjoon-hyun
left a comment
For the disabled directories, does SHS still clean up the logs like the following? This PR seems to disable the cleaner features too. Who is going to manage those directories?
spark.history.fs.cleaner.enabled
spark.history.fs.cleaner.interval
spark.history.fs.cleaner.maxAge
spark.history.fs.cleaner.maxNum
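For context, the cleaner settings in question are set like this (values are illustrative only, not Spark's defaults for every key):

```properties
# Illustrative Spark History Server cleaner configuration.
spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.interval=1d
spark.history.fs.cleaner.maxAge=7d
spark.history.fs.cleaner.maxNum=2147483647
```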
For scan-disabled directories, the cleaner behavior depends on whether the application has been accessed:
For never-accessed logs, external lifecycle management is recommended (e.g., S3 Lifecycle Policies, GCS Object Lifecycle Management). This is documented in the config description.
What changes were proposed in this pull request?

This PR adds a new configuration `spark.history.fs.update.scanDisabledPathPatterns` that allows disabling initial/periodic log directory scanning in the Spark History Server by matching directory paths against regular expressions.

When a log directory's qualified path matches any of the configured patterns, `checkForLogs` skips scanning that directory entirely. Applications in scan-disabled directories are not discovered by periodic scanning but can still be loaded on demand via the on-demand loading feature (rolling event logs only). Once accessed by appId, `mergeApplicationListing` is invoked to populate accurate metadata immediately, and the application is protected from stale entry cleanup in subsequent scan cycles.

Example configuration:
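A minimal illustration of such a configuration, using the sample patterns mentioned elsewhere in this description (values are illustrative, not a recommendation):

```properties
# Skip scanning for all s3a:// directories and any path containing "long-term";
# applications there are loaded on demand by appId instead.
spark.history.fs.update.scanDisabledPathPatterns=s3a://.*,.*long-term.*
```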
Why are the changes needed?

On object storages such as S3 and GCS, the `listStatus` API call used during initial/periodic scanning can be expensive. When a log directory contains a large number of event logs, each scan cycle incurs significant cost and latency. While `spark.history.fs.update.interval` can be increased to reduce scan frequency, the initial scan at SHS startup is still unavoidable.

For users who access applications exclusively by appId, periodic scanning provides no benefit. This change allows such users to disable scanning for specific directories and rely solely on on-demand loading, eliminating `listStatus` costs entirely.

Using regular expressions provides the flexibility to disable scanning by URI scheme (e.g., `s3a://.*`), by specific path (e.g., `.*long-term.*`), or any combination.

Does this PR introduce any user-facing change?

Yes. A new configuration is added: `spark.history.fs.update.scanDisabledPathPatterns`.

Caveats when scanning is disabled for a directory: applications are not discovered by periodic scanning and are loaded via `mergeApplicationListing` only when first accessed by appId.

How was this patch tested?

Added tests in `FsHistoryProviderSuite`.

Was this patch authored or co-authored using generative AI tooling?
Kiro CLI / Opus 4.6