
[SPARK-56234][CORE] Support disabling log directory scanning by path pattern in SHS#55029

Open
sarutak wants to merge 3 commits into apache:master from sarutak:disable-event-log-scan

Conversation


@sarutak sarutak commented Mar 26, 2026

What changes were proposed in this pull request?

This PR adds a new configuration spark.history.fs.update.scanDisabledPathPatterns that allows disabling initial/periodic log directory scanning by matching directory paths against regular expressions in the Spark History Server.

When a log directory's qualified path matches any of the configured patterns, checkForLogs skips scanning that directory entirely. Applications in scan-disabled directories are not discovered by periodic scanning but can still be loaded on demand via the on-demand loading feature (rolling event logs only). Once accessed by appId, mergeApplicationListing is invoked to populate accurate metadata immediately, and the application is protected from stale entry cleanup in subsequent scan cycles.

Example configuration:

spark.history.fs.logDirectory=hdfs:///spark-logs,s3a://bucket/spark-logs
spark.history.fs.update.scanDisabledPathPatterns=s3a://.*,gs://.*
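The matching semantics described above can be sketched as follows. This is a minimal Java illustration (Spark's implementation is in Scala, but the logic is identical on the JVM); `isScanDisabled` is a hypothetical helper name, not the actual method in the PR:

```java
import java.util.regex.Pattern;

public class ScanDisabledCheck {
    // A log directory is skipped when its fully qualified path matches
    // any of the comma-separated regular expressions in the config.
    static boolean isScanDisabled(String qualifiedPath, String patternsConf) {
        for (String p : patternsConf.split(",")) {
            if (Pattern.matches(p.trim(), qualifiedPath)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String conf = "s3a://.*,gs://.*";
        System.out.println(isScanDisabled("s3a://bucket/spark-logs", conf)); // prints true
        System.out.println(isScanDisabled("hdfs:///spark-logs", conf));      // prints false
    }
}
```

With the example configuration above, the `s3a://` directory would be skipped by `checkForLogs` while the `hdfs://` directory is scanned normally.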

Why are the changes needed?

On object storages such as S3 and GCS, the listStatus API call used during initial/periodic scanning can be expensive. When a log directory contains a large number of event logs, each scan cycle incurs significant cost and latency. While spark.history.fs.update.interval can be increased to reduce scan frequency, the initial scan at SHS startup is still unavoidable.

For users who access applications exclusively by appId, periodic scanning provides no benefit. This change allows such users to disable scanning for specific directories and rely solely on on-demand loading, eliminating listStatus costs entirely.

Using regular expressions provides flexibility to disable scanning by URI scheme (e.g., s3a://.*), by specific path (e.g., .*long-term.*), or any combination.

Does this PR introduce any user-facing change?

Yes. A new configuration is added:

Config: spark.history.fs.update.scanDisabledPathPatterns
Default: (none)
Description: Comma-separated list of regular expressions matched against log directory paths. Directories whose full path matches any pattern will not be scanned.

Caveats when scanning is disabled for a directory:

  • Applications do not appear in the listing until accessed by appId
  • When accessed, accurate metadata (Spark version, user, duration, etc.) is populated immediately via mergeApplicationListing
  • Logs that are never accessed are not subject to the cleaner; use external lifecycle management (e.g., S3 Lifecycle Policies) for those
  • In-progress application state is not automatically updated (re-access updates it)

How was this patch tested?

Added tests in FsHistoryProviderSuite.

Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Opus 4.6


@dongjoon-hyun dongjoon-hyun left a comment


Hi, @sarutak .

It's possible. However, there is one limitation of the On-Demand Loading feature, which I noted when I delivered it.

As I mentioned here, the Spark version is missing until the next scan. If this PR disables the scan, the Spark version field will be missing forever.

<td>
Comma-separated list of URI schemes for which periodic log directory scanning is disabled
(e.g., <code>s3a,gs</code>). Directories with these schemes rely on on-demand loading
(<a href="https://issues.apache.org/jira/browse/SPARK-52914">SPARK-52914</a>, rolling event


We don't need to mention JIRA ID because it's the official Spark feature now.

@dongjoon-hyun

Can we generalize this differently instead of scheme? For example, support a short-term storage location vs a long-term storage location even in the same scheme like the following?

spark.history.fs.logDirectory=s3a://spark-events,s3a://spark-events-long-term


dongjoon-hyun commented Mar 26, 2026

I'm thinking about something like regular expressions to match the path to disable the scan operation. For example, you can also use a regular expression to specify the file system schemes for which to disable scanning, @sarutak.

@sarutak sarutak changed the title [SPARK-56234][CORE] Support disabling initial/periodic scan for specific URI schemes in SHS [SPARK-56234][CORE] Support disabling log directory scanning by path pattern in SHS Mar 26, 2026

sarutak commented Mar 26, 2026

I'm thinking about something like regular expressions to match the path to disable the scan operation. For example, you can also use a regular expression to specify the file system schemes for which to disable scanning, @sarutak.

Thanks for the feedback. Sounds good to me.


sarutak commented Mar 26, 2026

Hi, @sarutak .

It's possible. However, there is one limitation of the On-Demand Loading feature, which I noted when I delivered it.

As I mentioned here, the Spark version is missing until the next scan. If this PR disables the scan, the Spark version field will be missing forever.

Thank you for pointing this out, @dongjoon-hyun. I've addressed this by calling mergeApplicationListing directly when on-demand loading is triggered for a scan-disabled directory.


val SCAN_DISABLED_PATH_PATTERNS =
ConfigBuilder("spark.history.fs.update.scanDisabledPathPatterns")
.doc("Comma-separated list of regular expressions matched against log directory paths. " +


Indentation?

.createOptional

val SCAN_DISABLED_PATH_PATTERNS =
ConfigBuilder("spark.history.fs.update.scanDisabledPathPatterns")


Shall we move this behind spark.history.fs.update.batchSize in order to keep the spark.history.fs.update config group together?

logSourceName, logSourceFullPath))))
// Resolve the full path to find which directory actually contains this log.
// logPath is a relative name (e.g., "eventlog_v2_app1"), so we need to scan
// all directories to find the correct one.


This sounds misleading because this PR title is Support disabling log directory scanning .... Could you revise the comment?

so we need to scan all directories to find the correct one.

try {
EventLogFileReader(dirFs, dirFs.getFileStatus(fullPath)).foreach { reader =>
mergeApplicationListing(reader, clock.getTimeMillis(), enableOptimizations = true)
}


Is this enough to load a single Spark job history correctly? If so, can we remove the else-statement logic (lines 489 ~ 498) completely?

@sarutak (author) replied:

I think mergeApplicationListing is sufficient to load accurate metadata. However, I'd like to keep the else branch (dummy metadata path) for scan-enabled directories to preserve the existing SPARK-52914 behavior. Removing it would change the on-demand loading behavior for all directories, not just scan-disabled ones.

@dongjoon-hyun dongjoon-hyun left a comment


For the disabled directories, does SHS still clean up the logs like the following? This PR seems to disable the cleaner features too. Who is going to manage those directories?

spark.history.fs.cleaner.enabled
spark.history.fs.cleaner.interval
spark.history.fs.cleaner.maxAge
spark.history.fs.cleaner.maxNum

@sarutak
Copy link
Member Author

sarutak commented Mar 27, 2026

For the disabled directories, does SHS still clean up the logs like the following? This PR seems to disable the cleaner features too. Who is going to manage those directories?

spark.history.fs.cleaner.enabled
spark.history.fs.cleaner.interval
spark.history.fs.cleaner.maxAge
spark.history.fs.cleaner.maxNum

For scan-disabled directories, the cleaner behavior depends on whether the application has been accessed:

  • Accessed apps: mergeApplicationListing registers the app in the listing DB, so the cleaner (cleanLogs) will manage those logs normally (maxAge, maxNum).
  • Never-accessed apps: These are not in the listing DB, so the cleaner has no knowledge of them.

For never-accessed logs, external lifecycle management is recommended (e.g., S3 Lifecycle Policies, GCS Object Lifecycle Management). This is documented in the config description and monitoring.md.
