
[SPARK-56234][CORE] Support disabling log directory scanning by path pattern in SHS#55029

Open
sarutak wants to merge 3 commits into apache:master from sarutak:disable-event-log-scan

Conversation


@sarutak sarutak commented Mar 26, 2026

What changes were proposed in this pull request?

This PR adds a new configuration spark.history.fs.update.scanDisabledPathPatterns that allows disabling initial/periodic log directory scanning by matching directory paths against regular expressions in the Spark History Server.

When a log directory's qualified path matches any of the configured patterns, checkForLogs skips scanning that directory entirely. Applications in scan-disabled directories are not discovered by periodic scanning but can still be loaded on demand via the on-demand loading feature (rolling event logs only). Once accessed by appId, mergeApplicationListing is invoked to populate accurate metadata immediately, and the application is protected from stale entry cleanup in subsequent scan cycles.

Example configuration:

spark.history.fs.logDirectory=hdfs:///spark-logs,s3a://bucket/spark-logs
spark.history.fs.update.scanDisabledPathPatterns=s3a://.*,gs://.*
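The matching semantics described above can be sketched as follows. This is a minimal Java illustration (Spark's implementation is in Scala, but the logic is identical on the JVM); `isScanDisabled` is a hypothetical helper name, not the actual method in the PR:

```java
import java.util.regex.Pattern;

public class ScanDisabledCheck {
    // A log directory is skipped when its fully qualified path matches
    // any of the comma-separated regular expressions in the config.
    static boolean isScanDisabled(String qualifiedPath, String patternsConf) {
        for (String p : patternsConf.split(",")) {
            if (Pattern.matches(p.trim(), qualifiedPath)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String conf = "s3a://.*,gs://.*";
        System.out.println(isScanDisabled("s3a://bucket/spark-logs", conf)); // prints true
        System.out.println(isScanDisabled("hdfs:///spark-logs", conf));      // prints false
    }
}
```

With the example configuration above, the `s3a://` directory would be skipped by `checkForLogs` while the `hdfs://` directory is scanned normally.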

Why are the changes needed?

On object storages such as S3 and GCS, the listStatus API call used during initial/periodic scanning can be expensive. When a log directory contains a large number of event logs, each scan cycle incurs significant cost and latency. While spark.history.fs.update.interval can be increased to reduce scan frequency, the initial scan at SHS startup is still unavoidable.

For users who access applications exclusively by appId, periodic scanning provides no benefit. This change allows such users to disable scanning for specific directories and rely solely on on-demand loading, eliminating listStatus costs entirely.

Using regular expressions provides flexibility to disable scanning by URI scheme (e.g., s3a://.*), by specific path (e.g., .*long-term.*), or any combination.

Does this PR introduce any user-facing change?

Yes. A new configuration is added:

Config: spark.history.fs.update.scanDisabledPathPatterns
Default: (none)
Description: Comma-separated list of regular expressions matched against log directory paths. Directories whose full path matches any pattern will not be scanned.

Caveats when scanning is disabled for a directory:

  • Applications do not appear in the listing until accessed by appId
  • When accessed, accurate metadata (Spark version, user, duration, etc.) is populated immediately via mergeApplicationListing
  • Logs that are never accessed are not subject to the cleaner; use external lifecycle management (e.g., S3 Lifecycle Policies) for those
  • In-progress application state is not automatically updated (re-access updates it)

How was this patch tested?

Added tests in FsHistoryProviderSuite.

Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Opus 4.6


@dongjoon-hyun dongjoon-hyun left a comment


Hi, @sarutak .

It's possible. However, there is one limitation of the On-Demand Loading feature, which I noted when I delivered it.

As I mentioned here, the Spark version is missing until the next scan. If this PR disables the scan, the Spark version field will be missing forever.

<td>
Comma-separated list of URI schemes for which periodic log directory scanning is disabled
(e.g., <code>s3a,gs</code>). Directories with these schemes rely on on-demand loading
(<a href="https://issues.apache.org/jira/browse/SPARK-52914">SPARK-52914</a>, rolling event


We don't need to mention JIRA ID because it's the official Spark feature now.

@dongjoon-hyun

Can we generalize this differently instead of scheme? For example, support a short-term storage location vs a long-term storage location even in the same scheme like the following?

spark.history.fs.logDirectory=s3a://spark-events,s3a://spark-events-long-term


dongjoon-hyun commented Mar 26, 2026

I'm thinking about something like regular expressions to match the path to disable the scan operation. For example, you can also use a regular expression to specify the file system schemes for which to disable scanning, @sarutak.

@sarutak sarutak changed the title [SPARK-56234][CORE] Support disabling initial/periodic scan for specific URI schemes in SHS [SPARK-56234][CORE] Support disabling log directory scanning by path pattern in SHS Mar 26, 2026

sarutak commented Mar 26, 2026

I'm thinking about something like regular expressions to match the path to disable the scan operation. For example, you can also use a regular expression to specify the file system schemes for which to disable scanning, @sarutak.

Thanks for the feedback. Sounds good to me.


sarutak commented Mar 26, 2026

Hi, @sarutak .

It's possible. However, there is one limitation of the On-Demand Loading feature, which I noted when I delivered it.

As I mentioned here, the Spark version is missing until the next scan. If this PR disables the scan, the Spark version field will be missing forever.

Thank you for pointing this out, @dongjoon-hyun. I've addressed this by calling mergeApplicationListing directly when on-demand loading is triggered for a scan-disabled directory.


val SCAN_DISABLED_PATH_PATTERNS =
ConfigBuilder("spark.history.fs.update.scanDisabledPathPatterns")
.doc("Comma-separated list of regular expressions matched against log directory paths. " +


Indentation?

.createOptional

val SCAN_DISABLED_PATH_PATTERNS =
ConfigBuilder("spark.history.fs.update.scanDisabledPathPatterns")


Shall we move this behind spark.history.fs.update.batchSize in order to keep the spark.history.fs.update config group together?

logSourceName, logSourceFullPath))))
// Resolve the full path to find which directory actually contains this log.
// logPath is a relative name (e.g., "eventlog_v2_app1"), so we need to scan
// all directories to find the correct one.


This sounds misleading because this PR title is Support disabling log directory scanning .... Could you revise the comment?

so we need to scan all directories to find the correct one.

try {
EventLogFileReader(dirFs, dirFs.getFileStatus(fullPath)).foreach { reader =>
mergeApplicationListing(reader, clock.getTimeMillis(), enableOptimizations = true)
}


Is this enough to load a single Spark job history correctly? If so, can we remove the else-statement logic (lines 489 ~ 498) completely?

@sarutak (author) replied:

I think mergeApplicationListing is sufficient to load accurate metadata. However, I'd like to keep the else branch (dummy metadata path) for scan-enabled directories to preserve the existing SPARK-52914 behavior. Removing it would change the on-demand loading behavior for all directories, not just scan-disabled ones.

@dongjoon-hyun dongjoon-hyun left a comment


For the disabled directories, does SHS still clean up the logs like the following? This PR seems to disable the cleaner features too. Who is going to manage those directories?

spark.history.fs.cleaner.enabled
spark.history.fs.cleaner.interval
spark.history.fs.cleaner.maxAge
spark.history.fs.cleaner.maxNum

@sarutak
Copy link
Member Author

sarutak commented Mar 27, 2026

For the disabled directories, does SHS still clean up the logs like the following? This PR seems to disable the cleaner features too. Who is going to manage those directories?

spark.history.fs.cleaner.enabled
spark.history.fs.cleaner.interval
spark.history.fs.cleaner.maxAge
spark.history.fs.cleaner.maxNum

For scan-disabled directories, the cleaner behavior depends on whether the application has been accessed:

  • Accessed apps: mergeApplicationListing registers the app in the listing DB, so the cleaner (cleanLogs) will manage those logs normally (maxAge, maxNum).
  • Never-accessed apps: These are not in the listing DB, so the cleaner has no knowledge of them.

For never-accessed logs, external lifecycle management is recommended (e.g., S3 Lifecycle Policies, GCS Object Lifecycle Management). This is documented in the config description and monitoring.md.
