
HDDS-14913. Implement Scalable CSV Export for Unhealthy Containers in Recon UI.#10162

Draft

ArafatKhan2198 wants to merge 8 commits into apache:master from ArafatKhan2198:csvExport2

Conversation

ArafatKhan2198 (Contributor) commented Apr 30, 2026:

What changes were proposed in this pull request?

The Recon UI had no way for administrators to export unhealthy container data (Missing, Under-Replicated, Over-Replicated, etc.) at scale. For clusters with millions of containers, any streaming export over a long-running HTTP connection would be killed by network infrastructure (firewalls, load balancers, proxies) before completion.


Solution: Asynchronous Background Export with Queue

Instead of streaming data directly to the browser, this PR implements a server-side background job system that:

  1. Builds the export on the Recon node itself
  2. Splits large exports into 500K-record CSV chunks
  3. Archives them into a single TAR file
  4. Lets the user download the TAR from the browser when ready
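Steps 2 and 3 can be sketched with a small chunk writer. This is an illustrative reduction, not the PR's actual code: the class name, the `writeChunks` helper, and the tiny chunk size used below are assumptions for demonstration; the real implementation uses 500K-record chunks and adds TAR archiving on top.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: split a stream of CSV rows into fixed-size part files
// named part001.csv, part002.csv, ... as described in the PR.
public class ChunkedCsvWriter {

  // Writes rows into numbered part files, each holding at most chunkSize rows,
  // and returns the part files in creation order.
  public static List<Path> writeChunks(Iterator<String> rows, Path dir, int chunkSize)
      throws IOException {
    Files.createDirectories(dir);
    List<Path> parts = new ArrayList<>();
    List<String> buffer = new ArrayList<>(chunkSize);
    int fileIndex = 1;
    while (rows.hasNext()) {
      buffer.add(rows.next());
      // Flush when the chunk is full or the stream is exhausted.
      if (buffer.size() == chunkSize || !rows.hasNext()) {
        Path part = dir.resolve(String.format("part%03d.csv", fileIndex++));
        Files.write(part, buffer);
        parts.add(part);
        buffer.clear();
      }
    }
    return parts;
  }
}
```

With a chunk size of 4, ten rows produce three part files; at the PR's 500K chunk size, the 3,040,000-record test run below produces the seven files seen in the logs.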

Backend Changes

New: ExportJob model (ExportJob.java)

A data class representing one export job with fields:

  • jobId (UUID), userId, state (container state), status (QUEUED → RUNNING → COMPLETED/FAILED)
  • queuePosition, totalRecords, estimatedTotal, progressPercent
  • filePath (path to TAR on disk), submittedAt, startedAt, completedAt, errorMessage
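A condensed, hypothetical rendering of this data class is sketched below. The field, status, and method names mirror the PR description; the constructor shape, setters, and derived-progress logic are illustrative assumptions.

```java
import java.time.Instant;
import java.util.UUID;

// Illustrative reduction of the ExportJob model described above.
public class ExportJob {
  public enum Status { QUEUED, RUNNING, COMPLETED, FAILED }

  private final String jobId = UUID.randomUUID().toString();
  private final String state;              // container state, e.g. "MISSING" (assumed value)
  private volatile Status status = Status.QUEUED;
  private volatile long totalRecords;      // rows written so far, updated live
  private volatile long estimatedTotal;    // COUNT(*) taken before the cursor opens
  private final Instant submittedAt = Instant.now();

  public ExportJob(String state) { this.state = state; }

  public void setStatus(Status s) { this.status = s; }
  public void setEstimatedTotal(long n) { this.estimatedTotal = n; }
  public void setTotalRecords(long n) { this.totalRecords = n; }

  // progressPercent is derived, not stored: live rows over the up-front estimate.
  public int getProgressPercent() {
    return estimatedTotal == 0 ? 0
        : (int) Math.min(100, 100 * totalRecords / estimatedTotal);
  }

  public String getJobId() { return jobId; }
  public Status getStatus() { return status; }
}
```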

New: ExportJobManager.java — the core engine

A Guice Singleton that runs for the lifetime of the Recon server:

  • Single-threaded executor — one export runs at a time, eliminating concurrent Derby database access
  • Global queue (max 4 jobs) — incoming requests beyond the limit return HTTP 429
  • 3-second cooldown between jobs (on the worker thread, transparent to users)
  • CSV splitting — every 500K records creates a new part file (e.g., part001.csv, part002.csv)
  • TAR archiving — all part files are archived using Archiver.create() into export_{state}_{userId}_{shortJobId}.tar
  • Progress tracking — runs a COUNT(*) before the cursor opens to calculate estimatedTotal; totalRecords increments live
  • Cleanup — temp CSV files and their directory are deleted after TAR is created
  • Synchronized submitJob() — prevents race conditions when multiple users submit simultaneously
  • getQueuePosition() — walks LinkedHashMap (insertion-order) to return 1-indexed position
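The queueing rules above can be sketched as follows; this is a minimal illustration, not the PR's actual class. It assumes a hypothetical `ExportQueue` wrapper and shows the single-lock pattern (the jobQueue itself), the bounded queue whose overflow maps to HTTP 429, and the 1-indexed insertion-order position walk.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of the ExportJobManager queueing rules.
public class ExportQueue {
  private final int maxQueueSize;
  // LinkedHashMap preserves insertion order, which getQueuePosition relies on.
  private final Map<String, String> jobQueue = new LinkedHashMap<>(); // jobId -> state

  public ExportQueue(int maxQueueSize) { this.maxQueueSize = maxQueueSize; }

  // Returns the new jobId, or null when the queue is full
  // (the caller maps null to an HTTP 429 response).
  public String submitJob(String state) {
    synchronized (jobQueue) {
      if (jobQueue.size() >= maxQueueSize) {
        return null;
      }
      String jobId = UUID.randomUUID().toString();
      jobQueue.put(jobId, state);
      return jobId;
    }
  }

  // Walks the map in insertion order; 1-indexed, 0 if the job is absent.
  public int getQueuePosition(String jobId) {
    synchronized (jobQueue) {
      int pos = 1;
      for (String id : jobQueue.keySet()) {
        if (id.equals(jobId)) {
          return pos;
        }
        pos++;
      }
      return 0;
    }
  }
}
```

Note that everything synchronizes on the one `jobQueue` monitor; the review discussion below covers why mixing this with `synchronized` methods risks lock-order deadlocks.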

ContainerEndpoint.java — new REST endpoints

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/containers/unhealthy/export | Submit a new export job |
| GET | /api/v1/containers/unhealthy/export | List all jobs (new) |
| GET | /api/v1/containers/unhealthy/export/{jobId} | Get one job's status |
| GET | /api/v1/containers/unhealthy/export/{jobId}/download | Stream the TAR to the browser |
| DELETE | /api/v1/containers/unhealthy/export/{jobId} | Cancel a job |

Queue-full (429) errors return JSON instead of Jetty's HTML error page.
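A sketch of what such a JSON body might look like; the class name and field names below are illustrative assumptions, not the PR's actual payload:

```java
// Hypothetical sketch of a queue-full JSON body returned with HTTP 429
// instead of Jetty's default HTML error page.
public class QueueFullResponse {
  public static String toJson(int queueSize, int maxQueueSize) {
    return String.format(
        "{\"status\":429,\"error\":\"Export queue is full (%d/%d jobs). Please retry later.\"}",
        queueSize, maxQueueSize);
  }
}
```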

ContainerHealthSchemaManager.java

  • Added getUnhealthyContainersCursor() — jOOQ lazy cursor for streaming DB records without holding them all in JVM heap
  • Added getUnhealthyContainersCount() — fast COUNT(*) used before the cursor opens for progress estimation
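The count-then-stream pattern these two methods enable can be sketched without a database; in this plain-Java illustration the `Iterator` stands in for the jOOQ lazy cursor, and the class and callback are assumptions, not the PR's code:

```java
import java.util.Iterator;
import java.util.function.LongConsumer;

// Sketch: a cheap COUNT(*) supplies estimatedTotal up front, then records are
// consumed lazily (never all in heap) while progress percent is reported.
public class ProgressStream {
  public static long consume(long estimatedTotal, Iterator<String> cursor,
                             LongConsumer onProgressPercent) {
    long total = 0;
    while (cursor.hasNext()) {
      cursor.next();                        // in the PR: write this row to a CSV part file
      total++;
      if (estimatedTotal > 0) {
        onProgressPercent.accept(Math.min(100, 100 * total / estimatedTotal));
      }
    }
    return total;                           // final totalRecords
  }
}
```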

ReconServerConfigKeys.java

New config keys:

  • ozone.recon.export.worker.threads (default: 1)
  • ozone.recon.export.directory (default: /tmp/recon/exports)
  • ozone.recon.export.max.jobs.total (default: 10)
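Assuming these keys follow the usual Hadoop-style configuration format used by Ozone, overriding the defaults would look roughly like this in ozone-site.xml (a sketch, not taken from the PR):

```xml
<!-- ozone-site.xml: illustrative overrides for the new export keys -->
<property>
  <name>ozone.recon.export.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>ozone.recon.export.directory</name>
  <value>/tmp/recon/exports</value>
</property>
<property>
  <name>ozone.recon.export.max.jobs.total</name>
  <value>10</value>
</property>
```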

Frontend Changes (containers.tsx, container.types.ts)

New: Export Tab (tab key '6')

A dedicated Export tab is added to the Containers page alongside Missing, Under-Replicated, etc. It contains:

Submit Controls:

  • Dropdown to select container state (Missing, Under-Replicated, Over-Replicated, Mis-Replicated, Replica Mismatch)
  • "Export CSV" button — POSTs to backend and immediately shows the job in the table below

Active Exports table (hidden when empty):

  • Columns: Job ID (8-char + full ID tooltip), State, Status (colored Tag), Queue Position (#1, #2...), Progress bar + record count
  • No pagination — always compact

Completed Exports table (always visible, paginated):

  • Columns: Job ID, State, Status, Records, Submitted, Started, Completed, Action
  • Download button (only for COMPLETED jobs) — triggers TAR file download to browser
  • Error message tooltip (for FAILED jobs)
  • Timestamps formatted as MMM D, HH:mm:ss

Polling:

  • 3-second interval using setInterval + useRef — starts when Export tab is opened or a job is submitted
  • Auto-stops when no QUEUED or RUNNING jobs remain

Error handling:

  • 429 queue-full error shows a 6-second toast with the specific message
  • All errors show clean messages (no raw HTML from Jetty)
  • Guard in fetchTabData prevents undefined API calls when Export tab is active
What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14913

How was this patch tested?

Sample log output from a test export run:

```
2026-04-30 09:46:48,962 [pool-56-thread-1] INFO api.ExportJobManager: Starting export job ac16b513-f3f0-4e2d-a124-f208155697c3
2026-04-30 09:46:54,625 [pool-56-thread-1] INFO api.ExportJobManager: Export job ac16b513-f3f0-4e2d-a124-f208155697c3 will process approximately 3040000 records
2026-04-30 09:46:54,628 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part1
2026-04-30 09:47:28,413 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part2
2026-04-30 09:47:57,420 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part3
2026-04-30 09:47:58,876 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part4
2026-04-30 09:48:00,646 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part5
2026-04-30 09:48:02,488 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part6
2026-04-30 09:48:04,261 [pool-56-thread-1] INFO api.ExportJobManager: Created CSV file: part7
2026-04-30 09:48:04,429 [pool-56-thread-1] INFO api.ExportJobManager: Export job ac16b513-f3f0-4e2d-a124-f208155697c3 wrote 3040000 records across 7 files
2026-04-30 09:48:05,730 [pool-56-thread-1] INFO api.ExportJobManager: Created TAR archive: /tmp/recon/exports/export_missing_webui_ac16b513.tar
2026-04-30 09:48:05,755 [pool-56-thread-1] INFO api.ExportJobManager: Deleted temporary CSV files for job ac16b513-f3f0-4e2d-a124-f208155697c3
2026-04-30 09:48:05,755 [pool-56-thread-1] INFO api.ExportJobManager: Completed export job ac16b513-f3f0-4e2d-a124-f208155697c3 (3040000 records)
```

Demo video: CSV_Export_Feature.mp4

@devmadhuu devmadhuu self-requested a review April 30, 2026 10:17
devmadhuu (Contributor) commented:

@ArafatKhan2198 as discussed, please design the solution server-side for a single Recon user. We don't have user-based logins in Recon. We should not localize the job-progress logic in the browser: all browser windows opened on multiple machines showing the Recon page should see the same jobs and their progress. Only one job should be allowed to run at a time; the remaining jobs should go into the queue.

sumitagrawl (Contributor) left a review:

@ArafatKhan2198 Thanks for working on this; I've left a few comments.

```java
private long estimatedTotal;

@JsonProperty("filePath")
private String filePath;
```

Exposing an internal file path may be a security risk; we should not return the file path.

```java
 */
public static final String OZONE_RECON_EXPORT_DIRECTORY =
    "ozone.recon.export.directory";
public static final String OZONE_RECON_EXPORT_DIRECTORY_DEFAULT = "/tmp/recon/exports";
```

We should avoid a tmp path and keep exports under the same Recon metadata path.

```java
  LOG.error("Failed to create export directory: {}", exportDirectory, e);
}

LOG.info("ExportJobManager initialized with single-threaded queue (max {} jobs)", MAX_QUEUE_SIZE);
```

A restart of Recon will leave these files orphaned; IMO we should be able to retrieve those jobs again via the UI.

```java
  LOG.error("Failed to create export directory: {}", exportDirectory, e);
}

LOG.info("ExportJobManager initialized with single-threaded queue (max {} jobs)", MAX_QUEUE_SIZE);
```

Files from failed jobs remain on disk; we should remove them, i.e. clean up failed jobs.

```java
LOG.info("ExportJobManager initialized with single-threaded queue (max {} jobs)", MAX_QUEUE_SIZE);
}

public synchronized String submitJob(String userId, String state, int limit, long prevKey) {
```

We need to remove userId.

```java
ExportJob job = new ExportJob(jobId, userId, state, limit, prevKey);
// Filename format: export_{state}_{userId}_{shortJobId}.tar
String shortJobId = jobId.substring(0, 8);
String filePath = exportDirectory + "/export_" + state.toLowerCase() + "_" + userId + "_" + shortJobId + ".tar";
```

How are you ensuring the files are unique? Why do you need uniqueness? Maybe a timestamp can be added.

```java
int fileIndex = 1;
long totalRecords = 0;
long recordsInCurrentFile = 0;
final int CHUNK_SIZE = 500_000;
```

Rename to RECORD_SIZE.

```java
  }
}

private void deleteDirectory(Path directory) {
```

Reuse the existing recursive directory-delete utility.

```java
} finally {
  // 3-second cooldown before the next queued job is picked up by the single worker thread.
  try {
    Thread.sleep(3000);
```

The sleep may not provide any advantage, as the tar logic already introduces some delay during which a waiting task can acquire the lock and proceed. Also, a maximum download-file count would help avoid this.

```typescript
  replicaMismatchCount: number;
}

export type ExportJobStatus = 'QUEUED' | 'RUNNING' | 'COMPLETED' | 'FAILED';
```

UI-related:

  1. The Download button should not show the filename; it can simply be downloaded.
  2. The Active Exports and Completed Exports tables can be combined, adding submitted-time and started-time fields.
  3. Should DELETE act as cancel, and/or also delete completed jobs?

devmadhuu (Contributor) left a review:

Thanks @ArafatKhan2198 for improving the patch. However, a few comments; please check.
Also, I am not sure of any cleanup or TTL for completed jobs. How will these exported files be cleaned up, and what is their lifecycle? Can they continue to accumulate indefinitely?

```java
}

// Check global queue size limit
synchronized (jobQueue) {
```

This is very confusing. This method acquires this object's lock, then the lock on jobQueue. Any other thread that acquires jobQueue first and then tries to call a synchronized method creates a deadlock condition.

devmadhuu commented May 5, 2026:

Please check; this is still not solved.

ArafatKhan2198 (Author) replied:

Deadlock from nested synchronized(this) + synchronized(jobQueue): fixed. Removed synchronized from submitJob and moved all queue checks and mutations into a single synchronized(jobQueue) block. One lock everywhere, no nesting, no possible lock-order deadlock.

```java
@Singleton
public class ExportJobManager {
  private static final Logger LOG = LoggerFactory.getLogger(ExportJobManager.class);
  private static final int MAX_QUEUE_SIZE = 4;
```

Better to put a comment here: why is this hardcoded as 4?

ArafatKhan2198 (Author) replied:

Replaced with a maxQueueSize field read from the ozone.recon.export.max.jobs.total config (default 4). The Javadoc explains the choice: single-threaded worker, ~5 unhealthy states, one-TAR-per-state rule.

```java
 */
public static final String OZONE_RECON_EXPORT_MAX_JOBS_TOTAL =
    "ozone.recon.export.max.jobs.total";
public static final int OZONE_RECON_EXPORT_MAX_JOBS_TOTAL_DEFAULT = 10;
```

Are these used anywhere? Also, doesn't the default contradict your thread-queue size?

ArafatKhan2198 (Author) replied:

The constant is now actively read in the ExportJobManager constructor and wired to maxQueueSize. The default is updated to 4 to match the design.

```java
 * Default: 1
 */
public static final String OZONE_RECON_EXPORT_WORKER_THREADS =
    "ozone.recon.export.worker.threads";
```

Is this used?

ArafatKhan2198 (Author) replied:

Removed the constant entirely. Export is intentionally single-threaded to avoid concurrent Derby access; a worker-threads config would be misleading.

```java
 * Manages asynchronous CSV export jobs.
 */
@Singleton
public class ExportJobManager {
```

Add some unit tests for this class.

ArafatKhan2198 (Author) replied:

Added TestExportJobManager covering: submit success, empty result, duplicate state (running + completed), failed-state retry, queue full, cancel running, cancel completed, unknown job, queue position, startup cleanup, and filename pattern. Plus TestExportJob for the download counter and path derivation.


```java
// Controls how many rows Derby returns per JDBC round-trip.
// Default is 10,000 rows.
query.fetchSize(10000);
```

This is hardcoded again. In the old PR it was fixed.

ArafatKhan2198 (Author) replied:

The hardcoded fetchSize(10000) is fixed. It now reads from ozone.recon.unhealthy.container.fetch.size (default 10,000), wired in the ContainerHealthSchemaManager constructor via OzoneConfiguration.

```java
 * @param prevKey Container ID offset for cursor-based pagination
 * @return Total count of matching containers
 */
public long getUnhealthyContainersCount(
```

Check the Javadoc above this method; something seems wrong.

```java
  this.totalRecords = totalRecords;
}

public void incrementTotalRecords() {
```

Not sure what the purpose of this is. The current code passes a local long totalRecords counter and calls setTotalRecords on every row; using incrementTotalRecords() would remove the local counter.

ArafatKhan2198 (Author) replied:

Removed.

ArafatKhan2198 (Author) commented:

@devmadhuu @sumitagrawl please take another look

devmadhuu (Contributor) left a review:

@ArafatKhan2198 A few comments are still unresolved; please check.

```java
private int maxDownloads;

@JsonProperty("downloadCount")
private int downloadCount;
```

We should make this an AtomicInteger to avoid any race conditions.

ArafatKhan2198 (Author) replied:

Replaced int downloadCount with an AtomicInteger and introduced a single tryReserveDownload() method that atomically checks and increments in one CAS loop, so concurrent download requests can't race past the limit.
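The described check-and-increment CAS loop can be sketched as follows; the wrapper class is illustrative, but the `tryReserveDownload()` name and the AtomicInteger approach follow the reply above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the fix: atomically check the download limit and increment the
// counter in one CAS loop, so two concurrent requests cannot both pass a
// separate check-then-increment race.
public class DownloadLimiter {
  private final int maxDownloads;
  private final AtomicInteger downloadCount = new AtomicInteger();

  public DownloadLimiter(int maxDownloads) { this.maxDownloads = maxDownloads; }

  public boolean tryReserveDownload() {
    while (true) {
      int current = downloadCount.get();
      if (current >= maxDownloads) {
        return false;                               // limit reached, reject
      }
      if (downloadCount.compareAndSet(current, current + 1)) {
        return true;                                // reservation won atomically
      }
      // CAS lost to a concurrent caller; re-read the counter and retry.
    }
  }
}
```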

```java
final int RECORDS_PER_FILE = 500_000;

BufferedWriter writer = null;
FileOutputStream fos = null;
```

Better to handle fos in a try/finally block to avoid any resource leak.

ArafatKhan2198 (Author) replied:

fos in try/finally: fixed. Added an inner try/finally around the BufferedWriter construction; if wrapping fails, fos.close() is called immediately, so no file descriptor leaks.
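A sketch of that leak-safe wrapping pattern; the factory class and method names here are illustrative, not the PR's code.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Sketch: if wrapping the stream in a writer throws, close the underlying
// FileOutputStream immediately instead of leaking the file descriptor.
public class SafeWriterFactory {
  public static BufferedWriter open(File file) throws IOException {
    FileOutputStream fos = new FileOutputStream(file);
    try {
      return new BufferedWriter(new OutputStreamWriter(fos, StandardCharsets.UTF_8));
    } catch (RuntimeException | Error e) {
      fos.close();   // wrapping failed: release the descriptor before rethrowing
      throw e;
    }
  }
}
```

Once the writer is returned, closing it closes the wrapped stream, so only the construction window needs this guard.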

@ArafatKhan2198 ArafatKhan2198 requested a review from devmadhuu May 5, 2026 14:45
