WIP: hdfs: add Kerberos authentication support#4419
Conversation
|
I would like to know how I trigger the review process. |
Jeffail
left a comment
There was a problem hiding this comment.
Thank you for the contribution, @Khalid-Nowaf — Kerberos support here has been long awaited. I spent some time tracing through the diff and the upstream libraries; the notes below are intended as collaborative feedback.
Kerberos client lifecycle (internal/impl/hdfs/config.go + input.go/output.go Close)
Login() inside hdfsKerberosConfig.client() spawns a background renewal goroutine via gokrb5's enableAutoSessionRenewal (client/session.go). Because the client is constructed with NewWithKeytab, the TGT will refresh indefinitely for the lifetime of the *krbclient.Client — so long-running pipelines should authenticate without issue, which is great.
However, neither hdfsReader.Close nor hdfsWriter.Close currently calls Destroy() on the *krbclient.Client (which stops the renewal goroutine, drops the session, and zeroes credential material), nor closes the underlying *hdfs.Client. For a process that brings the component up once and runs forever this is harmless, but on component rebuild (config reload, fatal-error retry) the previous instance's renewal goroutine retains a reference to the Client and the in-memory keytab, preventing GC. Stashing the Kerberos client on the reader/writer struct and calling Destroy() (plus closing the HDFS client) in Close would resolve both.
Dead branch in validateInput (config.go)
hdfsConfigFromParsed already invokes c.kerberos.validate() before validateInput runs, and validate() rejects any data_transfer_protection value outside {\"\", \"authentication\", \"integrity\", \"privacy\"}. The default: arm of the validateInput switch is therefore unreachable and could be removed for clarity.
`user` field semantics under Kerberos
In hdfs/v2, when KerberosClient is set, User becomes the effective/proxy user rather than the authenticating identity (which is derived from the principal). The current field description — "A user ID to connect as." — is accurate for the non-Kerberos case but potentially misleading once Kerberos is enabled. A brief note in the description would help operators understand the interaction.
Origin of the input integrity/privacy restriction
Would you mind clarifying the source of the "integrity"/"privacy" limitation for the HDFS input? Knowing whether this stems from colinmarc/hdfs/v2, a deliberate Connect-side decision, or a gap to be closed later would be useful for operators — and a short reference in the validation error message itself would aid debugging in the field.
|
Thanks @Jeffail for the comprehensive review. I’d like to provide more context on the origin of the input integrity/privacy restriction, before jumping to refactor the code based on your review. Before working on this, I set up a local HDFS cluster with Kerberos enabled and created local integration tests (Im using macOS, and the current integration tests does not support my OS): Local Integration Test Setup. Reproduce Failure Script Once the environment was ready, the following test cases failed:
This does not appear to be a Kerberos authentication issue or a problem with the HDFS setup. Instead, it looks like an issue in the upstream Go HDFS client’s DataNode read path when the stream is wrapped with either integrity or privacy protection. I started debugging the upstream HDFS client, but it required more time than I initially expected, so I treated this as an upstream bug for now. For that reason, I block these two modes only for the HDFS input path to avoid exposing a configuration that I know will fail. But still allow them for HDFS output because output operations were tested locally and worked correctly, including with privacy. |
|
I think the issue is with my environment or local test setup. I’ll try using the current integration tests on a Linux machine and add Kerberos (krb5) integration tests as well. I’ll get back once I’ve tested it again. |
Adds Kerberos authentication support for the HDFS connector by upgrading to github.com/colinmarc/hdfs/v2 and introducing shared HDFS auth configuration for input and output.
Changes
Related Issues
#1347