@maelvls maelvls commented Nov 8, 2022

This PR implements the design presented in Solution 3: request, then reset if retrieve returns “Click Retry” or “WebSDK CertRequest”.

This PR fixes #239.

Update 29 Nov 2022: I put this PR back to "draft" because I am less and less confident in the changes I am making here. I propose that we first agree on a set of RetrieveCertificate tests in #270 before I make any change to RetrieveCertificate. Once #270 is merged, we can proceed with the current PR.
Update 5 Dec 2022: I am now confident with the PR, it can now be reviewed.


The problem: When enrolling a new certificate, for example by running vcert enroll or when using cert-manager, people get "stuck" with the error message:

500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing,
Status: This certificate cannot be processed while it is in an error state. Fix any
errors, and then click Retry., Stage: 500.

or:

500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing,
Status: WebSDK CertRequest Module Requested Certificate, Stage: 500.

This message occurs when a past enrollment has failed or when an enrollment is still in progress for that certificate. The current workaround is to call POST /reset with Restart=False, and then re-run the command vcert enroll (or renew the certificate in cert-manager).
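
To make the two "stuck" states concrete, here is a minimal Go sketch of how a caller might recognize them from a retrieve response. The helper `needsReset` and the variable `resetHints` are hypothetical names used for illustration only, not vcert's API; the matched substrings are copied from the two error messages quoted above, and the check assumes they arrive with HTTP status 500.

```go
package main

import (
	"fmt"
	"strings"
)

// resetHints holds substrings copied from the two error messages quoted in
// the PR description. The variable name is hypothetical.
var resetHints = []string{
	"This certificate cannot be processed while it is in an error state",
	"WebSDK CertRequest Module Requested Certificate",
}

// needsReset is a hypothetical helper (not part of vcert). It reports whether
// a retrieve response looks like one of the two "stuck" states.
func needsReset(statusCode int, msg string) bool {
	if statusCode != 500 {
		return false
	}
	for _, hint := range resetHints {
		if strings.Contains(msg, hint) {
			return true
		}
	}
	return false
}

func main() {
	stuck := `Certificate \VED\Policy\Test\foo.com has encountered an error while processing, ` +
		`Status: WebSDK CertRequest Module Requested Certificate, Stage: 500.`
	other := "Status: Post CSR failed with error: Cannot connect to the certificate authority (CA)."

	fmt.Println(needsReset(500, stuck)) // true
	fmt.Println(needsReset(500, other)) // false
}
```

Matching on message text is brittle, which is one reason the description later checks that other projects do not depend on changed messages.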

The Jenkins build 2378 is passing for 63f7dff.

From a user perspective, nothing changes: the OAuth token scope is the same as before (certificate:manage).

Self-review:

  • I went with a "mock" HTTP server because of the difficulty of getting the 202 and 500 HTTP status codes "on demand" from TPP. I didn't add a "live test" because I don't know how to consistently force TPP to return a 500 without RDP'ing into the VM and either running a PowerShell script or turning off the Windows CA (it can't be a policy check such as the domain, as this would be a "Stage 0" error, which is the only stage number that gets reset upon requesting a new certificate).
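
As an illustration of the "mock" approach, here is a small Go sketch of a scripted handler that returns 500 and then 202 on demand. It is not the test code from this PR: `canned`, `scripted`, and `call` are hypothetical names, the response bodies are placeholders, and the handler ignores the request path. It uses `httptest.NewRecorder` so that no network listener is needed; the same handler could be passed to `httptest.NewServer` to get a real URL.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// canned is one scripted response: an HTTP status code and a body.
type canned struct {
	status int
	body   string
}

// scripted returns a handler that answers each request with the next
// scripted response, whatever the path. Hypothetical helper.
func scripted(script []canned) http.Handler {
	var mu sync.Mutex
	next := 0
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		if next >= len(script) {
			http.Error(w, "script exhausted", http.StatusTeapot)
			return
		}
		w.WriteHeader(script[next].status)
		io.WriteString(w, script[next].body)
		next++
	})
}

// call sends one POST to the handler in-process and returns status and body.
func call(h http.Handler, path string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, path, strings.NewReader("{}"))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	// Placeholder bodies; a real TPP response would carry a JSON payload.
	h := scripted([]canned{{500, "stuck"}, {202, "pending"}})
	for i := 0; i < 2; i++ {
		code, body := call(h, "/retrieve")
		fmt.Println(code, body)
	}
}
```

Scripting the responses in order keeps each test case readable as a sequence of "TPP says X, then Y".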

Manual test performed:

Apart from the "mock" tests that I added, I also manually tested this change with a live TPP instance:

Reproducing the manual test
  1. I spun up a TPP instance on CloudShare (using my Jetstack account), using the template "Master NGINX/K8s Blueprint" (snapshot "[6] training_2020-08-28") that comes with TPP 20.1.

  2. I went to https://uvo1dq3vmkecfydcwxm.env.cloudshare.com/aperture/application-integrations/ and created a new API integration, since none of the existing ones contains configuration:manage (which I need to set the correct policy):

  3. I created a token and stored it in the env var TOKEN with the following:

    TPP_URL=https://uvo1dq3vmkecfydcwxm.env.cloudshare.com
    TPP_USER=tppadmin
    TPP_PASSWORD=
    TOKEN=$(vcert getcred -u $TPP_URL --username $TPP_USER --password $TPP_PASSWORD --client-id=test --scope='certificate:manage;configuration:manage' --format json | jq -r .access_token)
  4. Since we need to trigger an enrollment failure at a stage higher than 0, we will purposefully break the CA. To do that, I RDP'ed into the TPP VM, I opened the program certsrv, right-clicked on the only entry, "Manage CA...", and then clicked the "Stop" button:

  5. I then enrolled:

    vcert enroll -u "$TPP_URL" -t "$TOKEN" --cn example.com -z 'TLS/SSL' --san-dns=example.com

    As expected, it shows:

    vCert: 2022/11/14 14:58:38 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: Post CSR failed with error: Cannot connect to the certificate authority (CA). Verify that your CA template settings are correct and that the remote server is available. For more information, search the Help system for Configuring the Microsoft Certificate Services Template Object., Stage: 500.

  6. If we try to submit the certificate again, we still get the error at stage 500 (weirdly, the original message has disappeared and a generic message appears instead):

    vCert: 2022/11/14 14:59:43 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.

  7. I then fixed the CA:

  8. I then tried to enroll again, and the "old" error with the stage 500 still shows:

    vCert: 2022/11/14 15:06:30 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.

  9. Now, from this project's folder, let us run vcert with the changes made in this PR:

    git fetch origin refs/pull/269/head:reset-before-request
    git checkout reset-before-request
    go run ./cmd/vcert enroll -u "$TPP_URL" -t "$TOKEN" --cn example.com -z 'TLS/SSL' --san-dns=example.com

    This time, the certificate gets properly requested.

Changes in error messages: this PR purposefully doesn't change any of the error messages.

Possible breakages to other projects:

  • terraform-provider-venafi uses RetrieveCertificate in three locations. One of them matches on the following specific error message:
    "unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 400 Failed to lookup private key, error: Failed to lookup private key vault id"
    I added a test proving that this same error message will be shown in the same circumstances.
  • vault-pki-backend-venafi won't be affected since it doesn't parse messages. It uses RetrieveCertificate once.
  • cert-manager has a single call to RetrieveCertificate and doesn't look at the messages.

@maelvls maelvls force-pushed the reset-before-request branch 2 times, most recently from e835061 to a4172b9 Compare November 10, 2022 11:53
@maelvls maelvls force-pushed the reset-before-request branch 5 times, most recently from 5b65301 to 1513047 Compare November 10, 2022 17:02
@maelvls maelvls marked this pull request as draft November 15, 2022 09:05
@maelvls maelvls force-pushed the reset-before-request branch from d521fe2 to cb8e5ee Compare November 15, 2022 10:39
@maelvls maelvls marked this pull request as ready for review November 15, 2022 10:45
@inteon inteon left a comment

Thank you for fixing the remaining issues.

@rvelaVenafi rvelaVenafi left a comment

Calling reset unconditionally results in too many unnecessary calls.
The logic to reset based on the request response is better and not more complex than this PR.
Just move this code to a separate function and validate the request response when statusCode == 500 and msg is the same as the expected JSON blob.
You can even wrap the current logic into another function with a boolean parameter:

func (c *Connector) RequestCertificate(req *certificate.Request) (requestID string, err error) {
        // First try: reset the certificate if the request fails.
        requestID, err = c.RequestCertificateWithReset(req, true)
        if err == CertResetError {
                // Second try: a reset was performed, so request again without resetting.
                requestID, err = c.RequestCertificateWithReset(req, false)
        }
        return requestID, err
}

Then have the actual code be moved to a second internal function:

func (c *Connector) RequestCertificateWithReset(req *certificate.Request, reset bool) (requestID string, err error) {
        // Current RequestCertificate logic here

        requestID, err = parseRequestResult(statusCode, status, body)
        if err != nil {
                if reset {
                        // Invoke the reset logic here.
                        if resetErr := resetCertificateResource(certDN, false); resetErr != nil {
                                // Error that happened during the reset.
                                return "", resetErr
                        }
                        // This error tells the caller that the request failed because of a 500 and that a reset was attempted.
                        return "", CertResetError
                }
                // Return the normal error.
                return "", err
        }
        return requestID, nil
}

This way we can minimize the number of extra calls required for this use case

maelvls commented Nov 21, 2022

The logic to reset based on the request response is better and not more complex than this PR.

I forgot to mention in the PR description that it is not possible to use POST /request as a means of knowing whether this certificate should be reset.

The reason it is not possible is that POST /request always succeeds as long as the given CSR or certificate parameters are valid. Calling POST /request doesn't allow you to know whether POST /reset needs to be called or not. I have elaborated a bit on that in #269 (comment).
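
The consequence for the call sequence can be sketched in Go with a fake backend. Everything here is hypothetical (the `tpp` interface, `fakeTPP`, `looksStuck`, `enroll`): the fake's `Request` always succeeds, as described above, so the stuck state only becomes visible on `Retrieve`. The recovery path mirrors the manual workaround from the description (reset with Restart=False, then request again); it is not necessarily what this PR's RetrieveCertificate does.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// tpp is a hypothetical, trimmed-down view of the three calls involved.
type tpp interface {
	Request(certDN string) error
	Retrieve(certDN string) (string, error)
	Reset(certDN string, restart bool) error
}

// errStuck mimics the "click Retry" error quoted in the PR description.
var errStuck = errors.New(`500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing, ` +
	`Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.`)

// fakeTPP records the calls it receives. While stuck is true, Retrieve fails
// even though Request succeeds.
type fakeTPP struct {
	stuck bool
	calls []string
}

func (f *fakeTPP) Request(string) error {
	f.calls = append(f.calls, "request")
	return nil // never reveals the stuck state
}

func (f *fakeTPP) Retrieve(string) (string, error) {
	f.calls = append(f.calls, "retrieve")
	if f.stuck {
		return "", errStuck
	}
	return "PEM", nil
}

func (f *fakeTPP) Reset(string, bool) error {
	f.calls = append(f.calls, "reset")
	f.stuck = false
	return nil
}

func (f *fakeTPP) trace() string { return strings.Join(f.calls, " > ") }

// looksStuck matches the two known messages on the retrieve error.
func looksStuck(err error) bool {
	return err != nil && (strings.Contains(err.Error(), "click Retry") ||
		strings.Contains(err.Error(), "WebSDK CertRequest"))
}

// enroll requests, retrieves, and only resets when retrieve looks stuck.
func enroll(c tpp, certDN string) (string, error) {
	if err := c.Request(certDN); err != nil {
		return "", err
	}
	cert, err := c.Retrieve(certDN)
	if !looksStuck(err) {
		return cert, err
	}
	if err := c.Reset(certDN, false); err != nil {
		return "", err
	}
	if err := c.Request(certDN); err != nil {
		return "", err
	}
	return c.Retrieve(certDN)
}

func main() {
	f := &fakeTPP{stuck: true}
	cert, err := enroll(f, `\VED\Policy\Test\foo.com`)
	fmt.Println(cert, err) // PEM <nil>
	fmt.Println(f.trace()) // request > retrieve > reset > request > retrieve
}
```

On the happy path (`stuck: false`) the trace is just `request > retrieve`, so no extra HTTP call is made.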

maelvls commented Nov 24, 2022

I finished implementing Solution 3: request, then reset if retrieve returns “Click Retry” or “WebSDK CertRequest”. Please take another look.

@maelvls maelvls force-pushed the reset-before-request branch from bcbe7c5 to 539fa7d Compare November 24, 2022 11:34
@maelvls maelvls force-pushed the reset-before-request branch 5 times, most recently from d6dbde9 to 0f35f33 Compare November 25, 2022 09:21
@maelvls maelvls force-pushed the reset-before-request branch 2 times, most recently from 1fe5eec to 58d96cc Compare December 8, 2022 11:47
When enrolling a new certificate, for example by submitting a
user-provided CSR, people using cert-manager and other tools that
automatically enroll certificates would get "stuck" with an error
like:

    500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing,
    Status: This certificate cannot be processed while it is in an error state. Fix any
    errors, and then click Retry., Stage: 700.

(the "stage" number doesn't matter)

Running "vcert enroll" or forcing the re-issuance in cert-manager would
not have the expected effect: the same error would show up over and
over.

The initial idea was to do a systematic "reset" before requesting, but
it was deemed too costly, as it would mean one more HTTP call on the
happy path of "vcert enroll".

The proposed solution/workaround is to do that in the
RetrieveCertificate func. It isn't ideal, but it should not affect
people since /retrieve and /reset share the same OAuth scope.
@maelvls maelvls force-pushed the reset-before-request branch from 58d96cc to 9e6e119 Compare December 9, 2022 16:20
// | 400 {"Error":"CertificateDN error..."} | Certificate DN not found |
// | 400 {"Error":"No reset is required..."} | No enrollment to be reset |
certDN := getCertificateDN(c.zone, req.Subject.CommonName)
statusCode, status, body, err := c.request("POST", urlResourceCertificateReset, struct {
Contributor

@maelvls Why are we resetting the certificate before enrollment, within the same workflow? As it stands, this makes an unnecessary call to TPP. It would be better, as other team members have already suggested, to put this call in the error-handling logic, so that it doesn't interrupt the workflow and is only made when the certificate has errored because it is in a bad state; e.g.:
pseudo code:

statuscode, err = request
if err and status = 500 {
 resetCertificate() // reset certificate only on failure
}

Contributor Author

(from #269 (comment)) It is not possible to use POST /request as a means of knowing whether this certificate should be reset. The reason is that POST /request always succeeds as long as the given CSR or certificate parameters are valid. Calling POST /request doesn't allow you to know whether POST /reset needs to be called or not. I have elaborated a bit on this in #269 (comment).

maelvls commented Jan 2, 2023

This PR made it into vcert v4.23.0. We can now close #239.

Update: this change was reverted in 4.24.0.

@maelvls maelvls deleted the reset-before-request branch January 2, 2023 12:06
inteon added a commit to jetstack/vcert that referenced this pull request Jan 20, 2023
maelvls added a commit to maelvls/vcert that referenced this pull request Jan 20, 2023
In Venafi#269, I got convinced that it
would be a good solution to call "Reset" only if the "Retrieve" call
returned a known message.

Later on, we realized that there was a bad interaction between "Request"
and "Reset(restart=true)". For some reason, when a problem arises (such
as CA being down), TPP returns the old certificate, and vcert ends up
showing the message "unmatched key modulus".

We realized that calling "Reset(restart=false)" before Request prevents
this bug. Although that's one extra HTTP call, it seems this call is
very inexpensive. One downside that was brought up during the PR Venafi#269
was that any extra HTTP call would slow the TPP server because the HTTP
calls are "queued" (not concurrently processed).
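
If reset is called before every request, it will often have nothing to do. The diff excerpt earlier in this thread lists two 400 responses for that case ("CertificateDN error..." and "No reset is required..."). Below is a minimal Go sketch of how such responses might be classified. The name `nothingToReset` is hypothetical, the source truncates both messages so only their prefixes are matched, and treating both as harmless is an assumption rather than confirmed vcert behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// resetError mirrors the {"Error":"..."} shape shown in the diff excerpt.
type resetError struct {
	Error string `json:"Error"`
}

// nothingToReset is a hypothetical helper. It reports whether a reset
// response means "there was nothing to reset", in which case the caller
// could go on with the request. Assumption: both 400 cases are harmless.
func nothingToReset(statusCode int, body []byte) bool {
	if statusCode != 400 {
		return false
	}
	var e resetError
	if err := json.Unmarshal(body, &e); err != nil {
		return false
	}
	// Only prefixes are matched because the source elides the full messages.
	return strings.HasPrefix(e.Error, "CertificateDN error") ||
		strings.HasPrefix(e.Error, "No reset is required")
}

func main() {
	fmt.Println(nothingToReset(400, []byte(`{"Error":"No reset is required..."}`))) // true
	fmt.Println(nothingToReset(400, []byte(`{"Error":"CertificateDN error..."}`)))  // true
	fmt.Println(nothingToReset(500, []byte(`{"Error":"No reset is required..."}`))) // false
}
```

Any other reset failure would still be returned to the caller, so a genuine problem with the reset call is not hidden.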
maelvls added a commit to maelvls/vcert that referenced this pull request Jan 24, 2023
maelvls added a commit to maelvls/vcert that referenced this pull request Jan 24, 2023
inteon added a commit to inteon/vcert that referenced this pull request Feb 8, 2023
luispresuelVenafi added a commit that referenced this pull request Feb 9, 2023
Revert "Merge pull request #269 from maelvls/reset-before-request"
eyalle pushed a commit to eyalle/vcert that referenced this pull request Feb 13, 2023
eyalle pushed a commit to eyalle/vcert that referenced this pull request Feb 13, 2023
eyalle pushed a commit to eyalle/vcert that referenced this pull request Feb 13, 2023
inteon added a commit to jetstack/vcert that referenced this pull request Feb 15, 2023
maelvls commented May 23, 2023

This improvement was reverted in 4.24.0 (and later versions). You can read #273 (comment) to learn more.

Development

Successfully merging this pull request may close these issues.

Provide the ability to reset the certificate object in Venafi TPP

6 participants