Change RetrieveCertificate so that it resets the enrollment (if it exists) while retrieving the certificate #269

maelvls · 2022-11-08T13:41:24Z

This PR implements the design presented in Solution 3: request, then reset if retrieve returns “Click Retry” or “WebSDK CertRequest”.

This PR fixes #239.

Update 29 Nov 2022: I put this PR back to "draft" because I am less and less confident in the changes I am making here. I propose that we first agree on a set of RetrieveCertificate tests in #270 before I make any change to RetrieveCertificate. Once #270 is merged, we can proceed with the current PR.
Update 5 Dec 2022: I am now confident in this PR; it can be reviewed.

The problem: When enrolling a new certificate, for example by running vcert enroll or when using cert-manager, people get "stuck" with the error message:

500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing,
Status: This certificate cannot be processed while it is in an error state. Fix any
errors, and then click Retry., Stage: 500.

or:

500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing,
Status: WebSDK CertRequest Module Requested Certificate, Stage: 500.

This message occurs when a past enrollment has failed or when an enrollment was still in progress for that certificate. The current workaround is to call POST /reset with Restart=False, and then re-run the command vcert enroll (or renew the certificate in cert-manager).
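As a rough illustration of the workaround, here is a minimal Go sketch of the reset call. The endpoint path `/vedsdk/certificates/reset`, the JSON field names, the bearer-token header, and the helper names `resetCertificate` and `demoReset` are assumptions made for this example; they are not taken from this PR. The demo talks to a local stand-in server, not to a real TPP instance.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// resetRequest is the assumed JSON body of the reset call. The field names
// are an assumption based on the "Restart=False" wording above.
type resetRequest struct {
	CertificateDN string `json:"CertificateDN"`
	Restart       bool   `json:"Restart"`
}

// resetCertificate posts a reset for certDN without restarting the enrollment.
// The URL path and the Authorization header format are assumptions.
func resetCertificate(client *http.Client, baseURL, token, certDN string) error {
	payload, err := json.Marshal(resetRequest{CertificateDN: certDN, Restart: false})
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/vedsdk/certificates/reset", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// This sketch treats anything other than 200 as a failure.
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("reset failed: %s: %s", resp.Status, body)
	}
	return nil
}

// demoReset runs resetCertificate against a local stand-in server and
// returns the request body that the server received.
func demoReset() (string, error) {
	var got string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		got = string(b)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	err := resetCertificate(srv.Client(), srv.URL, "fake-token", `\VED\Policy\Test\foo.com`)
	return got, err
}

func main() {
	body, err := demoReset()
	fmt.Println(body, err)
}
```

After a successful reset, the user would re-run vcert enroll as described above.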

The Jenkins build 2378 is passing for 63f7dff.

From a user perspective, nothing changes: the OAuth token scope is the same as before (certificate:manage).

Self-review:

I considered two ways of testing this change:
- Fake server: run the tests against a fake TPP server. There is currently no fake server (note that @wallrj started to write one in WIP: Start a fake TPP server and run the TPP tests against it #262). The fake server, as opposed to the mock server, contains some logic.
- Mock responses: serve canned HTTP responses from a mock HTTP server.

I went with the "mock" HTTP server because it is difficult to get the 202 and 500 HTTP status codes "on demand" from TPP. I didn't add a "live test" because I don't know how to consistently force TPP to return a 500 without RDP'ing into the VM and either adding a PowerShell script or turning off the Windows CA. The failure can't come from a policy check such as the domain, since that would be a "Stage 0" error, and stage 0 is the only stage that gets reset when a new certificate is requested.
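To make the "mock responses" option concrete, here is a minimal Go sketch of a mock server that replays canned status codes in order. It is not the test code from this PR; the names `mockResponse`, `newMockServer`, and `statusCodes`, the `/retrieve` path, and the response bodies are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// mockResponse is one canned reply: a status code and a raw body.
type mockResponse struct {
	status int
	body   string
}

// newMockServer replies with the canned responses in order, one per request.
// Unlike a fake server, it contains no TPP logic at all.
func newMockServer(responses []mockResponse) *httptest.Server {
	var mu sync.Mutex
	next := 0
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		if next >= len(responses) {
			http.Error(w, "no more canned responses", http.StatusTeapot)
			return
		}
		resp := responses[next]
		next++
		w.WriteHeader(resp.status)
		fmt.Fprint(w, resp.body)
	}))
}

// statusCodes sends n POST requests to the mock and collects the status codes.
func statusCodes(srv *httptest.Server, n int) ([]int, error) {
	var codes []int
	for i := 0; i < n; i++ {
		resp, err := srv.Client().Post(srv.URL+"/retrieve", "application/json", nil)
		if err != nil {
			return nil, err
		}
		resp.Body.Close()
		codes = append(codes, resp.StatusCode)
	}
	return codes, nil
}

func main() {
	srv := newMockServer([]mockResponse{{202, "pending"}, {500, "stuck"}, {200, "issued"}})
	defer srv.Close()
	fmt.Println(statusCodes(srv, 3))
}
```

The point of the design is that a test can ask for a 202 or a 500 at exactly the step where it needs one, which is hard to arrange with a live TPP.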

Manual test performed:

Apart from the "mock" tests that I added, I also manually tested this change with a live TPP instance:

Reproducing the manual test

I spun up a TPP instance on CloudShare (using my Jetstack account), using the template "Master NGINX/K8s Blueprint" (snapshot "[6] training_2020-08-28") that comes with TPP 20.1.
I went to https://uvo1dq3vmkecfydcwxm.env.cloudshare.com/aperture/application-integrations/ and created a new API integration, since none of the existing ones contains configuration:manage (which I need in order to set the correct policy).

I created a token and stored it in the env var TOKEN with the following:

TPP_URL=https://uvo1dq3vmkecfydcwxm.env.cloudshare.com
TPP_USER=tppadmin
TPP_PASSWORD=
TOKEN=$(vcert getcred -u $TPP_URL --username $TPP_USER --password $TPP_PASSWORD --client-id=test --scope='certificate:manage;configuration:manage' --format json | jq -r .access_token)

Since we need to trigger an enrollment failure at a stage higher than 0, we will purposefully break the CA. To do that, I RDP'ed into the TPP VM, opened the program certsrv, right-clicked on the only entry, chose "Manage CA...", and then clicked the "Stop" button.
I then enrolled:
```
vcert enroll -u "$TPP_URL" -t "$TOKEN" --cn example.com -z 'TLS/SSL' --san-dns=example.com
```
As expected, it shows:

vCert: 2022/11/14 14:58:38 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: Post CSR failed with error: Cannot connect to the certificate authority (CA). Verify that your CA template settings are correct and that the remote server is available. For more information, search the Help system for Configuring the Microsoft Certificate Services Template Object., Stage: 500.

If we try to submit the certificate again, we still get the error at stage 500 (weirdly, the original message has disappeared and a generic message appears instead):

vCert: 2022/11/14 14:59:43 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.

I then fixed the CA.
I then tried to enroll again, and the "old" error with the stage 500 still shows:

vCert: 2022/11/14 15:06:30 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\TLS/SSL\example.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 500.

Now, from this project's folder, let us run vcert with the changes made in this PR:

git fetch origin refs/pull/269/head:reset-before-request
git checkout reset-before-request
go run ./cmd/vcert enroll -u "$TPP_URL" -t "$TOKEN" --cn example.com -z 'TLS/SSL' --san-dns=example.com

This time, the certificate gets properly requested.

Changes in error messages: this PR purposefully doesn't change any of the error messages.

Possible breakages to other projects:

✅ terraform-provider-venafi uses RetrieveCertificate in three locations. One of them matches on the following specific error message:
```
"unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 400 Failed to lookup private key, error: Failed to lookup private key vault id"
```
I added a test proving that this same error message will be shown in the same circumstances.
✅ vault-pki-backend-venafi won't be affected since it doesn't parse error messages. It uses RetrieveCertificate once.
✅ cert-manager has a single call to RetrieveCertificate and doesn't look at the messages.

pkg/venafi/tpp/connector_test.go

inteon

Thank you for fixing the remaining issues.

rvelaVenafi

There are too many unnecessary calls to reset if it is done unconditionally.
Resetting based on the request response is better and no more complex than this PR.
Just move this code to a separate function and check the request response: reset only when statusCode == 500 and msg matches the expected JSON blob.
You can even wrap the current logic into another function with a boolean parameter:

func (c *Connector) RequestCertificate(req *certificate.Request) (requestID string, err error) {
	// First try, with reset enabled.
	requestID, err = c.RequestCertificateWithReset(req, true)
	if errors.Is(err, CertResetError) {
		// A reset was attempted: second try, with no reset.
		requestID, err = c.RequestCertificateWithReset(req, false)
	}
	return requestID, err
}

Then have the actual code be moved to a second internal function:

func (c *Connector) RequestCertificateWithReset(req *certificate.Request, reset bool) (requestID string, err error) {
	// Current RequestCertificate logic here.

	requestID, err = parseRequestResult(statusCode, status, body)
	if err != nil {
		if reset {
			// Invoke the reset logic here.
			if resetErr := resetCertificateResource(certDN, false); resetErr != nil {
				// An error happened during the reset.
				return "", resetErr
			}
			// Tell the caller that the request failed with a 500
			// and that a reset was attempted.
			return "", CertResetError
		}
		// Return the normal error.
		return "", err
	}
	return requestID, nil
}

This way we can minimize the number of extra calls required for this use case.

maelvls · 2022-11-21T09:59:06Z

> The logic to reset based on the request response is better and not more complex than this PR.

I forgot to mention in the PR description that it is not possible to use POST /request as a means of knowing whether this certificate should be reset.

The reason it is not possible is that POST /request always succeeds as long as the given CSR or certificate parameters are valid. Calling POST /request doesn't allow you to know whether POST /reset needs to be called or not. I have elaborated a bit on that in #269 (comment).

maelvls · 2022-11-24T11:33:19Z

I finished implementing Solution 3: request, then reset if retrieve returns “Click Retry” or “WebSDK CertRequest”. Please take another look.
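A minimal Go sketch of the decision in Solution 3, assuming the check is a plain substring match on the status text of a 500 response from retrieve. The function name `needsReset` and the `resetHints` list are hypothetical; the two substrings come from the error messages quoted in the PR description.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// resetHints are the substrings that, in this sketch, mark an enrollment
// as stuck. They are copied from the two error messages quoted in this PR.
var resetHints = []string{
	"click Retry",
	"WebSDK CertRequest",
}

// needsReset reports whether a retrieve response looks like a stuck
// enrollment that a reset could clear.
func needsReset(statusCode int, status string) bool {
	if statusCode != http.StatusInternalServerError {
		return false
	}
	for _, hint := range resetHints {
		if strings.Contains(status, hint) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsReset(500, "Status: WebSDK CertRequest Module Requested Certificate, Stage: 500."))
	fmt.Println(needsReset(500, "Status: Post CSR failed with error: Cannot connect to the certificate authority (CA)."))
}
```

With this rule, the first CA failure (the "Post CSR failed" message) is reported as-is, and only the generic follow-up messages trigger a reset.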

pkg/venafi/tpp/connector.go

When enrolling a new certificate, for example by submitting a user-provided CSR, people using cert-manager and other tools that automatically enroll certificates would get "stuck" with an error like:

500 Certificate \VED\Policy\Test\foo.com has encountered an error while processing, Status: This certificate cannot be processed while it is in an error state. Fix any errors, and then click Retry., Stage: 700.

(The "stage" number doesn't matter.) Running "vcert enroll" or forcing the re-issuance in cert-manager would not have the expected effect: the same error would show up over and over.

The initial idea was to do a systematic "reset" before requesting, but it was deemed too costly, as it would mean one more HTTP call on the happy path of "vcert enroll". The proposed solution/workaround is to do the reset in the RetrieveCertificate func instead. It isn't ideal, but it should not affect people since /retrieve and /reset share the same OAuth scope.
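The control flow described above can be sketched as follows. This is one plausible shape, not the code from this PR: the `steps` type, `errStuck`, `retrieveWithReset`, and `simulate` are hypothetical names, and retrying the retrieval exactly once after the reset is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errStuck stands in for the 500 "click Retry" / "WebSDK CertRequest" errors.
var errStuck = errors.New("enrollment stuck in an error state")

// steps holds the two HTTP calls as plain functions so that the flow can
// be exercised without a server.
type steps struct {
	retrieve func() (string, error)
	reset    func() error
}

// retrieveWithReset retrieves the certificate and resets the enrollment
// only when the retrieval reports a stuck enrollment. The happy path makes
// a single call.
func retrieveWithReset(s steps) (string, error) {
	cert, err := s.retrieve()
	if !errors.Is(err, errStuck) {
		return cert, err
	}
	if err := s.reset(); err != nil {
		return "", fmt.Errorf("reset: %w", err)
	}
	return s.retrieve()
}

// simulate runs the flow against an in-memory enrollment and returns the
// certificate together with a trace of the calls that were made.
func simulate(stuck bool) (string, string, error) {
	trace := ""
	s := steps{
		retrieve: func() (string, error) {
			trace += "retrieve;"
			if stuck {
				return "", errStuck
			}
			return "CERT", nil
		},
		reset: func() error {
			trace += "reset;"
			stuck = false
			return nil
		},
	}
	cert, err := retrieveWithReset(s)
	return cert, trace, err
}

func main() {
	fmt.Println(simulate(false))
	fmt.Println(simulate(true))
}
```

The trace shows the trade-off that motivated this design: no extra call when nothing is stuck, and two extra calls only when a reset is needed.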

luispresuelVenafi · 2022-11-18T22:48:25Z

pkg/venafi/tpp/connector.go

+	// | 400 {"Error":"CertificateDN error..."}  | Certificate DN not found   |
+	// | 400 {"Error":"No reset is required..."} | No enrollment to be reset  |
+	certDN := getCertificateDN(c.zone, req.Subject.CommonName)
+	statusCode, status, body, err := c.request("POST", urlResourceCertificateReset, struct {


@maelvls Why are we resetting the certificate before enrollment, within the same workflow? As it is, this makes an unnecessary call to TPP. It would be better, as other team members have already suggested, to put this call in the error-handling logic, so that it doesn't interrupt the workflow and is only made when the certificate has errored because it is in a bad state, e.g.:
pseudo code:

statusCode, err = request()
if err != nil && statusCode == 500 {
    resetCertificate() // reset certificate only on failure
}

maelvls (Dec 9, 2022): (from #269 (comment)) It is not possible to use POST /request as a means of knowing whether this certificate should be reset. The reason is that POST /request always succeeds as long as the given CSR or certificate parameters are valid. Calling POST /request doesn't allow you to know whether POST /reset needs to be called or not. I have elaborated a bit on this in #269 (comment).

maelvls · 2023-01-02T12:06:36Z

This PR made it into vcert v4.23.0. We can now close #239.

Update: this change was reverted in 4.24.0.

This reverts commit 240fe8f, reversing changes made to e607e30.

In Venafi#269, I was convinced that it would be a good solution to call "Reset" only if the "Retrieve" call returned a known message. Later on, we realized that there was a bad interaction between "Request" and "Reset(restart=true)". For some reason, when a problem arises (such as the CA being down), TPP returns the old certificate, and vcert ends up showing the message "unmatched key modulus". We realized that calling "Reset(restart=false)" before Request prevents this bug. Although that's one extra HTTP call, this call seems to be very inexpensive. One downside that was brought up during the PR Venafi#269 was that any extra HTTP call would slow the TPP server because HTTP calls are "queued" (not processed concurrently).
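A minimal Go sketch of the "reset before request" ordering described above. It assumes that the two 400 responses listed in the connector.go comment ("CertificateDN error..." and "No reset is required...") can be treated as harmless before a request; that interpretation, and the names `resetResult`, `checkReset`, and `enroll`, are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// resetResult is the status code and raw body returned by the reset call.
type resetResult struct {
	statusCode int
	body       string
}

// checkReset decides whether a reset response should stop the enrollment.
// Assumption: a 400 saying there is nothing to reset, or that the
// certificate DN does not exist yet, is fine before a request.
func checkReset(r resetResult) error {
	switch {
	case r.statusCode == http.StatusOK:
		return nil
	case r.statusCode == http.StatusBadRequest &&
		(strings.Contains(r.body, "No reset is required") ||
			strings.Contains(r.body, "CertificateDN error")):
		return nil
	default:
		return fmt.Errorf("unexpected reset response: %d %s", r.statusCode, r.body)
	}
}

// enroll always resets first (restart=false is implied by the reset func),
// then submits the request.
func enroll(reset func() resetResult, request func() (string, error)) (string, error) {
	if err := checkReset(reset()); err != nil {
		return "", err
	}
	return request()
}

func main() {
	id, err := enroll(
		func() resetResult { return resetResult{400, `{"Error":"No reset is required"}`} },
		func() (string, error) { return "request-id", nil },
	)
	fmt.Println(id, err)
}
```

This costs one extra HTTP call on every enrollment, which is the cost the commit message above weighs against the "unmatched key modulus" bug.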


maelvls · 2023-05-23T12:57:34Z

This improvement was reverted in 4.24.0 (and later versions). You can read #273 (comment) to learn more.

maelvls requested review from EduardoVV, luispresuelVenafi, marcos-albornoz and rvelaVenafi as code owners November 8, 2022 13:41

maelvls force-pushed the reset-before-request branch 2 times, most recently from e835061 to a4172b9 Compare November 10, 2022 11:53

inteon reviewed Nov 10, 2022

maelvls force-pushed the reset-before-request branch 5 times, most recently from 5b65301 to 1513047 Compare November 10, 2022 17:02

maelvls marked this pull request as draft November 15, 2022 09:05

maelvls force-pushed the reset-before-request branch from d521fe2 to cb8e5ee Compare November 15, 2022 10:39

maelvls marked this pull request as ready for review November 15, 2022 10:45

inteon approved these changes Nov 16, 2022

maelvls commented Nov 18, 2022

JigarAtVenafi requested changes Nov 18, 2022

rvelaVenafi requested changes Nov 18, 2022

maelvls force-pushed the reset-before-request branch from bcbe7c5 to 539fa7d Compare November 24, 2022 11:34

inteon reviewed Nov 24, 2022

inteon reviewed Nov 24, 2022

inteon reviewed Nov 24, 2022

maelvls force-pushed the reset-before-request branch 5 times, most recently from d6dbde9 to 0f35f33 Compare November 25, 2022 09:21

maelvls force-pushed the reset-before-request branch 2 times, most recently from 1fe5eec to 58d96cc Compare December 8, 2022 11:47

marcos-albornoz approved these changes Dec 9, 2022

maelvls force-pushed the reset-before-request branch from 58d96cc to 9e6e119 Compare December 9, 2022 16:20

luispresuelVenafi approved these changes Dec 9, 2022

JigarAtVenafi reviewed Dec 14, 2022

JigarAtVenafi approved these changes Dec 14, 2022

maelvls mentioned this pull request Dec 16, 2022

Need explanation of error messages Venafi/vault-pki-backend-venafi#96

Open

rvelaVenafi approved these changes Dec 16, 2022

luispresuelVenafi merged commit 240fe8f into Venafi:master Dec 16, 2022

This was referenced Jan 2, 2023

Provide the ability to recover a Certificate request from an Error state cert-manager/cert-manager#5274

Closed

Provide the ability to reset the certificate object in Venafi TPP #239

Open

maelvls deleted the reset-before-request branch January 2, 2023 12:06

inteon added a commit to jetstack/vcert that referenced this pull request Jan 20, 2023

Revert "Merge pull request Venafi#269 from maelvls/reset-before-request"

cb1e872

This reverts commit 240fe8f, reversing changes made to e607e30.

maelvls mentioned this pull request Jan 26, 2023

Reset before request in order to work around VCert's "unmatched key modulus" jetstack/vcert#3

Merged

inteon added a commit to inteon/vcert that referenced this pull request Feb 8, 2023

Revert "Merge pull request Venafi#269 from maelvls/reset-before-request"

85ea317

This reverts commit 240fe8f, reversing changes made to e607e30.

inteon mentioned this pull request Feb 8, 2023

Revert "Merge pull request #269 from maelvls/reset-before-request" #274

Merged

luispresuelVenafi added a commit that referenced this pull request Feb 9, 2023

Merge pull request #274 from inteon/revert_reset

df2a331

Revert "Merge pull request #269 from maelvls/reset-before-request"

eyalle pushed a commit to eyalle/vcert that referenced this pull request Feb 13, 2023

Revert "Merge pull request Venafi#269 from maelvls/reset-before-request"

c93f176

This reverts commit 240fe8f, reversing changes made to e607e30.

eyalle pushed a commit to eyalle/vcert that referenced this pull request Feb 13, 2023

Revert "Merge pull request Venafi#269 from maelvls/reset-before-request"

14cea78

This reverts commit 240fe8f, reversing changes made to e607e30.

eyalle pushed a commit to eyalle/vcert that referenced this pull request Feb 13, 2023

Revert "Merge pull request Venafi#269 from maelvls/reset-before-request"

88841df

This reverts commit 240fe8f, reversing changes made to e607e30.

inteon mentioned this pull request Feb 16, 2023

Use jetstack vcert fork to properly reset on TPP error cert-manager/cert-manager#5805

Merged

maelvls mentioned this pull request Feb 17, 2023

VCert's "auto-retry" feature (i.e., reset certificate if it is failed) causes a race condition in TPP, resulting in the error "unmatched key modulus" #273

Closed
