Skip to content

GEP-1742: HTTPRoute Timeouts API #1997

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Jun 26, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
81 changes: 80 additions & 1 deletion geps/gep-1742.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first read this I wondered why it only focused on HTTPRoute, but later I realized the original GEP specified that focus. It would be a good idea to change the title to "HTTPRoute Timeouts API", or else mention in Non-Goals that other route types are not covered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Title changed

Original file line number Diff line number Diff line change
Expand Up @@ -301,8 +301,87 @@ Could not find any HTTP specific timeouts. PRs welcomed. 😊

## API

TBD.
From the above analysis, it appears that most implementations are capable of supporting
the configuration of simple downstream request timeouts on HTTPRoute rules. This is a
relatively small addition (one new field) that would benefit many users.

Because request timeouts are very much related to retry attempts, we also propose
addressing issue: [#1731](https://github.com/kubernetes-sigs/gateway-api/issues/1731)
by adding another field for retry conifguration in an HTTPRoute rule. This would also
benefit many users and seems to be implementable by many (most?) Gateway implementations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify if the client request covers the entire connection from client to backend, or just the downstream request from client to gateway? There is also at least one tunnel timeout (HAProxy) and perhaps some other timeouts that cover the entire connection, not just the downstream and upstream halves of the connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a diagram in the "Timeout values" section that shows the relationships, but you need to look at the rendered markdown to see it. https://github.com/kubernetes-sigs/gateway-api/blob/7e20c3d25368b2c91d0952dfbd842f46e3324b36/geps/gep-1742.md#timeout-values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did see the diagram, yet still wanted to get the explanation in words and in the godoc of the implementation section for Request and BackendRequest. I asked about something similar in that section.

I don't see any timeout that would cover the entire transaction, unless Request does.

### GO

```go
type HTTPRouteRule struct {
// Timeout for HTTP requests, disabled by default.
Timeout *time.Duration `json:"timeout,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we expand the specification for this a bit? Given the GEP already documents probably 20 types of "timeout" in https://gateway-api.sigs.k8s.io/geps/gep-1742/?h=1742#flow-diagrams-with-available-timeouts, having just a single field as "timeout" with no further explanation seems a bit ambiguous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a paragraph above. PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a flat field, should this be tiered, so you can start off with a request timeout and add things like connect timeout in the future ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arkodg Good question. I think it depends on whether implementations support configuring other kinds of timeouts at the HTTPRoute level. Based on the diagrams in the previous sections, I think Envoy does not, but unclear about the others.

@youngnick do you have more insight on this?

Copy link
Contributor

@kflynn kflynn May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this GEP is actually defining two separate timeouts here... 🙂

  • HTTPRouteRule.timeout is defining an end-to-end timeout (a timeout for a single request as measured by the client), and
  • HTTPRouteRule.retries.perTryTimeout is defining a timeout for a single request as measured by the gateway rather than the client.

Both of these timeouts are useful! though, per the diagrams above, a given gateway may not be able to honor both.

Overall, I would prefer

HTTPRouteRule:
    timeouts:
        request: duration            # end-to-end as seen by the client
        workloadRequest: duration    # single request as seen by the gateway

with no perTryTimeout in the retry configuration. (And no, I'm not wedded to the names in the timeouts dict above.) I'd further suggest that the workloadRequest timeout not be part of the core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kflynn This suggestion sounds quite good to me. The reason that I dragged retries into this GEP in the first place is because perTryTimeout was so tightly coupled to retries, but moving it out and generalizing it like you suggest allows for better separation of concerns and to only focus on Timeouts in this GEP. I'm happy to make this change but will give others some time to comment first.

// Retry policy for HTTP requests.
Retries *HTTPRetry `json:"retries,omitempty"`
// ...
}

type HTTPRetry struct {
// Number of retries to be allowed for a given request.
Attempts int32 `json:"attempts,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have quite a few comments on retries, but I really think that the retry portion here needs to be pulled out into a different GEP. 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this

// Timeout per attempt for a given request, including the initial call and any retries.
// Default is the value of the HTTPRouteRule request Timeout.
PerTryTimeout *time.Duration `json:"per_try_timeout,omitempty"`
// Specifies the conditions under which retry takes place.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What conditions are valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explain below that HTTP error codes are valid and must be supported by implementations. Other implementation-specific identifier are also valid, depending on the implementation.

  1. Does that sound reasonable?
  2. Do you think I need to explain it more clearly in the below text?

// One or more conditions can be specified using a ‘,’ delimited list.
RetryOn string `json:"retry_on,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its a list it should be []string IMO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

// Flag to specify whether the retries should retry to other localities.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems under-specified given there is no mention of "localities" anywhere else in the repo. Can we expand on this a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another section below.

RetryRemoteLocalities *bool `json:"retry_remote_localities,omitempty"`
}
```

### YAML

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
name: timeout-example
spec:
...
rules:
- backendRefs:
- name: some-service
port: 8080
timeout: 10s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 503
```

### Timeout values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section should go before the API section, to further explain the timeouts that are being added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


Timeout and PerTryTimeout values are formatted as 1h/1m/1s/1ms and MUST BE >= 1ms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere else in gateway-api or core k8s? Is there a more formal spec?

is 1h5m valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any other use of duration in gateway-api. I couldn't find it in core k8s either, but maybe not looking in the right place? Otherwise, I think this is just the Go time.Duration string serialization that we'd use?


### Retry attempts

By default, the number of retry attempts is implementation dependent. An implementation MAY
provide external mechanisims for controlling the default retry behavior.

When a request Timeout in the HTTPRouteRule or PerTryTimeout value is configured, the actual
number of retries attempted may be reduced depending on the specified request Timeout and
PerTryTimeout values.

An implementation MUST ensure that the number of retries never exceeds the Attempts value,
if specified.

The interval between retries is implementation dependent but SHOULD be a minimum of 25ms.

### RetryOn Conditions

The value of an individual RetryOn condition can be either an HTTP status codes (e.g., 503)
or some other identifier known to the implementation.

An implementation MUST support any HTTP status code value for a RetryOn condition.

The default conditions that will trigger retries, if no RetryOn conditions are specified,
is implementation dependent.

## Alternatives

Expand Down