|
1 | | -# Cachew (pronounced cashew) is a super-fast pass-through cache |
| 1 | +# Cachew |
2 | 2 |
|
3 | | -Cachew is a server and tooling for incredibly efficient, protocol-aware caching. It is |
4 | | -designed to be used at scale, with minimal impact on upstream systems. By "protocol-aware", we mean that the proxy isn't |
5 | | -just a naive HTTP proxy, it is aware of the higher level protocol being proxied (Git, Docker, etc.) and can make more efficient decisions. |
| 3 | +Cachew (pronounced "cashew") is a tiered, protocol-aware, caching HTTP proxy for software engineering infrastructure. It understands higher-level protocols (Git, Docker, Go modules, etc.) and makes smarter caching decisions than a naive HTTP proxy. |
6 | 4 |
|
7 | | -## Git |
| 5 | +## Strategies |
8 | 6 |
|
9 | | -Git causes a number of problems for us, but the most obvious are: |
| 7 | +### Git |
10 | 8 |
|
11 | | -1. Rate limiting by service providers. |
12 | | -2. `git clone` is very slow, even discounting network overhead |
| 9 | +Caches Git repositories with two complementary techniques: |
13 | 10 |
|
14 | | -To solve this we apply two different strategies on the server: |
| 11 | +1. **Snapshots** — periodic `.tar.zst` archives that restore 4–5x faster than `git clone`. |
| 12 | +2. **Pack caching** — passthrough caching of packs from `git-upload-pack` for incremental pulls. |
15 | 13 |
|
16 | | -1. Periodic full `.tar.zst` snapshots of the repository. These snapshots restore 4-5x faster than `git clone`. |
17 | | -2. Passthrough caching of the packs returned by `POST /repo.git/git-upload-pack` to support incremental pulls. |
18 | | - |
19 | | -On the client we redirect git to the proxy: |
| 14 | +Redirect Git traffic through cachew: |
20 | 15 |
|
21 | 16 | ```ini |
22 | | -[url "https://cachew.local/github/"] |
| 17 | +[url "https://cachew.example.com/git/github.com/"] |
23 | 18 | insteadOf = https://github.com/ |
24 | 19 | ``` |
25 | 20 |
|
26 | | -As Git itself isn't aware of the snapshots, Git-specific code in the Cachew CLI can be used to reconstruct a repository. |
| 21 | +Restore a repository from a snapshot (with an automatic delta bundle to reach HEAD):
| 22 | + |
| 23 | +```sh |
| 24 | +cachew git restore https://github.com/org/repo ./repo |
| 25 | +``` |
| 26 | + |
| 27 | +```hcl |
| 28 | +git { |
| 29 | + snapshot-interval = "1h" |
| 30 | + repack-interval = "1h" |
| 31 | +} |
| 32 | +``` |
| 33 | + |
| 34 | +### GitHub Releases |
| 35 | + |
| 36 | +Caches public and private GitHub release assets. Private orgs use a token or GitHub App for authentication. |
| 37 | + |
| 38 | +**URL pattern:** `/github-releases/{owner}/{repo}/{tag}/{asset}` |
| 39 | + |
| 40 | +```hcl |
| 41 | +github-releases { |
| 42 | + token = "${GITHUB_TOKEN}" |
| 43 | + private-orgs = ["myorg"] |
| 44 | +} |
| 45 | +``` |
| 46 | + |
| 47 | +### Go Modules |
| 48 | + |
| 49 | +Go module proxy (`GOPROXY`-compatible). Private modules are fetched via git clone. |
| 50 | + |
| 51 | +**URL pattern:** `/gomod/...` |
| 52 | + |
| 53 | +```sh |
| 54 | +export GOPROXY=http://cachew.example.com/gomod,direct |
| 55 | +``` |
| 56 | + |
| 57 | +```hcl |
| 58 | +gomod { |
| 59 | + proxy = "https://proxy.golang.org" |
| 60 | + private-paths = ["github.com/myorg/*"] |
| 61 | +} |
| 62 | +``` |
| 63 | + |
| 64 | +### Hermit |
| 65 | + |
| 66 | +Caches [Hermit](https://cashapp.github.io/hermit/) package downloads. GitHub release URLs are automatically routed through the `github-releases` strategy. |
| 67 | + |
| 68 | +**URL pattern:** `/hermit/{host}/{path...}` |
| 69 | + |
| 70 | +```hcl |
| 71 | +hermit {} |
| 72 | +``` |
| 73 | + |
| 74 | +### Artifactory |
| 75 | + |
| 76 | +Caches artifacts from JFrog Artifactory with host-based or path-based routing. |
| 77 | + |
| 78 | +```hcl |
| 79 | +artifactory "example.jfrog.io" { |
| 80 | + target = "https://example.jfrog.io" |
| 81 | +} |
| 82 | +``` |
| 83 | + |
| 84 | +### Host |
| 85 | + |
| 86 | +Generic reverse-proxy caching for arbitrary HTTP hosts, with optional custom headers. |
| 87 | + |
| 88 | +```hcl |
| 89 | +host "https://ghcr.io" { |
| 90 | + headers = { |
| 91 | + "Authorization": "Bearer QQ==" |
| 92 | + } |
| 93 | +} |
| 94 | +
|
| 95 | +host "https://w3.org" {} |
| 96 | +``` |
| 97 | + |
| 98 | +### HTTP Proxy |
| 99 | + |
| 100 | +Caching proxy for clients that use absolute-form HTTP requests (e.g. Android `sdkmanager --proxy_host`). |
| 101 | + |
| 102 | +```hcl |
| 103 | +proxy {} |
| 104 | +``` |
| 105 | + |
| 106 | +## Cache Backends |
| 107 | + |
| 108 | +Multiple backends can be configured simultaneously — they are automatically combined into a tiered cache. Reads check each tier in order and backfill lower tiers on a hit. Writes go to all tiers in parallel. |
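
The read and write behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Cachew's implementation: `MemoryTier` and `TieredCache` are hypothetical names, writes are sequential rather than parallel, and "backfill lower tiers" is read here as "copy the value into the tiers that were checked first and missed", which is an assumption.

```python
from typing import Optional


class MemoryTier:
    """Hypothetical stand-in for one backend (memory, disk, S3, ...)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.store.get(key)

    def put(self, key: str, value: bytes) -> None:
        self.store[key] = value


class TieredCache:
    """Checks tiers in order on read; writes to every tier."""

    def __init__(self, tiers: list[MemoryTier]) -> None:
        self.tiers = tiers

    def get(self, key: str) -> Optional[bytes]:
        for index, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                # Assumption: backfill the tiers consulted before this one.
                for earlier in self.tiers[:index]:
                    earlier.put(key, value)
                return value
        return None

    def put(self, key: str, value: bytes) -> None:
        # Sequential here for simplicity; the text says writes run in parallel.
        for tier in self.tiers:
            tier.put(key, value)


memory, disk = MemoryTier("memory"), MemoryTier("disk")
cache = TieredCache([memory, disk])

disk.put("pack-123", b"data")      # present only in the second tier
first_read = cache.get("pack-123")  # hit on disk, then backfilled into memory
```

After the first read, a second read for the same key is served by the first tier without touching the slower one.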
| 109 | + |
| 110 | +### Memory |
| 111 | + |
| 112 | +In-memory LRU cache. |
| 113 | + |
| 114 | +```hcl |
| 115 | +memory { |
| 116 | + limit-mb = 1024 # default |
| 117 | + max-ttl = "1h" # default |
| 118 | +} |
| 119 | +``` |
| 120 | + |
| 121 | +### Disk |
| 122 | + |
| 123 | +On-disk LRU cache with TTL-based eviction. |
| 124 | + |
| 125 | +```hcl |
| 126 | +disk { |
| 127 | + limit-mb = 250000 |
| 128 | + max-ttl = "8h" |
| 129 | +} |
| 130 | +``` |
| 131 | + |
| 132 | +### S3 |
| 133 | + |
| 134 | +S3-compatible object storage (AWS S3, MinIO, etc.). |
| 135 | + |
| 136 | +```hcl |
| 137 | +s3 { |
| 138 | + bucket = "my-cache-bucket" |
| 139 | + endpoint = "s3.amazonaws.com" |
| 140 | + region = "us-east-1" |
| 141 | +} |
| 142 | +``` |
27 | 143 |
|
28 | 144 | ## Authorization (OPA) |
29 | 145 |
|
30 | | -Cachew uses [Open Policy Agent](https://www.openpolicyagent.org/) (OPA) for request authorization. A default policy is |
31 | | -always active even without any configuration, allowing any request from 127.0.0.1 and `GET` and `HEAD` requests from |
32 | | -elsewhere. |
| 146 | +Cachew uses [Open Policy Agent](https://www.openpolicyagent.org/) for request authorization. The default policy allows all methods from `127.0.0.1` and `GET`/`HEAD` from elsewhere. |
33 | 147 |
|
34 | | -To customise the policy, add an `opa` block to your configuration with either an inline policy or a path to a `.rego` file: |
| 148 | +Policies must be in `package cachew.authz` and define a `deny` rule set of human-readable reason strings. If the set is empty, the request is allowed; otherwise it is rejected and the reasons are returned to the client.
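
To make the "empty deny set means allow" rule concrete, here is a small Python model of the decision. It is only a sketch: the real evaluation happens in OPA, and `deny_reasons` and `authorize` are hypothetical helpers that mimic two example rules (a missing `authorization` header, and a `PUT` method).

```python
def deny_reasons(request: dict) -> set[str]:
    """Collect reason strings, the way a `deny` rule set would."""
    reasons: set[str] = set()
    headers = request.get("headers", {})
    # Mirrors `not input.headers["authorization"]`: fires when the key is absent.
    if "authorization" not in headers:
        reasons.add("unauthenticated")
    if request.get("method") == "PUT":
        reasons.add("writes not allowed")
    return reasons


def authorize(request: dict) -> tuple[bool, list[str]]:
    """Allow the request only when no deny reason was produced."""
    reasons = deny_reasons(request)
    return (len(reasons) == 0, sorted(reasons))


allowed, why = authorize({"method": "GET", "headers": {"authorization": "Bearer x"}})
rejected, why_not = authorize({"method": "PUT", "headers": {}})
```

Because every matching rule contributes its own reason, a single rejected request can report several reasons at once.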
35 | 149 |
|
36 | 150 | ```hcl |
37 | | -# Inline policy |
38 | 151 | opa { |
39 | 152 | policy = <<EOF |
40 | 153 | package cachew.authz |
41 | | - default allow := false |
42 | | - allow if input.method == "GET" |
43 | | - allow if input.method == "HEAD" |
44 | | - allow if { input.method == "POST"; input.path[0] == "api" } |
| 154 | + deny contains "unauthenticated" if not input.headers["authorization"] |
| 155 | + deny contains "writes not allowed" if input.method == "PUT" |
45 | 156 | EOF |
46 | 157 | } |
| 158 | +``` |
47 | 159 |
|
48 | | -# Or reference an external file |
| 160 | +Or reference an external file with optional data: |
| 161 | + |
| 162 | +```hcl |
49 | 163 | opa { |
50 | 164 | policy-file = "./policy.rego" |
| 165 | + data-file = "./opa-data.json" |
51 | 166 | } |
52 | 167 | ``` |
53 | 168 |
|
54 | | -Policies must be written under `package cachew.authz` and define a `deny` rule that collects human-readable reason strings. If the deny set is empty the request is allowed; otherwise it is rejected and the reasons are included in the response body and server logs. The input document available to policies contains: |
| 169 | +**Input fields:** `input.method`, `input.path` (string array), `input.headers`, `input.remote_addr` (includes port — use `startswith` to match by IP). |
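
Because `input.remote_addr` carries the port, a prefix match should include the trailing colon, otherwise `10.0.0.1` would also match `10.0.0.10`. The Python sketch below illustrates the pitfall and an alternative that parses the address. The helper names are hypothetical, and the bracketed IPv6 form (`[::1]:443`) is an assumption about how such addresses would appear.

```python
import ipaddress


def prefix_match(remote_addr: str, ip: str) -> bool:
    """Equivalent of `startswith(input.remote_addr, "<ip>:")` in a policy."""
    return remote_addr.startswith(ip + ":")


def client_ip(remote_addr: str):
    """Split `ip:port` (or `[ipv6]:port`) and parse the IP part."""
    host, _port = remote_addr.rsplit(":", 1)
    return ipaddress.ip_address(host.strip("[]"))


# Without the trailing colon, a naive prefix check is too permissive.
naive = "10.0.0.10:443".startswith("10.0.0.1")
strict = prefix_match("10.0.0.10:443", "10.0.0.1")

local = client_ip("127.0.0.1:5000").is_loopback
local_v6 = client_ip("[::1]:443").is_loopback
```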
55 | 170 |
|
56 | | -| Field | Type | Description | |
57 | | -|---|---|---| |
58 | | -| `input.method` | string | HTTP method (GET, POST, etc.) | |
59 | | -| `input.path` | []string | URL path split by `/` (e.g. `["api", "v1", "object"]`) | |
60 | | -| `input.headers` | map[string]string | Request headers (lowercased keys) | |
61 | | -| `input.remote_addr` | string | Client address (ip:port) | |
| 171 | +## GitHub App Authentication |
62 | 172 |
|
63 | | -Since `remote_addr` includes the port, use `startswith` to match by IP: |
| 173 | +For private Git repositories and GitHub release assets, configure a GitHub App: |
64 | 174 |
|
65 | | -```rego |
66 | | -deny contains "remote address not allowed" if not startswith(input.remote_addr, "127.0.0.1:") |
| 175 | +```hcl |
| 176 | +github-app { |
| 177 | + app-id = "12345" |
| 178 | + private-key-path = "./github-app.pem" |
| 179 | + installations = { "myorg": "67890" } |
| 180 | +} |
67 | 181 | ``` |
68 | 182 |
|
69 | | -Example policy that requires authentication and blocks writes: |
| 183 | +Installations can also be discovered dynamically via the GitHub API. |
70 | 184 |
|
71 | | -```rego |
72 | | -package cachew.authz |
73 | | -deny contains "unauthenticated" if not input.headers["authorization"] |
74 | | -deny contains "writes are not allowed" if input.method == "PUT" |
75 | | -deny contains "deletes are not allowed" if input.method == "DELETE" |
| 185 | +## CLI |
| 186 | + |
| 187 | +### Server (`cachewd`) |
| 188 | + |
| 189 | +```sh |
| 190 | +cachewd --config cachew.hcl |
| 191 | +cachewd --schema # print config schema |
76 | 192 | ``` |
77 | 193 |
|
78 | | -Policies can reference external data that becomes available as `data.*` in Rego. Provide it inline via `data` or from a file via `data-file`: |
| 194 | +### Client (`cachew`) |
| 195 | + |
| 196 | +```sh |
| 197 | +# Object operations |
| 198 | +cachew get <namespace> <key> [-o file] |
| 199 | +cachew put <namespace> <key> [file] [--ttl 1h] |
| 200 | +cachew stat <namespace> <key> |
| 201 | +cachew delete <namespace> <key> |
| 202 | +cachew namespaces |
| 203 | + |
| 204 | +# Directory snapshots |
| 205 | +cachew snapshot <namespace> <key> <directory> [--ttl 1h] [--exclude pattern] |
| 206 | +cachew restore <namespace> <key> <directory> |
| 207 | + |
| 208 | +# Git |
| 209 | +cachew git restore <repo-url> <directory> [--no-bundle] |
| 210 | +``` |
| 211 | + |
| 212 | +**Global flags:** `--url` (`CACHEW_URL`), `--authorization` (`CACHEW_AUTHORIZATION`), `--platform` (prefix keys with `os-arch`), `--daily`/`--hourly` (prefix keys with date). |
| 213 | + |
| 214 | +## Observability |
79 | 215 |
|
80 | 216 | ```hcl |
81 | | -# Inline JSON data |
82 | | -opa { |
83 | | - policy-file = "./policy.rego" |
84 | | - data = <<EOF |
85 | | - {"allowed_cidrs": ["10.0.0.0/8"], "jwks": {"keys": [...]}} |
86 | | - EOF |
| 217 | +log { |
| 218 | + level = "info" # debug, info, warn, error |
87 | 219 | } |
88 | 220 |
|
89 | | -# Or from a file |
90 | | -opa { |
91 | | - policy-file = "./policy.rego" |
92 | | - data-file = "./opa-data.json" |
| 221 | +metrics { |
| 222 | + service-name = "cachew" |
93 | 223 | } |
94 | 224 | ``` |
95 | 225 |
|
96 | | -```json |
97 | | -{"allowed_cidrs": ["10.0.0.0/8"], "jwks": {"keys": [...]}} |
98 | | -``` |
| 226 | +Admin endpoints: `/_liveness`, `/_readiness`, `PUT /admin/log/level`, `/admin/pprof/`. |
99 | 227 |
|
100 | | -```rego |
101 | | -package cachew.authz |
102 | | -deny contains "address not in allowed CIDR" if not net.cidr_contains(data.allowed_cidrs[_], input.remote_addr) |
103 | | -``` |
| 228 | +## Full Configuration Example |
104 | 229 |
|
105 | | -If `data-file` is not set, `data.*` is empty but policies can still use `http.send` to fetch data at evaluation time. |
| 230 | +```hcl |
| 231 | +state = "./state" |
| 232 | +bind = "0.0.0.0:8080" |
| 233 | +url = "http://cachew.example.com:8080/" |
106 | 234 |
|
107 | | -## Docker |
| 235 | +log { |
| 236 | + level = "info" |
| 237 | +} |
108 | 238 |
|
109 | | -## Hermit |
| 239 | +opa { |
| 240 | + policy = <<EOF |
| 241 | + package cachew.authz |
| 242 | + deny contains "not localhost" if not startswith(input.remote_addr, "127.0.0.1:") |
| 243 | + EOF |
| 244 | +} |
110 | 245 |
|
111 | | -Caches Hermit package downloads from all sources (golang.org, npm, GitHub releases, etc.). |
| 246 | +metrics {} |
112 | 247 |
|
113 | | -**URL pattern:** `/hermit/{host}/{path...}` |
| 248 | +github-app { |
| 249 | + app-id = "12345" |
| 250 | + private-key-path = "./github-app.pem" |
| 251 | +} |
| 252 | +
|
| 253 | +git-clone {} |
| 254 | +
|
| 255 | +git { |
| 256 | + snapshot-interval = "1h" |
| 257 | + repack-interval = "1h" |
| 258 | +} |
114 | 259 |
|
115 | | -Example: `GET /hermit/golang.org/dl/go1.21.0.tar.gz` |
| 260 | +github-releases { |
| 261 | + token = "${GITHUB_TOKEN}" |
| 262 | + private-orgs = ["myorg"] |
| 263 | +} |
| 264 | +
|
| 265 | +gomod { |
| 266 | + proxy = "https://proxy.golang.org" |
| 267 | + private-paths = ["github.com/myorg/*"] |
| 268 | +} |
116 | 269 |
|
117 | | -GitHub releases are automatically redirected to the `github-releases` strategy. |
| 270 | +hermit {} |
| 271 | +
|
| 272 | +host "https://ghcr.io" { |
| 273 | + headers = { |
| 274 | + "Authorization": "Bearer ${GHCR_TOKEN}" |
| 275 | + } |
| 276 | +} |
| 277 | +
|
| 278 | +disk { |
| 279 | + limit-mb = 250000 |
| 280 | + max-ttl = "8h" |
| 281 | +} |
| 282 | +
|
| 283 | +proxy {} |
| 284 | +``` |