Skip to content

fix(build): retry install.sh fetch on 429 with pipefail + jitter#708

Open
qing-ant wants to merge 2 commits intomainfrom
qing/download-cli-retry
Open

fix(build): retry install.sh fetch on 429 with pipefail + jitter#708
qing-ant wants to merge 2 commits intomainfrom
qing/download-cli-retry

Conversation

@qing-ant
Copy link
Contributor

Context

The v0.1.50 publish hit 429s on curl claude.ai/install.sh across multiple matrix jobs. The 5-target wheel matrix (since #661 added macos-15-intel) fetches install.sh in parallel, and curl's 429 was silently swallowed by | bash without pipefail — subprocess.run reported success but the CLI was never installed, surfacing later as "Could not find installed Claude CLI binary".

This has never completed a real publish since #661 — v0.1.48 (last green, 3/07) predates the 5th matrix target. Today's run required manual re-runs to get through.

Changes

  • curl --retry 5 --retry-delay 2 --retry-all-errors handles 429/5xx on the outer install.sh fetch
  • set -o pipefail propagates curl's exit through the pipe so check=True catches it
  • 0–5s random jitter staggers parallel matrix job starts
  • 3-attempt outer retry with 2^n backoff as a final safety net (covers failures inside install.sh too)

Not in this PR

  • Windows (irm | iex) has the same latent issue but hasn't bitten yet; can follow up if needed
  • test-examples streaming_mode.py all 120s timeout flake — separate issue

5 parallel matrix jobs fetching claude.ai/install.sh trip rate limits.
curl's 429 was swallowed by the pipe without pipefail, so subprocess.run
reported success but the binary was never installed.

- curl --retry 5 --retry-all-errors handles the outer fetch
- pipefail propagates curl exit through the pipe
- 0-5s jitter staggers parallel matrix starts
- 3-attempt outer retry with exponential backoff as a final safety net
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward retry/jitter fix for a documented CI flake.

Extended reasoning...

Overview

This PR modifies a single build script (scripts/download_cli.py) to add resilience against HTTP 429 rate-limiting when multiple CI matrix jobs fetch install.sh in parallel. The changes add curl-level retries, set -o pipefail to propagate curl failures through the pipe, a 0-5s random jitter before the first attempt, and a 3-attempt outer retry loop with exponential backoff.

Security risks

None. This is a build-time script that downloads from a fixed, trusted URL (claude.ai/install.sh). No new inputs, no user-facing code, no auth changes.

Level of scrutiny

Low scrutiny is appropriate. This is a CI/build script fix, not production code or library API. The retry pattern is standard and the motivation (documented 429 failures blocking publishes since #661) is clear.

Other factors

The jitter and retry loop correctly apply to both Windows and Unix code paths. The return on success inside the retry loop correctly exits download_cli(), allowing main() to proceed to copy_cli_to_bundle(). No bugs were found by the automated bug hunting system.

Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward build reliability fix adding retry/jitter to handle 429s during parallel CI.

Extended reasoning...

Overview

This PR modifies a single build script (scripts/download_cli.py) to add retry and jitter logic when downloading the Claude CLI via curl | bash. The changes include curl-level retries (--retry 5 --retry-delay 2 --retry-all-errors), set -o pipefail to properly propagate curl failures through the pipe, a 0–5s random jitter to stagger parallel matrix builds, and a 3-attempt outer retry loop with exponential backoff.

Security risks

None. This is a build-time script that fetches from a fixed URL (claude.ai/install.sh). No user input is interpolated into shell commands beyond the version string which comes from an environment variable already used in the existing code. The retry/jitter changes are purely additive resilience logic.

Level of scrutiny

Low scrutiny warranted. This is a CI/build script, not production application code. The change is well-motivated by a documented CI failure (429s from parallel matrix jobs), and the implementation uses standard, well-understood patterns (curl retries, pipefail, exponential backoff with jitter).

Other factors

The retry loop correctly applies to both Unix and Windows code paths. The error reporting is preserved (stdout/stderr from the last failed attempt are printed). The PR description is thorough and clearly explains the root cause and fix. No outstanding reviewer comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants