Add retry logic to checkout and submoduleUpdate for partial clone resilience #2392

Open
jasonwbarnett wants to merge 1 commit into actions:main from altana-ai:retry-checkout-for-partial-clones

Problem

When using partial clones (filter=blob:none, which is automatically set for sparse checkouts), git checkout lazily fetches missing blobs from the promisor remote. If the remote is temporarily unavailable, this network call fails with no retry:

/usr/bin/git checkout --progress --force -B <branch> <ref>
Error: fatal: unable to access 'https://github.com/<org>/<repo>/': Failed to connect to github.com port 443 after 135272 ms: Couldn't connect to server
Error: fatal: could not fetch <sha> from promisor remote
Error: The process '/usr/bin/git' failed with exit code 128

This was observed in production during a brief GitHub git service outage. The workflow used sparse-checkout, which triggers filter=blob:none, so checkout depended on network availability.

Root Cause

The fetch, getDefaultBranch, and lfsFetch methods in git-command-manager.ts already wrap their git calls with retryHelper.execute(), but checkout and submoduleUpdate do not, even though both perform network operations:

  • checkout: With partial clones, git lazily fetches missing blobs from the promisor remote during checkout
  • submoduleUpdate: Clones/fetches submodule repositories from their remotes

Fix

Wrap both checkout() and submoduleUpdate() with the existing retryHelper.execute() (3 attempts, 10-20s jittered backoff), consistent with how fetch() already handles transient failures.
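
The wrapping described above can be sketched roughly as follows. This is an illustrative stand-in, not the code from this PR's diff: `RetrySketch`, `checkoutWithRetry`, the `runGit` callback, and the injectable `sleep` parameter are hypothetical names introduced here, and the defaults (3 attempts, 10-20 second jittered delay) are taken only from the description above.

```typescript
// Hypothetical stand-in for the action's retry helper (names are assumptions).
class RetrySketch {
  private readonly maxAttempts: number
  private readonly minSeconds: number
  private readonly maxSeconds: number
  private readonly sleep: (ms: number) => Promise<void>

  constructor(
    maxAttempts = 3,
    minSeconds = 10,
    maxSeconds = 20,
    // Injectable so the sketch can be exercised without real waiting.
    sleep: (ms: number) => Promise<void> = ms =>
      new Promise<void>(resolve => setTimeout(resolve, ms))
  ) {
    if (minSeconds > maxSeconds) {
      throw new Error('minSeconds must not exceed maxSeconds')
    }
    this.maxAttempts = maxAttempts
    this.minSeconds = minSeconds
    this.maxSeconds = maxSeconds
    this.sleep = sleep
  }

  async execute<T>(action: () => Promise<T>): Promise<T> {
    // Every attempt except the last swallows the error and waits.
    for (let attempt = 1; attempt < this.maxAttempts; attempt++) {
      try {
        return await action()
      } catch (err) {
        console.log(`Attempt ${attempt} failed: ${String(err)}`)
      }
      // Jittered delay: uniform between minSeconds and maxSeconds.
      const seconds =
        this.minSeconds + Math.random() * (this.maxSeconds - this.minSeconds)
      await this.sleep(seconds * 1000)
    }
    // Final attempt: any error propagates to the caller.
    return await action()
  }
}

// Hypothetical wrapper: runGit stands in for the real git invocation.
async function checkoutWithRetry(
  runGit: (args: string[]) => Promise<void>,
  branch: string,
  ref: string,
  retry: RetrySketch = new RetrySketch()
): Promise<void> {
  const args = ['checkout', '--progress', '--force', '-B', branch, ref]
  await retry.execute(() => runGit(args))
}
```

With this shape, a promisor-remote fetch that fails during the first or second attempt is retried after a delay, and only a failure on the final attempt surfaces as the exit code 128 error shown in the Problem section. The same `execute` call would wrap the submodule update in the same way.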

Testing

  • All 96 existing tests pass (npm test)
  • Code formatted with prettier (npm run format)
  • dist/index.js rebuilt (npm run build)

…ilience

When using partial clones (filter=blob:none, which is automatically set
for sparse checkouts), `git checkout` lazily fetches missing blobs from
the promisor remote. If the remote is temporarily unavailable, this
network call fails and surfaces as a hard error with no retry.

The `fetch`, `getDefaultBranch`, and `lfsFetch` methods already use
retryHelper, but `checkout` and `submoduleUpdate` did not, despite both
performing network operations:

- `checkout`: fetches blobs on-demand from promisor remotes during
  partial clone checkouts
- `submoduleUpdate`: clones/fetches submodule repositories

This was observed in production when GitHub's git service had a brief
outage, causing the checkout step to fail with:

  fatal: unable to access '...': Failed to connect to github.com port
  443 after 135272 ms: Couldn't connect to server
  fatal: could not fetch <sha> from promisor remote

Wrapping both methods with the existing retryHelper (3 attempts with
10-20s jittered backoff) makes these operations resilient to transient
network failures, consistent with how fetch already behaves.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>