archive-it-client

Rust client for Archive-It's partner API and WASAPI.

Inspiration and examples have been drawn from:

Overview

There are three clients, each scoped to what its endpoints expose under that auth state:

use archive_it_client::{PageOpts, PartnerClient, PublicClient, WasapiClient, WebdataQuery};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let user = "user";
    let pass = "pass";

    // public — no auth, partner registry + public collections
    let public = PublicClient::new()?;
    let accounts = public.list_accounts(PageOpts::default()).await?;
    let collection = public.get_collection(2135).await?;

    // partner — auth scopes every call to your own account
    let partner = PartnerClient::new(user, pass)?;
    let me = partner.my_account().await?;
    let mine = partner.list_collections(PageOpts::default()).await?;

    // wasapi — WARC manifests for a collection
    let wasapi = WasapiClient::new(user, pass)?;
    let query = WebdataQuery {
        collection: Some(4472),
        ..Default::default()
    };
    let page = wasapi.list_webdata(&query).await?;

    Ok(())
}

Timeouts and retries (default: 30s, 3 attempts, 250ms exponential backoff; retries on 5xx, 429, timeouts, and connection errors) are configured via Config:

use std::time::Duration;
use archive_it_client::{Config, PartnerClient};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let user = "user";
    let pass = "pass";

    let mut cfg = Config::api();
    cfg.timeout = Duration::from_secs(10);
    cfg.max_attempts = 5;
    let client = PartnerClient::with_config(user, pass, cfg)?;

    Ok(())
}

Pagination

There are two options: streaming for transparent pagination, per-page methods for manual control. Streaming hides the offset/cursor bookkeeping for both API styles behind a uniform Stream<Item = Result<T, Error>>.

Streaming

Each list endpoint has a streaming variant. Pages are fetched lazily as items are pulled; dropping the stream stops mid-traversal:

use archive_it_client::{PartnerClient, PublicClient, WasapiClient, WebdataQuery};
use futures::TryStreamExt;  // for try_collect / try_next / try_filter / ...

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let user = "user";
    let pass = "pass";
    let public = PublicClient::new()?;
    let partner = PartnerClient::new(user, pass)?;
    let wasapi = WasapiClient::new(user, pass)?;

    let all: Vec<_> = public.accounts().try_collect().await?;
    let mine: Vec<_> = partner.collections().try_collect().await?;

    let query = WebdataQuery {
        collection: Some(4472),
        ..Default::default()
    };
    let mut files = Box::pin(wasapi.webdata(query));
    while let Some(file) = files.try_next().await? {
        // process one file at a time
    }

    Ok(())
}

The streaming methods are:

Client	Method
`PublicClient`	`accounts()`, `collections(account_id: Option<u64>)`
`PartnerClient`	`collections()`
`WasapiClient`	`webdata(query: WebdataQuery)`

Internally, PublicClient and PartnerClient streams fetch 100 items per request. WasapiClient defaults to page_size=50 unless you override it in WebdataQuery.

The streams expose the standard futures_core::Stream trait. To use the extension methods shown above (try_collect, try_next, try_filter, take, …) add a stream-utilities crate to your Cargo.toml:

[dependencies]
futures = "0.3"           # or tokio-stream = "0.1"

Per-page

When you want to control page size or read pagination metadata (WASAPI's count, next), use the lower-level methods:

use archive_it_client::{PageOpts, PublicClient, WasapiClient, WebdataQuery};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let public = PublicClient::new()?;
    let wasapi = WasapiClient::new("user", "pass")?;

    // /api — caller passes limit/offset, gets a Vec
    let batch = public
        .list_accounts(PageOpts { limit: Some(50), offset: Some(0) })
        .await?;

    // wasapi — server-driven cursor; follow `next` until exhausted
    let query = WebdataQuery {
        collection: Some(4472),
        ..Default::default()
    };
    let mut page = wasapi.list_webdata(&query).await?;
    println!("{} files total", page.count);
    loop {
        for file in &page.files { /* ... */ }
        match wasapi.list_webdata_next(&page).await? {
            Some(next) => page = next,
            None => break,
        }
    }

    Ok(())
}

Downloads

Two destinations: local filesystem and S3. Both skip the fetch when the destination already matches — by sha1 when WASAPI supplied one, otherwise by file size. Every download method returns a Stream of DownloadOutcome events — Progress / Downloaded / Skipped / Failed per file, plus StreamFailed for errors that occur before a file is available — so callers can render progress and react to failures uniformly, whether they're downloading one file or a whole collection.

use std::pin::pin;
use archive_it_client::{WasapiClient, WebdataQuery};
use futures::{StreamExt, TryStreamExt};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let wasapi = WasapiClient::new("user", "pass")?;

    // single file → ./out.warc.gz, with progress events
    let file = pin!(wasapi.webdata(WebdataQuery {
        collection: Some(4472),
        page_size: Some(1),
        ..Default::default()
    }))
    .try_next().await?.ok_or("empty")?;
    let mut single = pin!(wasapi.download(file, "./out.warc.gz"));
    while let Some(outcome) = single.next().await {
        println!("{outcome}");
    }

    // whole collection → ./warcs, also a stream of outcomes per file
    let query = WebdataQuery { collection: Some(4472), ..Default::default() };
    let mut stream = pin!(wasapi.download_collection(query, "./warcs"));
    while let Some(outcome) = stream.next().await {
        println!("{outcome}");
    }
    Ok(())
}

Local downloads use a <filename>.part sidecar so an interrupted run resumes on the next invocation.

S3

WasapiClient::download_to_s3 and download_collection_to_s3 accept a pre-built aws_sdk_s3::Client, so credentials, region, and HTTP wiring stay under your control. Multipart upload is driven internally with server-side crc64nvme as the at-rest integrity contract; sha1 (when supplied by WASAPI) is recorded as user metadata so subsequent runs can skip on match.

The S3 principal needs s3:GetObject, s3:ListBucket, s3:PutObject, and s3:AbortMultipartUpload on the target.

Examples

Runnable examples live under examples/:

# no auth — public partner registry
cargo run --example public

# partner API — needs ARCHIVE_IT_USERNAME/ARCHIVE_IT_PASSWORD set
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass cargo run --example partner

# wasapi — needs ARCHIVE_IT_USERNAME/ARCHIVE_IT_PASSWORD set
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass cargo run --example wasapi

# inventory every WARC exposed by WASAPI into ./warcs.csv
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass cargo run --example warcs_inventory

# tally total WARC bytes across every collection on the account
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass cargo run --example count_bytes

# download a collection to ./warcs (resumes via .part sidecars)
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass cargo run --example download_collection

# upload one WARC to S3 (uses standard AWS provider chain for creds)
ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass S3_BUCKET=my-bucket \
    cargo run --example download_s3

The authenticated examples fail fast if ARCHIVE_IT_USERNAME or ARCHIVE_IT_PASSWORD is unset.

Fixtures

JSON fixtures under fixtures/ are generated by fixtures.sh. It requires ARCHIVE_IT_USERNAME and ARCHIVE_IT_PASSWORD (Archive-It partner credentials) to be set:

ARCHIVE_IT_USERNAME=user ARCHIVE_IT_PASSWORD=pass ./fixtures.sh

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
.github		.github
examples		examples
fixtures		fixtures
src		src
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Makefile		Makefile
README.md		README.md
fakeify.jq		fakeify.jq
fixtures.sh		fixtures.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

archive-it-client

Overview

Pagination

Streaming

Per-page

Downloads

S3

Examples

Fixtures

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

archive-it-client

Overview

Pagination

Streaming

Per-page

Downloads

S3

Examples

Fixtures

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages