arxivhaiku

Heroku-haikunator-style two-word identifiers backed by a vetted, fixed-size wordlist.

arxivhaiku pairs an adjective with a noun to produce friendly aliases like frosty-meadow, alpine-pixel, or gentle-eagle. Words are 4–7 letters, lowercase a–z, drawn from two curated lists shipped with the package.

The two pool sizes are chosen so the alias space is exactly equivalent to a 5-character Crockford Base32 token — a clean bijection between three forms of the same ID:

Form               Example        Range
canonical integer  1234567        0 … 33_554_431 (25 bits)
Crockford token    15NM7          5 chars, alphabet 0-9A-Z\{ILOU}
alias              alpine-pixel   4–7 letter adj + - + 4–7 letter noun

2^12 adjectives  ×  2^13 nouns  =  2^25 aliases  =  32^5 Crockford tokens
   4,096               8,192      33,554,432

The repo ships both a Python package and a TypeScript package backed by the same wordlist files. Use whichever fits your stack.

Why arxivhaiku?

You have an integer ID space (database row IDs, content hashes, allocated counters) and want to surface it to humans as something memorable instead of 12345 or K3X9P.

Existing Heroku-haikunator implementations (Ruby, JS, Python) ship a few dozen adjectives and nouns. That is fine for "give me a unique-looking app name", but the alias space is tiny (~4,000 distinct pairs) and the lists were chosen ad hoc. arxivhaiku provides:

  • A larger, audited pool. 4,096 adjectives × 8,192 nouns = 33.5M unique aliases, drawn from 11 source wordlists (Heroku, haikunatorjs, Docker moby, Wordle, EFF, BIP-39, SCOWL substitute, Project-Gutenberg-derived POS data) and filtered against profanity (LDNOOBW), brand names, demonyms, biology genera, proper nouns, plurals, comparatives, and more.
  • A clean bijection. Pool sizes are powers of two so every canonical integer maps to exactly one alias with no gaps or cycle-walking. The same 25-bit integer is also exactly 5 Crockford Base32 characters.
  • A reproducible, auditable build. 10 idempotent pipeline scripts produce the lists from raw sources. Every dropped word is logged with its drop reason in docs/BLOCKLIST.md. The final pair-audit flag rate is 0.038% (19 in 50,000 random pairs, all cross-boundary substring false positives).
  • Versioning rules for production. Words can never be removed once aliases are issued (immutability), only deprecated. v2 wordlists must be supersets of v1. See docs/EXTENSION.md.

Install (Python)

pip install -e .

Python 3.10 or newer. No runtime dependencies beyond the standard library — the wordlists ship with the package and the bijection is pure integer arithmetic.

(Build dependencies — nltk, pandas, jellyfish, requests — are only needed if you want to rebuild the wordlists from source. See Reproducing the build.)

Install (TypeScript / JavaScript)

Web app integration? Jump straight to docs/WEBAPP.md for the full Next.js + Drizzle integration guide (Server Actions, Route Handlers, middleware, Edge runtime, ID allocation strategies, common pitfalls).

Install directly from GitHub — no npm publish required. The TS package lives at the repo root and ships its built dist/ so consumers get a ready-to-use ESM + CJS module with full type declarations.

# pin to a tagged release (recommended for production)
pnpm add github:SamHarrison/arxivhaiku#semver:^1.0.2

# or pin to a specific commit
pnpm add github:SamHarrison/arxivhaiku#63542b6

# or track master (development only)
pnpm add github:SamHarrison/arxivhaiku

(Same syntax works with npm add or yarn add. The #semver: selector matches the latest tag satisfying the SemVer range.)

Updating later:

# move to a newer tag
pnpm up arxivhaiku

Runtime targets supported: Node ≥ 18, Vercel Edge, browsers, Deno, Bun. No filesystem I/O — wordlists are inlined into the bundle (~150KB ESM, gzips to ~25KB). No runtime dependencies.

The TS package shares its adjectives.txt and nouns.txt with the Python package — a CI check fails if src/wordlists.generated.ts drifts from the canonical wordlist files, so the two packages can never get out of sync.

import { haiku, encode, decode, encodeCrockford, decodeCrockford } from "arxivhaiku";

haiku();                          // 'frosty-meadow'
encode(1234567);                  // 'alpine-pixel'
decode("alpine-pixel");           // 1234567
encodeCrockford(1234567);         // '15NM7'
decodeCrockford("15NM7");         // 1234567

// Class-based, optionally seeded (NOT for production IDs — predictable PRNG)
import { Haikunator } from "arxivhaiku";
const h = new Haikunator({ seed: 42 });
h.haikunate();                    // reproducible with this seed

// Verify shipped wordlist version
import { ADJECTIVES_SHA256, NOUNS_SHA256, VERSION } from "arxivhaiku";
console.log(VERSION, ADJECTIVES_SHA256, NOUNS_SHA256);

Next.js usage notes

For a complete integration guide — including Drizzle schema, Server Actions, Route Handlers, middleware, ID allocation strategies, runtime considerations, and a full reference Next.js example — see docs/WEBAPP.md.

TL;DR pattern:

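// app/items/[alias]/page.tsx
import { decode, InvalidAliasError } from "arxivhaiku";
import { eq } from "drizzle-orm";
import { notFound } from "next/navigation";
// db, items, and ItemView are app-specific; import them from your own modules.
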
export default async function Page({ params }: { params: Promise<{ alias: string }> }) {
  const { alias } = await params;
  let id: number;
  try { id = decode(alias); }
  catch (e) { if (e instanceof InvalidAliasError) notFound(); throw e; }
  const item = await db.query.items.findFirst({ where: eq(items.id, id) });
  if (!item) notFound();
  return <ItemView item={item} />;
}

Key points:

  • Codec is sync, deterministic, zero I/O — works in Server Components, Server Actions, Route Handlers, middleware, or Client Components.
  • For Edge runtime, add export const runtime = 'edge' to the route file.
  • Store the canonical as INTEGER (4 bytes); reconstruct the alias on read via encode(row.id). See Storage and URL patterns.

Quick start

from arxivhaiku import haiku, encode, decode

haiku()                # 'frosty-meadow' — random, cryptographic-grade entropy
encode(1234567)        # 'alpine-pixel'  — integer → alias
decode('alpine-pixel') # 1234567          — alias → integer (round-trips)

Or from the shell:

$ arxivhaiku
gentle-eagle

$ arxivhaiku gen -n 3
plumy-doodad
sleepy-panda
brave-otter

Library API

from arxivhaiku import (
    haiku, encode, decode,
    Haikunator,
    ADJ_BITS, NOUN_BITS, CANON_BITS, CANON_CHARS,
    InvalidAliasError, InvalidCanonicalError,
    list_adjectives, list_nouns,
)
from arxivhaiku.codec import (
    encode_crockford, decode_crockford,
    ADJ_COUNT, NOUN_COUNT, CANON_MAX,
)

haiku(*, separator="-", rng=None) → str

Return a uniformly-random alias. Uses secrets.SystemRandom by default — cryptographic-grade entropy suitable for production IDs. Pass an alternate rng (any object with a randrange(n) method) for deterministic output.

haiku()                  # 'frosty-meadow'
haiku(separator="_")     # 'frosty_meadow'

haiku() is not deduplicated across calls. Two calls can collide. For uniqueness, store issued aliases and reject duplicates, or allocate canonicals from a counter and encode() them.
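The "store and reject duplicates" approach can be sketched roughly as follows. This is a minimal illustration, not the package's own code: issue_unique_canonical and the in-memory set are hypothetical stand-ins for whatever store tracks issued IDs (in practice, probably a database unique constraint). The integer it returns would then be passed to encode().

```python
import secrets

CANON_BITS = 25                  # 12 adjective bits + 13 noun bits
SPACE = 1 << CANON_BITS          # 33_554_432 possible canonicals


def issue_unique_canonical(issued: set[int], max_attempts: int = 100) -> int:
    """Draw a random canonical that is not already in `issued`.

    `issued` is a hypothetical stand-in for a persistent store of issued IDs.
    """
    for _ in range(max_attempts):
        candidate = secrets.randbelow(SPACE)   # uniform in [0, SPACE)
        if candidate not in issued:
            issued.add(candidate)
            return candidate
    raise RuntimeError("no free canonical found; the space may be nearly full")


issued: set[int] = set()
batch = [issue_unique_canonical(issued) for _ in range(1000)]
```

Retrying is cheap while the space is sparsely used; as it fills up, allocating canonicals from a counter avoids the retry loop entirely.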

encode(canonical: int, *, separator="-") → str

Convert a canonical integer to its alias. The integer must be in [0, 33_554_431]. The math:

adj_index  = canonical >> 13              # top 12 bits
noun_index = canonical & 0x1FFF           # bottom 13 bits
alias      = adjectives[adj_index] + "-" + nouns[noun_index]

encode(0)            # 'aaronic-aalii'    (first canonical)
encode(1234567)      # 'alpine-pixel'
encode(0xABC123)     # 'fangled-apnea'    (hex input is fine)
encode(33_554_431)   # 'zoning-zoril'     (last canonical)
encode(-1)           # InvalidCanonicalError
encode(33_554_432)   # InvalidCanonicalError
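The index arithmetic can be tried on its own, without the wordlists. The sketch below only covers the integer side; split_canonical and join_indices are illustrative names, not part of the package API, and ValueError stands in for InvalidCanonicalError.

```python
ADJ_BITS, NOUN_BITS = 12, 13
NOUN_MASK = (1 << NOUN_BITS) - 1                 # 0x1FFF
CANON_MAX = (1 << (ADJ_BITS + NOUN_BITS)) - 1    # 33_554_431


def split_canonical(canonical: int) -> tuple[int, int]:
    """Return (adj_index, noun_index) for a canonical in [0, CANON_MAX]."""
    if not isinstance(canonical, int) or not 0 <= canonical <= CANON_MAX:
        raise ValueError(f"canonical out of range: {canonical!r}")
    return canonical >> NOUN_BITS, canonical & NOUN_MASK


def join_indices(adj_index: int, noun_index: int) -> int:
    """Inverse of split_canonical: rebuild the canonical from both indices."""
    return (adj_index << NOUN_BITS) | noun_index


print(split_canonical(1234567))    # (150, 5767)
print(join_indices(150, 5767))     # 1234567
```

So 1234567 selects adjective index 150 and noun index 5767, which is why the alias depends only on the two sorted wordlists.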

decode(alias: str, *, separator="-") → int

Inverse of encode. Splits on the separator, looks up each word's index in the sorted wordlists (O(1) via dicts built at module import), and reconstructs the integer: (adj_index << 13) | noun_index.

decode('alpine-pixel') # 1234567
decode('alpine_pixel', separator='_')
decode('xxxxx-eagle')  # InvalidAliasError: unknown adjective
decode('brave-xxxxx')  # InvalidAliasError: unknown noun
decode('nodash')       # InvalidAliasError: malformed

decode(encode(c)) == c for every valid canonical c. Tested over 1024 samples spanning the full 25-bit space (tests/test_codec.py:TestBijection).

encode_crockford(canonical: int) → str / decode_crockford(token: str) → int

The same canonical integer rendered as a 5-character Crockford Base32 token. Different surface form, same identity.

from arxivhaiku.codec import encode_crockford, decode_crockford

encode_crockford(1234567)       # '15NM7'
decode_crockford('15NM7')       # 1234567
decode_crockford('15nm7')       # 1234567  (case-insensitive)
decode_crockford('15NM7-')      # 1234567  (hyphens stripped)
decode_crockford('I5NM7')       # 1234567  (Crockford normalization: I → 1)
decode_crockford('L5NM7')       # 1234567  (L → 1)
decode_crockford('O5NM7')       # 185991   (O → 0, so this reads as '05NM7')

The Crockford alphabet is 0123456789ABCDEFGHJKMNPQRSTVWXYZ: 32 characters, with I, L, O, U excluded. The decoder normalizes I and L to 1 and O to 0, so look-alike characters in handwritten tokens resolve to a single value.
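For readers who want to see the mechanics, here is a small standalone sketch of a fixed-width Crockford Base32 codec with the same normalization rules. It is an illustration, not the package's implementation: to_crockford and from_crockford are hypothetical names, and this version does not enforce the 5-character length on decode.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"     # no I, L, O, U
DECODE_MAP = {ch: i for i, ch in enumerate(ALPHABET)}
DECODE_MAP.update({"I": 1, "L": 1, "O": 0})       # look-alike normalization


def to_crockford(canonical: int, width: int = 5) -> str:
    """Render an integer as a zero-padded Crockford token of `width` chars."""
    if not 0 <= canonical < 32 ** width:
        raise ValueError("canonical out of range")
    chars = []
    for _ in range(width):
        canonical, digit = divmod(canonical, 32)  # peel off 5 bits at a time
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def from_crockford(token: str) -> int:
    """Parse a token, ignoring case and hyphens; U and other chars are rejected."""
    value = 0
    for ch in token.upper().replace("-", ""):
        if ch not in DECODE_MAP:
            raise ValueError(f"invalid Crockford character: {ch!r}")
        value = value * 32 + DECODE_MAP[ch]
    return value


print(to_crockford(1234567))      # 15NM7
print(from_crockford("i5nm7"))    # 1234567
```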

Haikunator class

A stateful, optionally seeded generator. Useful for tests where you want reproducible aliases.

from arxivhaiku import Haikunator

h = Haikunator(seed=42)
h.haikunate()    # always 'mucky-cither' with this seed
h.haikunate()    # always 'notal-pint'   (next call)

# Same seed → same sequence
h2 = Haikunator(seed=42)
assert h2.haikunate() == 'mucky-cither'

Do not use Haikunator(seed=...) for production IDs. The random.Random PRNG it uses is predictable. For production, use haiku() (or Haikunator() with no seed, which falls back to secrets.SystemRandom).
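The difference between the two RNG sources can be seen without the package. The sketch below assumes only that the generator draws through a randrange(n) method, as described for haiku(); draw_canonical is a hypothetical helper.

```python
import random
import secrets

SPACE = 1 << 25   # size of the 25-bit canonical space


def draw_canonical(rng) -> int:
    """Draw one canonical from any object exposing randrange(n)."""
    return rng.randrange(SPACE)


# Two generators with the same seed replay the same sequence: good for tests,
# and exactly why a seeded PRNG is unsuitable for production IDs.
seeded_a = random.Random(42)
seeded_b = random.Random(42)
run_a = [draw_canonical(seeded_a) for _ in range(3)]
run_b = [draw_canonical(seeded_b) for _ in range(3)]
same_sequence = run_a == run_b          # True

# SystemRandom reads from the OS entropy source and cannot be seeded.
unpredictable = draw_canonical(secrets.SystemRandom())
```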

Errors

  • InvalidCanonicalError — canonical integer out of range or not an int.
  • InvalidAliasError — alias string malformed, or adj/noun not in pool.

Constants

Name         Value        Meaning
ADJ_BITS     12           bits encoding the adjective index
NOUN_BITS    13           bits encoding the noun index
CANON_BITS   25           total canonical bits (12 + 13)
ADJ_COUNT    4,096        adjectives in pool
NOUN_COUNT   8,192        nouns in pool
CANON_MAX    33,554,431   largest valid canonical
CANON_CHARS  '0123…XYZ'   Crockford alphabet

CLI

Installed as arxivhaiku (also runnable via python -m arxivhaiku).

$ arxivhaiku                        # one alias (default subcommand: gen)
gentle-eagle

$ arxivhaiku gen -n 5               # five aliases
plumy-doodad
sleepy-panda
frosty-meadow
brave-otter
silver-comet

$ arxivhaiku gen --sep _            # underscore separator
sleepy_panda

$ arxivhaiku encode 1234567         # int → alias
alpine-pixel

$ arxivhaiku encode 0xABC123        # hex int → alias
fangled-apnea

$ arxivhaiku encode 15NM7           # Crockford → alias
alpine-pixel

$ arxivhaiku encode --crockford-out 42
aaronic-acari    0001A    42

$ arxivhaiku decode alpine-pixel    # alias → all three forms
1234567    0x012d687    15NM7

The bijection

Three interchangeable representations of the same 25-bit identifier:

            ┌─────────────────────────────┐
            │     25-bit integer          │
            │  canonical ∈ [0, 33_554_431] │
            └──────────────┬──────────────┘
              ▲            │              ▲
              │            ▼              │
   encode/    │   ┌────────────────┐      │  encode_crockford /
   decode     │   │  alias string  │      │  decode_crockford
              │   │ 'alpine-pixel' │      │
              │   └────────────────┘      │
              │                           │
              └───────────────────────────┘
                       ┌──────────┐
                       │ Crockford│
                       │  '15NM7' │
                       └──────────┘

The integer is the truth. Aliases and Crockford tokens are presentation forms. Convert freely:

canonical = 1234567
alias     = encode(canonical)              # 'alpine-pixel'
token     = encode_crockford(canonical)    # '15NM7'

assert decode(alias)            == canonical
assert decode_crockford(token)  == canonical

Why exactly 4,096 × 8,192?

Power-of-two pool sizes make the encoding a single bit-shift:

adj_index  = canonical >> 13   # top 12 bits  (2^12 = 4096)
noun_index = canonical & 0x1FFF # bottom 13 bits (2^13 = 8192)

If pools were, say, 4,000 and 8,000, they would cover only 32,000,000 of the 33,554,432 integers in [0, 2^25). The remaining values would point to nonexistent indices, and you'd need extra logic (modulo, cycle-walking) to skip them. Power-of-two pools fill the entire 25-bit space with no gaps and let the encoding be a pure arithmetic operation.

The sizes were also chosen to land on a single Crockford-base32 boundary: 25 bits = exactly 5 chars (since 32 = 2⁵). So a Crockford token is also a direct re-encoding of the canonical with no padding or wasted bits.
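A quick back-of-the-envelope check of the gap argument, as a sketch: unreachable_count is an illustrative helper, and the 4,000 × 8,000 pools are the hypothetical alternative discussed above, not anything the package ships.

```python
CANON_SPACE = 1 << 25      # 33_554_432 integers representable in 25 bits


def unreachable_count(adj_pool: int, noun_pool: int) -> int:
    """Count 25-bit integers left without an alias for the given pool sizes."""
    return CANON_SPACE - adj_pool * noun_pool


print(unreachable_count(4000, 8000))   # 1554432 integers with no alias
print(unreachable_count(4096, 8192))   # 0: the pools tile the space exactly

# With an 8,000-noun pool, the largest 25-bit integer would need adjective
# index 4194, which does not exist in a 4,000-adjective pool.
adj_index, noun_index = divmod(CANON_SPACE - 1, 8000)
print(adj_index, noun_index)           # 4194 2431

# 25 bits is also exactly five base-32 digits.
print(32 ** 5 == CANON_SPACE)          # True
```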

How the wordlists were built

The full narrative is in docs/PROCESS.md. Briefly:

Step  Script                 Purpose
1     01_acquire.py          Fetch 11 source wordlists (Heroku, moby, Wordle, EFF, BIP-39, SimpleWordlists, dwyl, LDNOOBW). Record URL + SHA-256 + license.
2     02_pos_tag.py          Union + dedup, then POS-tag every candidate via WordNet + reference-list assertion.
3     03_length_filter.py    Keep 4–7 letter words with adj or noun POS.
4     04_quality_filter.py   Drop profanity, brands, plurals, past-tense/gerunds, comparatives, body/medical, demonyms, biology genera, proper nouns.
5     05_phonetic.py         Compute Metaphone code per word.
6     06_tone_score.py       Score playfulness (adj) and concreteness (noun) using WordNet lexnames + reference-list overlap.
7     07_select.py           Pool-assign ambiguous candidates; sort by source-quality tier + tone score; phonetic dedup; select top 4,096 / 8,192.
8     08_pair_audit.py       Audit 10,000 random pairs for profanity substrings and slur formation.
9     09_self_review.py      Hand-curated removal list (in lieu of synchronous human review; see docs/TONE.md).
10    10_finalize.py         Apply removals, backfill from overflow, sort alphabetically, emit SHA-256.

Every step writes an intermediate TSV/TXT to data/ so the build is auditable end-to-end. Every dropped word is logged in docs/BLOCKLIST.md with reason.

The final 50,000-pair audit flagged 19 pairs (0.038%), all of which are cross-boundary substring false positives (e.g., furious + catcall → furiouscatcall, which contains scat at the join). No standalone problem words remain.
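To make "cross-boundary substring" concrete, here is a rough sketch of how such a match could be detected. It is not the logic of scripts/08_pair_audit.py, which this README does not show; cross_boundary_hits is a hypothetical helper and the one-word blocklist is a placeholder for LDNOOBW.

```python
def cross_boundary_hits(adjective: str, noun: str, blocklist: set[str]) -> list[str]:
    """Return blocklist terms with an occurrence that straddles the word join."""
    joined = adjective + noun
    boundary = len(adjective)            # index where the noun starts
    hits = []
    for term in sorted(blocklist):
        start = joined.find(term)
        while start != -1:
            end = start + len(term)
            if start < boundary < end:   # match begins in adj, ends in noun
                hits.append(term)
                break
            start = joined.find(term, start + 1)
    return hits


placeholder_blocklist = {"scat"}         # stand-in for the real profanity list
print(cross_boundary_hits("furious", "catcall", placeholder_blocklist))  # ['scat']
print(cross_boundary_hits("frosty", "meadow", placeholder_blocklist))    # []
```

A term that sits entirely inside one word is not reported here, since the per-word filters are meant to catch that case.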

Storage and URL patterns

The package doesn't prescribe how you store IDs — but the bijection makes several patterns clean.

Recommended (database column = canonical int):

CREATE TABLE items (
  id INTEGER PRIMARY KEY,  -- canonical, 0..33_554_431
  ...
);

4 bytes per row, indexable, sortable. Display via encode(row['id']). This is the most common pattern.
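A minimal sketch of the integer-column pattern using the standard-library sqlite3 module. The CHECK constraint, the title column, and the in-memory database are illustrative additions, not something the package requires; on read, the id would be passed to encode() to produce the alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE items (
        id    INTEGER PRIMARY KEY CHECK (id BETWEEN 0 AND 33554431),
        title TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO items (id, title) VALUES (?, ?)", (1234567, "example"))

# Values outside the 25-bit canonical range are rejected by the database.
try:
    conn.execute("INSERT INTO items (id, title) VALUES (?, ?)", (33554432, "too big"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row = conn.execute("SELECT id, title FROM items WHERE id = ?", (1234567,)).fetchone()
print(row, rejected)    # (1234567, 'example') True
```

Enforcing the range in the schema means a bad ID fails at write time rather than surfacing later as an InvalidCanonicalError when rendering.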

Alternative (column = Crockford string):

CREATE TABLE items (
  id CHAR(5) PRIMARY KEY,  -- Crockford token, always uppercase, 5 chars
  ...
);

Slightly larger (5 bytes vs 4), but human-shareable as the storage form itself. Useful if you're integrating with systems that already use short-token IDs (ULID, KSUID — though those are longer).

Not recommended (column = alias string):

Variable length (9–15 chars: two 4–7 letter words plus a separator), more index storage, slower B-tree lookups, and ties your DB schema to a specific wordlist version. If you want to retire a word later, you'd need to migrate every row that contains it. Keep the alias as a derived presentation form, not the storage key.

URLs:

/items/alpine-pixel      ← friendly, shareable, memorable
/items/15NM7             ← compact, suitable for SMS / QR
/items/1234567           ← discouraged: leaks row counts

Pick one canonical URL format. The dynamic route can decode() the alias to find the row, with a clean 404 path on InvalidAliasError:

# Flask / FastAPI / etc.
try:
    canonical = decode(alias)
except InvalidAliasError:
    abort(404)
row = db.fetch_one("SELECT * FROM items WHERE id = ?", canonical)

Quality and safety

The lists have been:

  • Filtered against LDNOOBW profanity (exact match on individual words; substring scan on pair concatenations).
  • Filtered against a curated demonym/religion list (indian, french, hindu, klan, etc.) to avoid producing demographically charged aliases.
  • Filtered against brand names (apple, tesla, oracle, nike, etc.) so the system doesn't trip trademark concerns.
  • Filtered against body parts, medical jargon, and biology genus names (liver, aortal, psylla, arundo).
  • Filtered against archaic/dialect words (dreich, couthy, ugsome) to keep tone playful and modern.
  • Phonetically de-duplicated within each pool (Metaphone + Damerau-Levenshtein ≤ 1) so confusable pairs like gold/cold don't both appear.
  • Pair-audited at 50,000 random pairs with 0.038% flag rate — all flags are cross-boundary substring false positives.

The shipped adjectives.txt and nouns.txt are SHA-256-pinned in docs/CHANGELOG.md. Verify before deploying:

sha256sum adjectives.txt nouns.txt
# adjectives.txt:  ffec07d411421cb6e47c6311c2d7d77dddbc23c7d8bce2da926bea2d432df992
# nouns.txt:       e3a439f20fe4bff99ebd51e13170b69d017460c6c55d20481726e0a755c06c3c

The SHA-256 of the shipped wordlists is also available at runtime as the ADJECTIVES_SHA256 and NOUNS_SHA256 exports in both packages, so you can verify wordlist identity in deploy-time smoke tests.
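The same check can be done from Python, for instance in a deploy-time smoke test. This is a sketch using only hashlib: sha256_of_file and verify_wordlist are hypothetical helpers, and the expected digest would come from docs/CHANGELOG.md or the ADJECTIVES_SHA256 / NOUNS_SHA256 exports.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wordlist(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a pinned value (case-insensitive hex)."""
    return sha256_of_file(path) == expected_hex.lower()
```

Usage would look like verify_wordlist(Path("adjectives.txt"), pinned_digest), failing the deploy if it returns False.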

No filter is perfect (see Honest caveats). If your application surfaces aliases to users in safety-critical contexts, run your own LDNOOBW substring screen at issue time.

Versioning and immutability

Once you issue aliases from a given wordlist version in production, you can never remove a word from that version. A production alias is a foreign key. Removing the underlying word would orphan every alias that referenced it. See docs/EXTENSION.md for the full rules:

  • Removals happen only via a separate deprecated.txt overlay that the resolver still recognizes (so old aliases continue to work) but the generator refuses to use (so no new aliases get the deprecated word).
  • v2 wordlists must be supersets of v1 — every v1 word at the same index. New words append.
  • The bijection math expands naturally: 6-char Crockford = 30 bits, which could be a 16,384 × 65,536 pool (14 + 16 bits) preserving every v1 alias's index.
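The first two rules lend themselves to simple automated checks. The sketch below is an illustration under stated assumptions, not the contents of docs/EXTENSION.md: both helper names are hypothetical, and the three-word lists are toy data.

```python
def preserves_v1_indices(v1: list[str], v2: list[str]) -> bool:
    """True if every v1 word sits at the same index in v2 (v2 only appends)."""
    return len(v2) >= len(v1) and v2[: len(v1)] == v1


def usable_for_generation(pool: list[str], deprecated: set[str]) -> list[str]:
    """Words the generator may still pick; the resolver keeps the full pool."""
    return [word for word in pool if word not in deprecated]


v1 = ["brave", "frosty", "gentle"]
v2 = v1 + ["alpine"]                     # appended, not re-sorted

print(preserves_v1_indices(v1, v2))            # True
print(preserves_v1_indices(v1, sorted(v2)))    # False: re-sorting shifts indices

# A deprecated word keeps its index for decoding but is never generated again.
print(usable_for_generation(v2, {"frosty"}))   # ['brave', 'gentle', 'alpine']
print(v2.index("frosty"))                      # 1
```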

Honest caveats

  • ~1,299 of the 4,096 adjectives lack a strict WordNet adjective synset. They appear in independently-curated adjective sources (simple_adjectives, Heroku haikunator, haikunatorjs) which we trust as POS evidence. WordNet alone yields only ~3,250 4–7 letter adjective synsets — short of the 4,096 target. Documented in docs/PROCESS.md and docs/TONE.md.
  • A small fraction of random pairs (~0.04%) produce cross-boundary substring matches with profanity (e.g., furious + catcall → furiouscatcall, which contains scat). These don't form recognizable words but the substring exists if the alias is character-grepped. Applications in safety-critical contexts should LDNOOBW-screen final aliases.
  • Tone is subjective. The Heroku-style playful aesthetic was calibrated by sampling. See docs/TONE.md for the specific calls made and the rationale.
  • Length expanded to 4–7 letters during build (the original spec called for 4–6) to enable stricter quality filters. Documented in docs/CHANGELOG.md §"Length constraint deviation".
  • Step 9 (human review) was performed programmatically in lieu of a synchronous human reviewer, per explicit user instruction. The HARD_REMOVE list in scripts/09_self_review.py is the audit trail. A future human reviewer should re-audit before promoting to a release used in compliance-sensitive contexts.

Project layout

adjectives.txt               ← canonical wordlist (4,096 lines)
nouns.txt                    ← canonical wordlist (8,192 lines)
                               — the single source of truth, shared
                                 by both packages.

# Python package
arxivhaiku/                  ← Python package source
  __init__.py                  public API
  codec.py                     bijection + RNG
  __main__.py                  CLI
  data/                        bundled copy (auto-synced)
pyproject.toml               ← Python build manifest
tests/test_codec.py          ← 27 Python unit tests

# TypeScript package
src/                         ← TS package source
  index.ts                     public API re-exports
  codec.ts                     bijection + RNG
  wordlists.generated.ts       ← AUTO-GENERATED from .txt files (committed)
dist/                        ← built ESM + CJS + .d.ts (gitignored)
package.json                 ← npm/pnpm manifest
tsconfig.json
tsup.config.ts
vitest.config.ts
scripts/gen-wordlists.mjs    ← regenerates wordlists.generated.ts
test/codec.test.ts           ← 38 Vitest tests (mirrors Python suite)

# Build pipeline (Python)
scripts/
  01_acquire.py                ← idempotent pipeline scripts;
  02_pos_tag.py                  re-run any time to verify the build.
  03_length_filter.py
  04_quality_filter.py
  05_phonetic.py
  06_tone_score.py
  07_select.py
  08_pair_audit.py
  09_self_review.py
  10_finalize.py
  quality_gates.py             ← 14 acceptance checks

# Pipeline data (committed for audit)
data/
  raw/                         downloaded source wordlists + SOURCES.md
  02_*.tsv … 10_sha256.txt     per-step outputs

# Documentation
docs/
  WEBAPP.md                    web app integration guide (Next.js+Drizzle)
  PROCESS.md                   build pipeline narrative
  SOURCES.md                   input wordlists + licenses
  STATISTICS.md                final pool characteristics
  BLOCKLIST.md                 every dropped word + reason (~7.5K entries)
  TONE.md                      subjective calls + reasoning
  EXTENSION.md                 immutability + v2 rules
  CHANGELOG.md                 release notes + SHA-256 pins

# Repo-level
.github/workflows/ci.yml     ← Python + TS tests on every push/PR;
                               fails if wordlists.generated.ts is stale.
LICENSE / NOTICE             ← MIT + source-wordlist attribution
CLAUDE.md                    ← original spec / build prompt

Reproducing the build

The wordlists are committed and SHA-256-pinned, so you don't need to rebuild to use either package. But the build is fully reproducible:

Rebuilding the wordlists (Python pipeline)

# 1. install build deps (runtime needs none of these)
pip install requests jellyfish pandas nltk
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4'); nltk.download('brown', quiet=True); nltk.download('universal_tagset', quiet=True)"

# 2. run the pipeline (each step is idempotent; re-running is safe)
python scripts/01_acquire.py        # downloads raw sources to data/raw/
python scripts/02_pos_tag.py
python scripts/03_length_filter.py
python scripts/04_quality_filter.py
python scripts/05_phonetic.py
python scripts/06_tone_score.py
python scripts/07_select.py
python scripts/08_pair_audit.py
python scripts/09_self_review.py
python scripts/10_finalize.py       # writes adjectives.txt + nouns.txt

# 3. verify
python scripts/quality_gates.py     # 14 acceptance checks
python -m unittest discover tests   # 27 unit tests

The build is deterministic up to WordNet version and library versions. The exact SHA-256 of the shipped files is pinned in docs/CHANGELOG.md.

Rebuilding the TS package

pnpm install
pnpm run gen       # regenerates src/wordlists.generated.ts from .txt files
pnpm run build     # tsc + tsup → dist/ (ESM + CJS + .d.ts)
pnpm test          # 38 Vitest tests

CI runs both rebuilds on every push and fails if src/wordlists.generated.ts would diff after pnpm run gen — the mechanism that prevents the Python wordlists and the TS module from silently drifting apart.

License

MIT — see LICENSE.

The shipped wordlists are derived from open-source inputs under MIT, CC-BY 3.0/4.0, Apache-2.0, BSD-2-Clause, public domain, and Unlicense terms. Per-source attribution is in docs/SOURCES.md.
