✨ Wisp

A production-quality, self-hosted GitHub Copilot alternative.

Runs entirely offline on your CPU using llama.cpp, a Node.js middleware, and a VS Code extension.

⚡ Quick Start (Automated Installation)

The easiest way to get Wisp running is to use the provided one-click installation scripts. They download and compile llama.cpp, fetch the recommended language model, and build the middleware and VS Code extension.

🐧 Linux & 🍎 macOS

chmod +x install.sh
./install.sh

🪟 Windows (Experimental)

Open PowerShell and run:

.\install.ps1

Note: The installation scripts place the DeepSeek-Coder 6.7B Q4_K_M model in the ./models directory and assume the standard toolchain is already installed (git, cmake, python, npm).

Once installation completes, start the Wisp backend services:

./start.sh

(On Windows, if no start.ps1 is available, start the llama.cpp server and the Node middleware manually as described under Manual Setup below.)

🛠️ Manual Setup

If you prefer to set up Wisp manually instead of using the automated scripts, follow these steps.

1. Setup — `llama.cpp`

Build:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_NATIVE=ON
cmake --build build --config Release -j$(nproc)   # on macOS, use -j$(sysctl -n hw.ncpu)

Download Model (DeepSeek-Coder 6.7B Q4_K_M recommended):

pip install huggingface_hub
~/.local/bin/huggingface-cli download \
  TheBloke/deepseek-coder-6.7B-instruct-GGUF \
  deepseek-coder-6.7b-instruct.Q4_K_M.gguf \
  --local-dir ./models

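Alternatives: starcoder2-3b-Q4_K_M.gguf (smaller and faster) or codellama-7b.Q4_K_M.gguf. If you switch models, set MODEL_FAMILY to match (see the middleware environment variables below).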

Start the Server:

./build/bin/llama-server \
  -m ./models/deepseek-coder-6.7b-instruct.Q4_K_M.gguf \
  -c 4096 \
  --threads 8 \
  --parallel 4 \
  --port 8080 \
  --host 127.0.0.1 \
  --batch-size 512 \
  --mlock \
  --no-mmap \
  --log-disable

2. Setup — Middleware

cd middleware
npm install
npm run dev          # development with hot reload
# OR
npm run build && npm start   # production

Environment variables (optional):

| Variable | Default | Description |
| --- | --- | --- |
| `LLAMA_URL` | `http://127.0.0.1:8080` | llama.cpp server URL |
| `MODEL_FAMILY` | `deepseek` | Prompt format: `deepseek`, `codellama`, `starcoder`, `generic` |
| `PORT` | `3000` | Middleware listening port |
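
As an illustration of how these variables and defaults could be resolved at startup, here is a minimal sketch. The names `WispConfig` and `loadConfig` are hypothetical and are not taken from the middleware source; only the variable names, defaults, and allowed `MODEL_FAMILY` values come from the table above.

```typescript
// Hypothetical config loader; the real middleware may structure this differently.
type ModelFamily = "deepseek" | "codellama" | "starcoder" | "generic";

interface WispConfig {
  llamaUrl: string;
  modelFamily: ModelFamily;
  port: number;
}

const MODEL_FAMILIES: readonly string[] = ["deepseek", "codellama", "starcoder", "generic"];

// `env` is passed in (instead of reading process.env inside) so the function is easy to test.
function loadConfig(env: Record<string, string | undefined>): WispConfig {
  const family = env["MODEL_FAMILY"] ?? "deepseek";
  if (MODEL_FAMILIES.indexOf(family) === -1) {
    throw new Error(`Unsupported MODEL_FAMILY: ${family}`);
  }

  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }

  return {
    llamaUrl: env["LLAMA_URL"] ?? "http://127.0.0.1:8080",
    modelFamily: family as ModelFamily,
    port,
  };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`, so that a mistyped value fails at startup and not on the first completion request.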

3. Setup — VS Code Extension

cd vscode-extension
npm install
npm run compile

To test, press F5 in VS Code to launch the Extension Development Host. To install the extension instead, run `vsce package` to build a .vsix bundle, then `code --install-extension <name>.vsix`.

🏗️ Architecture

flowchart TD
    vs_code["💻 VS Code Editor\n(InlineCompletionItemProvider)"]
    middleware["⚙️ Node.js Middleware (:3000)\n(Context Trimmer & Prompt Builder)"]
    llama_cpp["🧠 llama.cpp HTTP Server (:8080)\n(DeepSeek-Coder GGUF Model)"]
    
    vs_code -- "POST /complete (SSE stream)" --> middleware
    middleware -- "POST /completion (SSE stream)" --> llama_cpp
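
When streaming is enabled, llama.cpp's `/completion` endpoint sends server-sent events: each event is a `data: ` line carrying a JSON object, followed by a blank line. The sketch below shows one way the middleware could split buffered stream text into complete events. The function name `parseSseBuffer` and the `CompletionChunk` type are hypothetical, and the code assumes `\n` line endings and reads only the `content` and `stop` fields.

```typescript
// The two fields this sketch reads from each llama.cpp stream event.
interface CompletionChunk {
  content: string;
  stop: boolean;
}

// Splits buffered SSE text into complete events. Any trailing partial event is
// returned in `rest` so the caller can prepend it to the next network chunk.
function parseSseBuffer(buffer: string): { chunks: CompletionChunk[]; rest: string } {
  const events = buffer.split("\n\n");
  const rest = events.pop() ?? ""; // last piece is incomplete (or empty)
  const chunks: CompletionChunk[] = [];

  for (const event of events) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data: ")) continue; // skip comments and other SSE fields
      const payload = JSON.parse(line.slice("data: ".length));
      chunks.push({
        content: String(payload.content ?? ""),
        stop: Boolean(payload.stop),
      });
    }
  }
  return { chunks, rest };
}
```

Carrying `rest` forward matters because a network read can end in the middle of an event; parsing only complete events avoids `JSON.parse` failures on truncated payloads.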

🧠 Prompt Engineering Details

Why FIM (Fill-In-the-Middle)?

Standard left-to-right generation sees only the context before the cursor. FIM gives the model both the prefix and the suffix, so it:

Knows what comes after the cursor and won't duplicate it.
Produces completions that fit naturally into the surrounding code.
Uses the same technique as GitHub Copilot and Cursor.

DeepSeek-Coder FIM format:

<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>
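
A minimal sketch of how a prompt builder could assemble this string. The function name `buildFimPrompt` is hypothetical and is not taken from promptBuilder.ts. The DeepSeek case mirrors the format above; the StarCoder case uses that model family's published FIM tokens and is included only for comparison.

```typescript
// Model families covered by this sketch (a subset of the MODEL_FAMILY values).
type FimFamily = "deepseek" | "starcoder";

// Wraps the text before and after the cursor in the family's FIM sentinel tokens.
// The model is expected to generate the missing middle after the final token.
function buildFimPrompt(family: FimFamily, prefix: string, suffix: string): string {
  switch (family) {
    case "deepseek":
      return `<｜fim▁begin｜>${prefix}<｜fim▁hole｜>${suffix}<｜fim▁end｜>`;
    case "starcoder":
      return `<fim_prefix>${prefix}<fim_suffix>${suffix}<fim_middle>`;
  }
}
```

For example, `buildFimPrompt("deepseek", "function add(a, b) {\n  return ", "\n}")` asks the model for the expression between `return ` and the closing brace.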

Generation Settings

Temperature 0.15: Code completion is precision-driven. Greedy decoding (temp=0.0) can get stuck in repetition loops; 0.15 adds just enough diversity to escape them without producing hallucinated syntax.
Max Tokens 64: An inline completion should finish one logical unit (a function call argument, an if block). 64 tokens (about 48 words by the usual 0.75-words-per-token estimate for English text) is usually enough. Longer generations raise latency by 3-5x and are rarely accepted in full.
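
These settings map onto fields of the JSON body that llama.cpp's `/completion` endpoint accepts. The sketch below is an assumption about how the middleware might build that body; `buildCompletionRequest` and `CompletionRequest` are hypothetical names, while `n_predict`, `temperature`, `stream`, and `cache_prompt` are llama.cpp server request fields.

```typescript
// Subset of llama.cpp /completion request fields that this sketch sets.
interface CompletionRequest {
  prompt: string;
  n_predict: number;     // maximum number of tokens to generate
  temperature: number;   // sampling temperature
  stream: boolean;       // stream tokens back as they are generated
  cache_prompt: boolean; // let the server reuse its KV cache for a shared prefix
}

// Builds a request body using the generation settings described above.
function buildCompletionRequest(prompt: string): CompletionRequest {
  return {
    prompt,
    n_predict: 64,
    temperature: 0.15,
    stream: true,
    cache_prompt: true,
  };
}
```

The resulting object would be serialized with `JSON.stringify` and sent as the body of a `POST` to `/completion` on the llama.cpp server.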

🚀 Performance Tuning

Context trimming (trimmer.ts): The prompt is capped at 100 lines of prefix and 30 lines of suffix, because every extra token adds latency.
Debouncing (extension.ts): A 300 ms delay prevents a completion request from firing on every keystroke.
Prompt caching (cache.ts & llama.cpp): Sending `cache_prompt: true` with each llama.cpp request lets the server reuse its KV cache for a repeated prompt prefix. This is the single biggest latency win.
Thread tuning: Set `--threads` to the number of physical cores, not hardware threads (e.g., 4 on a 4-core/8-thread CPU).
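
The context trimming and middleware-side caching described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of trimmer.ts or cache.ts: the names `trimContext` and `LruCache` are hypothetical, the line limits default to the values quoted above, and keying the cache by prompt string is an assumption.

```typescript
// Keeps the last `maxPrefixLines` lines before the cursor and the first
// `maxSuffixLines` lines after it.
function trimContext(
  prefix: string,
  suffix: string,
  maxPrefixLines = 100,
  maxSuffixLines = 30,
): { prefix: string; suffix: string } {
  const prefixLines = prefix.split("\n");
  const suffixLines = suffix.split("\n");
  return {
    // slice(-0) would return everything, so a zero limit is handled explicitly.
    prefix: maxPrefixLines > 0 ? prefixLines.slice(-maxPrefixLines).join("\n") : "",
    suffix: suffixLines.slice(0, Math.max(0, maxSuffixLines)).join("\n"),
  };
}

// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}
```

Trimming the prefix from its end and the suffix from its start keeps the lines nearest the cursor, which are the most relevant to the completion. A cache like this lets the middleware return a stored completion for a prompt it has already seen without contacting llama.cpp.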

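| Model Size | Quantization | Time to First Token | Full 64-Token Completion |
| --- | --- | --- | --- |
| 3B | Q4_K_M | ~80 ms | ~400 ms |
| 6.7B | Q4_K_M | ~180 ms | ~900 ms |
| 7B | Q4_K_M | ~200 ms | ~1000 ms |
| 33B | Q4_K_M | ~800 ms | ~4500 ms |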

📂 Project Structure

wisp/
├── install.sh                       ← Automated Linux/macOS setup
├── install.ps1                      ← Automated Windows setup
├── start.sh                         ← Service starter script
├── README.md                        ← This documentation
├── middleware/
│   ├── src/
│   │   ├── index.ts                 ← Express server, routing
│   │   ├── completer.ts             ← llama.cpp SSE client
│   │   ├── promptBuilder.ts         ← FIM prompt construction
│   │   ├── trimmer.ts               ← Context window limiter
│   │   └── cache.ts                 ← LRU prompt cache
└── vscode-extension/
    └── src/
        └── extension.ts             ← Provides autocomplete UI logic

🤝 Contributing

Contributions are welcome! Wisp is an open-source project that still needs a lot of help to reach its full potential, and it can be far better than it is today.

Whether you are fixing bugs, optimizing inference performance, or adding new features, all contributions are appreciated. Please review the Contributing Guidelines (CONTRIBUTING.md) before submitting, keep your PRs tidy, and attach proof that your changes work.
