
cli: add option to connect to server via http(s) #21674

Draft
pwilkin wants to merge 2 commits into ggml-org:master from pwilkin:llama-cli-remote

Conversation

@pwilkin
Member

@pwilkin pwilkin commented Apr 9, 2026

Overview

Adds an --endpoint option to connect to an existing server instance.

Additional information

In many cases, people want to run a llama-server for various uses but also might want a quick test UI in cases where they cannot access the WebUI (i.e. pure console / terminal environments). Since llama-cli spawns a separate server instance, you cannot run both in VRAM-constrained environments, so having the option to run llama-cli with a llama-server endpoint seems desirable.
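The intended workflow, sketched below with assumed invocation syntax (the `--endpoint` flag is the one this PR adds; model path and port are placeholders):

```shell
# terminal 1: run a single server instance that owns the VRAM
llama-server -m model.gguf --port 8080

# terminal 2: attach the CLI to it instead of spawning a second server
llama-cli --endpoint http://localhost:8080
```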

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, although GLM 5.1 generated code with a goto in it, so I had to double-check.

@pwilkin pwilkin requested review from a team and ngxson as code owners April 9, 2026 11:53
Contributor

@ngxson ngxson left a comment


IMO I'm not comfortable with this change. It adds too much for a feature that no one has ever asked for (via an issue)

If you really need this, you could write your own CLI in a higher-level language like Python or Node.js

Comment on lines +12 to +37
struct cli_backend {
    virtual ~cli_backend() = default;

    // model / server info
    virtual std::string get_model_name() const = 0;
    virtual bool has_vision() const = 0;
    virtual bool has_audio() const = 0;
    virtual std::string get_build_info() const = 0;

    // chat completion (streaming), returns assistant content text
    virtual std::string generate_completion(
        const json & messages,
        const common_params & params,
        bool verbose_prompt,
        result_timings & out_timings) = 0;

    // load a local text file, return its contents (empty string on failure)
    virtual std::string load_text_file(const std::string & fname) = 0;

    // load a local media file, return the OAI content part JSON for it
    // returns empty JSON object on failure
    virtual json load_media_file(const std::string & fname) = 0;

    // cleanup
    virtual void terminate() = 0;
};
Contributor


I imagine this will double the effort each time someone adds a new feature to the CLI

Not a wise choice for long-term maintenance. The CLI should support either the native API or the remote API, but not both

Member Author


To be honest, I feel that having the remote API as the only one would be the better option: it would add interoperability, make it simpler to implement the MCP / command-execution features, and remove the need to maintain a separate code path for accessing the server. And all it would take to retain the current behavior of launching the client and the server together is a simple wrapper.

Putting this up for consideration and converting this to draft for now.

@pwilkin pwilkin marked this pull request as draft April 9, 2026 16:29
@pwilkin
Member Author

pwilkin commented Apr 9, 2026

@ngxson since we don't want to maintain two API paths, what do you think of a prototype that does the following:

  • removes the cli-specific path
  • migrates everything to the http path
  • if run without --endpoint, launches a server on a random port that is shut down as soon as the CLI exits, mimicking the previous CLI behavior?

@ngxson
Contributor

ngxson commented Apr 9, 2026

Honestly I don't have a strong opinion on whether the CLI should use the native API, the HTTP API, or another IPC mechanism like a unix socket. However, since most LLM CLIs use the HTTP API under the hood, I agree that it may be better in the long term to go with that for llama-cli.

I do have 2 concerns though:

  1. Currently, the CLI acts as an example of how easy it is to use llama.cpp as an external library (via bindings, without being an HTTP server). If we move the CLI away from this, we still need to add an example of doing so (though it can be much more basic than the CLI)
  2. If we use HTTP for the CLI, we should no longer link the CLI against libserver. The consequence is that the CLI must either spawn a llama-server instance, or llama-server must run as a daemon.

For point (1), no action is needed from your side; I will eventually implement it (which goes back to the idea of the llamax library), since many people are already asking for an easy-to-use native API that accepts multimodal input. However, for point (2), I think we need to consider it more carefully.

@pwilkin
Member Author

pwilkin commented Apr 9, 2026

@ngxson cross-platform daemon management can get really tricky, so I'd prefer not to go that route. I'd say spawning a llama-server instance that gets shut down when the CLI exits would be the preferred way to go.
