
Add HuggingFace Inference API support #1354

@ericevans-nv

Description

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

Medium

Please provide a clear description of the problem this feature solves

The current huggingface LLM provider only supports local model execution via the transformers library. Extending the provider to support remote inference would allow connecting to HuggingFace's Serverless Inference API, dedicated Inference Endpoints, and self-hosted TGI servers, so workflows can use HuggingFace models without requiring local GPU resources.

Describe your ideal solution

Create a new huggingface_inference LLM provider type with its own config and client implementation. The config should include model_id, api_key, and endpoint_url fields to support connecting to HuggingFace's Serverless Inference API, dedicated Inference Endpoints, and self-hosted TGI servers. The provider should use huggingface_hub.InferenceClient for HTTP-based inference. A LangChain-compatible wrapper class should be added along with the corresponding client registration in the nvidia_nat_langchain plugin.

Config:

# LLMBaseConfig and OptionalSecretStr are the toolkit's existing LLM config types
from pydantic import Field

class HuggingFaceInferenceConfig(LLMBaseConfig, name="huggingface_inference"):
    """Connection settings for HTTP-based HuggingFace inference."""

    model_id: str = Field(description="HuggingFace model ID")
    api_key: OptionalSecretStr = Field(default=None, description="HuggingFace API token")
    endpoint_url: str | None = Field(
        default=None,
        description="TGI or Inference Endpoint URL; uses the Serverless Inference API when unset")
    max_new_tokens: int = Field(default=128, description="Maximum tokens to generate per request")
    temperature: float = Field(default=0.0, description="Sampling temperature")
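
For illustration, the same config could target either deployment mode; the model ID, token, and URL below are placeholders:

# Serverless Inference API: hosted by HuggingFace, authenticated via token.
serverless = HuggingFaceInferenceConfig(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    api_key="hf_...",  # placeholder HuggingFace API token
)

# Self-hosted TGI or dedicated Inference Endpoint: route by URL instead.
tgi = HuggingFaceInferenceConfig(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    endpoint_url="http://localhost:8080",  # placeholder TGI address; no token required
)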

Provider:

@register_llm_provider(config_type=HuggingFaceInferenceConfig)
async def huggingface_inference_provider(config: HuggingFaceInferenceConfig, builder):
    """Register the provider; framework-specific clients are attached separately."""
    yield LLMProviderInfo(config=config, description=f"HuggingFace Inference: {config.model_id}")

Client:

from huggingface_hub import InferenceClient
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.messages import AIMessage
from langchain_core.outputs import ChatGeneration, ChatResult

class HuggingFaceInferenceModel(BaseChatModel):
    """LangChain-compatible wrapper for HuggingFace Inference API."""

    def __init__(self, config: HuggingFaceInferenceConfig, **kwargs):
        super().__init__(**kwargs)  # BaseChatModel is a pydantic model
        # InferenceClient's `model` accepts a Hub model ID or a direct TGI/endpoint URL
        self._client = InferenceClient(
            model=config.endpoint_url or config.model_id,
            token=config.api_key.get_secret_value() if config.api_key else None)
        self._config = config

    @property
    def _llm_type(self) -> str:
        return "huggingface_inference"

    def _generate(self, messages, stop=None, run_manager=None, **kwargs):
        raise NotImplementedError("Sync path omitted in this sketch; use the async API.")

    async def _agenerate(self, messages, stop=None, run_manager=None, **kwargs):
        # Map LangChain message types to OpenAI-style chat roles
        role_map = {"human": "user", "ai": "assistant", "system": "system"}
        payload = [{"role": role_map.get(m.type, m.type), "content": m.content} for m in messages]
        response = self._client.chat_completion(messages=payload,
            max_tokens=self._config.max_new_tokens, temperature=self._config.temperature)
        message = AIMessage(content=response.choices[0].message.content)
        return ChatResult(generations=[ChatGeneration(message=message)])
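
Registration: the matching client registration in the nvidia_nat_langchain plugin might look like the sketch below. The register_llm_client decorator, LLMFrameworkEnum, and import paths are assumptions modeled on the provider registration above; the actual names should follow the toolkit's existing LangChain LLM client registrations.

# NOTE: decorator name, enum, and import paths are assumptions mirroring
# how other LangChain LLM clients are registered in the plugin.
from nat.builder.framework_enum import LLMFrameworkEnum
from nat.cli.register_workflow import register_llm_client

@register_llm_client(config_type=HuggingFaceInferenceConfig, wrapper_type=LLMFrameworkEnum.LANGCHAIN)
async def huggingface_inference_langchain(config: HuggingFaceInferenceConfig, builder):
    yield HuggingFaceInferenceModel(config)

With this in place, a workflow would select the provider by its huggingface_inference type name, and the builder would hand LangChain-based functions a HuggingFaceInferenceModel instance.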

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
