Description
You're able to keep up with ggerganov, who updates quite quickly. In the C# world we have limited choices for a fast, minimalist interface to ggerganov's work. Even ggerganov builds with CUDA 12.2 instead of the latest 12.6, so unless we build everything bare metal rather than rely on prebuilt libraries, we will always be behind the eight ball. Other C# bridge libraries impose excessive abstractions and bulk themselves out with third-party tools, and because of this they update quite slowly.
The minimalist approach you take keeps things cutting edge. One of your best techniques was to keep the inference loop in a separate worker thread and enqueue/dequeue to it, rather than tightly integrating it with async enumerators on the client side. This keeps it clean and allows many other libraries, such as StableDiffusion.NET, to work in parallel. Not to mention the clean build toolchain you set up against ggerganov's repo for a quick pull-and-build-all, which absorbs CUDA toolkit changes in a one-click solution. This usually requires three separate steps to bare-metal compile all the associated libraries.
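To make the pattern I mean concrete, here is a minimal sketch of a worker thread fed through queues. It is not your actual code: `InferenceWorker`, `Enqueue`, `TryDequeue`, and the `infer` delegate are hypothetical names, and the delegate only stands in for the real inference loop.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch: one dedicated thread owns the inference loop,
// and clients only ever touch the two queues.
public sealed class InferenceWorker : IDisposable
{
    private readonly BlockingCollection<string> _requests = new BlockingCollection<string>();
    private readonly BlockingCollection<string> _responses = new BlockingCollection<string>();
    private readonly Thread _thread;

    // 'infer' is a placeholder for the real model call.
    public InferenceWorker(Func<string, string> infer)
    {
        _thread = new Thread(() =>
        {
            // Blocks until a prompt arrives; the loop ends after CompleteAdding().
            foreach (var prompt in _requests.GetConsumingEnumerable())
                _responses.Add(infer(prompt));
            _responses.CompleteAdding();
        })
        { IsBackground = true };
        _thread.Start();
    }

    // Called from any client thread; never blocks on inference.
    public void Enqueue(string prompt) => _requests.Add(prompt);

    // Waits up to timeoutMs for the next finished result.
    public bool TryDequeue(out string result, int timeoutMs) =>
        _responses.TryTake(out result, timeoutMs);

    public void Dispose()
    {
        _requests.CompleteAdding(); // let the worker drain and exit
        _thread.Join();
        _requests.Dispose();
        _responses.Dispose();
    }
}
```

Because the client only enqueues prompts and dequeues results, it never shares a thread with the model, which is what leaves room for other native libraries to run alongside it.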
My wish, though, is that you include some of the higher-level features ggerganov has in his CLI. I have tried to get the cache load/store to work, but it requires some knowledge of how to deal with unsafe pointers and global heap memory beyond just calling the bindings. I know this probably takes just 8-10 lines of code, but apparently it is beyond me. It also needs a separate tokenizing step to save the tokens, which are usually handled in that worker thread.
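For reference, this is roughly the shape I have in mind, written as a sketch rather than working code from your library. It assumes the native library resolves under the name `llama` and exports `llama_state_save_file` and `llama_state_load_file` as declared in `llama.h`; the `SessionCache` wrapper, its method names, and the raw `IntPtr` context handle are hypothetical. If the marshaller pins the `int[]` token buffers as I expect, no unsafe pointers or manual heap allocation should be needed.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper; only the two llama_state_* exports come from llama.h.
public static class SessionCache
{
    // Assumption: the native library is loadable under this name.
    private const string Lib = "llama";

    [DllImport(Lib, CallingConvention = CallingConvention.Cdecl)]
    [return: MarshalAs(UnmanagedType.I1)] // C 'bool' is one byte
    private static extern bool llama_state_save_file(
        IntPtr ctx,
        [MarshalAs(UnmanagedType.LPUTF8Str)] string pathSession,
        int[] tokens,          // llama_token is a 32-bit integer
        UIntPtr nTokenCount);  // size_t

    [DllImport(Lib, CallingConvention = CallingConvention.Cdecl)]
    [return: MarshalAs(UnmanagedType.I1)]
    private static extern bool llama_state_load_file(
        IntPtr ctx,
        [MarshalAs(UnmanagedType.LPUTF8Str)] string pathSession,
        [In, Out] int[] tokensOut,
        UIntPtr nTokenCapacity,
        out UIntPtr nTokenCountOut);

    // Saves the context state together with the tokens that produced it.
    public static bool Save(IntPtr ctx, string path, int[] tokens) =>
        llama_state_save_file(ctx, path, tokens, new UIntPtr((uint)tokens.Length));

    // Restores the context state and returns the tokens stored in the file.
    // 'capacity' is the largest token count the caller will accept.
    public static bool TryLoad(IntPtr ctx, string path, int capacity, out int[] tokens)
    {
        var buffer = new int[capacity];
        if (!llama_state_load_file(ctx, path, buffer, new UIntPtr((uint)capacity), out var count))
        {
            tokens = Array.Empty<int>();
            return false;
        }
        Array.Resize(ref buffer, checked((int)count.ToUInt64()));
        tokens = buffer;
        return true;
    }
}
```

The tokens passed to `Save` would have to come from the separate tokenizing step. On load, I assume the caller compares the returned tokens with the new prompt and only evaluates whatever comes after the matching prefix.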
I do not foresee ggerganov creating a middleware stack that will be as useful as your current system, and I would like to keep the inference loop as you have it. I just want the cache load/save to function. Perhaps call this your middleware stack, rather than leaving it as code snippets in your client examples, and add to it embeddings, quantization, and the other things supported by ggerganov's CLI tool. It would still be a minimalist approach with just this extra toolkit.