
Replicating OpenAI API for Llama, Alpaca or any animal-shaped LLM

OpenAI API is nice, building your own is great!

What’s up with AI nowadays?

You’ve probably read all the news, memes, and LinkedIn posts about ChatGPT and GPT-4. Your favorite crypto bro who was all in on NFTs last year? He is now telling you that AI is the future! Well, that might be true, so you’re obviously super excited. Maybe the future is indeed waitlists, forms, and black boxes. So yes, OpenAI’s API looks neat and you are considering it.

But it doesn’t have to be only that, for several reasons:

  • Maybe, rather than relying on the goodwill of huge for-profit organisations, you want to build, host, and deploy your own models, fine-tuned on your niche use cases. Or maybe your data is too sensitive to leave your servers?

  • Maybe you absolutely love the idea of an AI-as-a-service, and are completely fine with paying for it. It is indeed a valuable service. Yet you are suspicious and cheap, and might want to benchmark it against some baseline model.

  • Maybe you’ve developed your own model, and want a drop-in replacement to switch to yours, while being able to A/B test the two solutions easily.

  • Maybe your use case involves an unreliable connection and you want a fallback.

So what would be nice here is some kind of self-hosted version of it. Let’s see what we can do.

What are we looking for?

If you go through OpenAI’s documentation, you’ll see their API is quite simple and straightforward. You can address the main use cases for LLMs (and diffusion models) with their endpoints, in particular:

  • /models and /models/{model} allow you to list available models and get basic info about them
  • Language models are basically great token prediction models, so here we are with a /completions endpoint doing exactly that.
  • /chat/completions does much the same thing but for chats, using GPT-3.5-Turbo, aka “ChatGPT”.
  • /edits is a super cool one: you give an instruction, an optional input to use, and get the result (something like “Correct the spelling in ‘cahtGtp is col’”).
  • /embeddings to generate an embedding to represent your input text. Quite useful for many downstream tasks.

It’s quite simple to reproduce the logic and build something from there. You can use your favorite API framework, for instance the excellent FastAPI.
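
To make this concrete, here is a minimal, hypothetical sketch of such a /completions endpoint written with FastAPI. The request fields mirror the OpenAI API, and toy_generate is a made-up placeholder standing in for an actual model:

# Hypothetical sketch of an OpenAI-style /completions endpoint with FastAPI.
# toy_generate is a placeholder: swap in your own model inference.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    model: str
    prompt: str = ""
    max_tokens: int = 16
    temperature: float = 1.0


def toy_generate(prompt: str, max_tokens: int) -> str:
    # Placeholder "model": just echo the prompt with a suffix.
    return prompt + " [generated text]"


@app.post("/completions")
def create_completion(request: CompletionRequest):
    text = toy_generate(request.prompt, request.max_tokens)
    return {
        "object": "text_completion",
        "model": request.model,
        "choices": [{"text": text, "index": 0}],
    }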

Introducing the SimpleAI project

Well, good news: I’ve done this so you don’t have to. Code and docs are here, and installing it should be as easy as:

 pip install simple_ai_server

Ok, now the best part: you can start your own server in 2 lines.

simple_ai init
simple_ai serve

Well, kind of. You still have to add your own models; this just starts a functional yet empty server. But that’s where the fun begins. I’ve been using gRPC (and Docker) in the backend, so you can easily plug in your model, using the language of your choice.

The motivations for this were:

  • I wanted to have some kind of separation between the API logic and the model itself, for flexibility and maintenance, so running models “somewhere else” and using an interface such as gRPC made sense.
  • I also wanted to have some kind of scalability, so having a lightweight API server being the interface to an arbitrary set of models was appealing. We can imagine having one instance of the server communicating with N containers / servers and using a load balancer or a cache, depending on our needs.
  • Likewise, I didn’t want to be limited to one language. Even if Python is the usual suspect for deep learning, we’ve seen cool stuff in different languages. For instance, llama.cpp is C++ based and allows us to run Llama on CPU.

Using gRPC made sense to me, but the project is easily extendable to other protocols such as a REST API or a Kafka pub/sub. And if you’re primarily working with Python and are afraid of protobuf files and gRPC, it’s still relatively straightforward to use thanks to the classes and functions provided by the package. Using:

from simple_ai.serve.python.completion.server import (
  serve, 
  LanguageModelServicer, LanguageModel
)

You should be covered, as you can see in the examples.
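
For instance, a minimal servicer could look like the sketch below. The EchoModel class, its complete() method, and the serve() arguments are assumptions made for illustration, so check the examples in the repository for the exact interface:

# Hypothetical sketch: exposing a toy model through SimpleAI's gRPC
# completion interface. Method and argument names are assumptions, not the
# exact API; see the project's examples for the real signatures.
from simple_ai.serve.python.completion.server import (
    serve,
    LanguageModelServicer,
    LanguageModel,
)


class EchoModel(LanguageModel):
    # Toy model: "completes" a prompt by echoing it back.
    def complete(self, prompt: str = "", **kwargs) -> str:
        return prompt + " ... and that is all I have to say."


if __name__ == "__main__":
    # Listen on the port you will later declare in models.toml.
    serve(address="[::]:50051", model_servicer=LanguageModelServicer(model=EchoModel()))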

An example

You can start with this example, for a quick integration of an Alpaca-LoRA model. If you have a recent NVIDIA GPU, it’s as easy as first building the image:

docker build . -t alpaca-7b-service:latest

Note: you might want to increase the storage capacity for Docker, as the resulting image is quite large. Add "storage-opt": ["size=60G"] to your Docker engine configuration.

Then adding it to your generated models.toml configuration file:

[alpaca-lora-7B]
    [alpaca-lora-7B.metadata]
        owned_by    = 'Tloen <github.com/tloen>'
        permission  = []
        description = "Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning, reproducing the Stanford Alpaca results using low-rank adaptation (LoRA). It provides an Instruct model of similar quality to OpenAI's text-davinci-003"
    [alpaca-lora-7B.network]
        url = 'localhost:50051'

And finally, starting a container:

docker run -it --rm -p 50051:50051 --gpus all alpaca-7b-service:latest

You should now be able to use the API to help you send a nice message to your colleagues:

curl -X 'POST' \
  'http://127.0.0.1:8080/edits/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "alpaca-lora-7B",
  "instruction": "Make this message nicer and more formal",
  "input": "This meeting was useless and should have been a bloody email",
  "top_p": 1,
  "n": 1,
  "temperature": 1,
  "max_tokens": 256
}'

You should get an output like this:

{
  "id": "19fc9e53-74e7-43cf-ab76-6816c8756a75",
  "object": "edit",
  "created": 1679343430,
  "model": "alpaca-lora-7B",
  "choices": [
    {
      "text": "This meeting was unproductive and should have been conducted via email.",
      "index": 0
    }
  ]
}

Or, if you feel like it, write a haiku about your last drink:

curl -X 'POST' \
  'http://127.0.0.1:8080/edits/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "alpaca-lora-7B",
  "instruction": "write a Haiku about espresso",
  "input": "",
  "top_p": 1,
  "n": 1,
  "temperature": 1,
  "max_tokens": 256
}'

Not bad for an alpaca:

{
  "id": "ef84b838-0532-455e-bbc4-8385db5cbd5d",
  "object": "edit",
  "created": 1679343430,
  "model": "alpaca-lora-7B",
  "choices": [
    {
      "text": "Coffee's aroma fills the air,\nEspresso's aroma fills my heart.",
      "index": 0
    }
  ]
}

“Coffee’s aroma fills the air,

Espresso’s aroma fills my heart.”

I am not sure I could have come up with something better, but it says more about my writing skills than where we are on our path to AGI.

Bonus: if you are using the Python OpenAI package

The OpenAI package uses openai.api_base, which can be changed to point to your own instance’s base URL:

import openai

# Put anything you want in `API key`
openai.api_key = 'Free the models'

# Point to your own url
openai.api_base = "http://127.0.0.1:8080"

# Do your usual things, for instance a completion query:
print(openai.Model.list())
completion = openai.Completion.create(model="llama-7B", prompt="Hello everyone this is")
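
The edit endpoint used earlier can be reached the same way. Continuing from the snippet above, and assuming the pre-1.0 openai package (the one exposing openai.Completion as used here):

# Same idea for the edit endpoint, mirroring the earlier curl example.
edit = openai.Edit.create(
    model="alpaca-lora-7B",
    instruction="Make this message nicer and more formal",
    input="This meeting was useless and should have been a bloody email",
)
print(edit["choices"][0]["text"])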

And that’s it.

What is next?

Well, what is next will depend on you! I’m sure this could be used for a lot of great use cases. I have mine already, but it would be cool to know if/how it’s useful to others.

This project has been fun and a great learning experience. I’ll keep contributing to it (some parts are still messy), and if you have any feedback I’d be happy to have a chat about it.

It’s standing on the shoulders of giants and none of this would be possible without the incredible contributions of many to the open source and research communities. So thanks to anyone contributing to these projects!
