Introduction

Having tried llama, what strikes me is that it runs as a console application, which makes it awkward to use in a product.
If it were offered as an API, the way the actual ChatGPT is, it would be much easier to work with, so I looked into whether something like that exists.

What options are there?

Python Bindings for llama.cpp

Python Bindings for llama.cpp is, as the name says, a Python binding for llama.cpp.
It also ships a Python-based server module.
(llama.cpp itself actually has a server application too, but this one is easier to get going.)

At the time of writing, Python 3.8 or later is required.

Install

$ python -m pip install pip --upgrade
$ python -m pip install llama-cpp-python[server]
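
Whether the install succeeded can be checked quickly by importing the package and printing its version (a minimal sanity check, assuming the installed llama_cpp module exposes __version__, which recent releases do):

$ python -c "import llama_cpp; print(llama_cpp.__version__)"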

Using a GPU takes a little more work.

Windows

Adding --verbose makes the build progress visible, so you can check whether CUDA has been picked up correctly.

$ set CMAKE_ARGS=-DLLAMA_CUBLAS=on
$ set FORCE_CMAKE=1
$ python -m pip install llama-cpp-python[server] --verbose
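
On Linux, the same build options can be passed inline instead of via set (a sketch assuming a bash-like shell; the extra is quoted to be safe under zsh):

$ CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 python -m pip install 'llama-cpp-python[server]' --verbose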

Run

First of all, you need a model file.
Model files can be downloaded from HuggingFace.
Note that you need a model in GGUF format, not the official llama weights.
Conversion to GGUF can be done with llama.cpp's convert.py.
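
A minimal conversion sketch (the exact arguments depend on the llama.cpp revision; the weight path and the --outfile name here are illustrative):

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ python -m pip install -r requirements.txt
$ python convert.py /path/to/llama-2-7b-chat --outfile llama-2-7b-chat.f16.gguf

Quantized variants such as Q4_K_M are then produced with the quantize tool bundled with llama.cpp. Below, a pre-converted model is simply downloaded instead.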

$ curl -LO https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
$ python -m llama_cpp.server --model llama-2-7b-chat.Q4_K_M.gguf
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: system memory used = 3891.35 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
INFO: Started server process [13656]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)

With that, the server starts up.

If you installed with GPU support enabled and want to use the GPU, pass --n_gpu_layers.
The default value is 0; specifying -1 offloads all of the model's layers to the GPU.

$ python -m llama_cpp.server --model llama-2-7b-chat.Q4_K_M.gguf --n_gpu_layers -1

There is also a Swagger interface.
Opening http://localhost:8000/docs shows a page like the one below.

[Screenshot: the Swagger UI at /docs, opened in Safari]

This means you can post questions to it with curl.

Windows
$ curl http://localhost:8000/v1/chat/completions -X POST -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"messages\": [ { \"content\": \"What is the capital of France?\", \"role\": \"user\" } ] }"
{"id":"chatcmpl-c2d4378c-f4a4-4fdb-ae08-e69f08553f61","object":"chat.completion","created":1703959574,"model":"llama-2-7b-chat.Q4_K_M.gguf","choices":[{"index":0,"message":{"content":" The capital of France is Paris. French people also refer to it as the \"City of Light\" because of its rich history, culture, and architecture. It is located in the northern central part of France and is home to many famous landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. The capital of France has a population of around 2 million people and is known for its vibrant fashion industry, art galleries, and cultural events.","role":"assistant"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":103,"total_tokens":121}}
Linux
$ curl -X POST http://localhost:8000/v1/chat/completions -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "messages": [ { "content": "What is the capital of France?", "role": "user" } ] }'
{"id":"chatcmpl-a1eee142-20e4-4ca2-bb35-04ada9667d4b","object":"chat.completion","created":1703959761,"model":"llama-2-7b-chat.Q4_K_M.gguf","choices":[{"index":0,"message":{"content":" The capital of France is Paris. The city of Paris is located in the northern central part of France and is known for its iconic landmarks such as the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum. It is also home to many cultural institutions, including the Sorbonne University and the Palace of Versailles. The city has a population of around 2 million people and is considered one of the most popular tourist destinations in Europe.","role":"assistant"},"finish_reason":"stop"}],"usage":{"prompt_tokens":18,"completion_tokens":100,"total_tokens":118}}

Note

Concurrent execution from multiple clients is not supported.
Is the server can be run with multiple states? #257

Deep learning is just running forward/backward passes against a model loaded in memory, so that is only natural.