diff --git a/README.md b/README.md
index 8e0c0b069..af7a5b46f 100644
--- a/README.md
+++ b/README.md
@@ -60,6 +60,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用
## 🎉 News
+- 2023.12.18: Support **VLLM** for inference acceleration and deployment. For more details, refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
- 2023.12.15: Support the **deepseek** and **deepseek-coder** series: deepseek-7b, deepseek-7b-chat, deepseek-67b, deepseek-67b-chat, openbuddy-deepseek-67b-chat, deepseek-coder-1_3b, deepseek-coder-1_3b-chat, deepseek-coder-6_7b, deepseek-coder-6_7b-chat, deepseek-coder-33b, deepseek-coder-33b-chat.
- 2023.12.13: Support mistral-7b-chat-v2, [mixtral-7b-moe](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe), [mixtral-7b-moe-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe_chat).
- 2023.12.9: Support the `freeze_parameters` parameter as a compromise between LoRA and full-parameter training. Corresponding shell scripts can be found at [full_freeze_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_freeze_ddp). Support the `disable_tqdm`, `lazy_tokenize`, and `preprocess_num_proc` parameters; for details, refer to [Command-Line parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
@@ -102,6 +103,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用
- **Self-cognition fine-tuning** for large models in **10 minutes**, creating a personalized large model; please refer to [Best Practices for Self-cognition Fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自我认知微调最佳实践.md).
- Quickly perform **inference** on an LLM and build a **Web-UI**; see the [LLM Inference Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
- Rapidly **fine-tune** an LLM, perform inference, and build a Web-UI; see the [LLM Fine-tuning Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md).
+- Utilize VLLM for **inference acceleration** and **deployment**. Please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md) for more information.
- View the models and datasets supported by Swift. You can check [supported models and datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
- Expand and customize models, datasets, and dialogue templates in Swift; see [Customization and Expansion](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
- Check command-line parameters for fine-tuning and inference; see [Command-Line parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
diff --git a/README_CN.md b/README_CN.md
index c550a6671..0d6a213e3 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -58,6 +58,7 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is a scalable
Users can check the [SWIFT official documentation](docs/source/GetStarted/快速使用.md) for detailed information.
## 🎉 News
+- 2023.12.18: Support **VLLM** for inference acceleration and deployment. For details, see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
- 2023.12.15: Support the **deepseek** and **deepseek-coder** series: deepseek-7b, deepseek-7b-chat, deepseek-67b, deepseek-67b-chat, openbuddy-deepseek-67b-chat, deepseek-coder-1_3b, deepseek-coder-1_3b-chat, deepseek-coder-6_7b, deepseek-coder-6_7b-chat, deepseek-coder-33b, deepseek-coder-33b-chat.
- 2023.12.13: Support mistral-7b-chat-v2, [mixtral-7b-moe](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe), [mixtral-7b-moe-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe_chat).
- 2023.12.9: Support the `freeze_parameters` parameter as a compromise between LoRA and full-parameter training. The corresponding shell scripts are at [full_freeze_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_freeze_ddp). Support the `disable_tqdm`, `lazy_tokenize`, and `preprocess_num_proc` parameters; for details, see [Command-Line Parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
@@ -100,6 +101,7 @@ SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) is a scalable
- **Self-cognition fine-tuning** of a large model in **10 minutes** to create your own personalized model; see [Best Practices for Self-cognition Fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自我认知微调最佳实践.md).
- Quickly run **inference** on an LLM and build a **Web-UI**; see the [LLM Inference Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
- Quickly **fine-tune** an LLM, run inference, and build a Web-UI; see the [LLM Fine-tuning Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md).
+- Use VLLM for **inference acceleration** and **deployment**; see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md). A minimal CLI sketch is shown right after this list.
- View the models and datasets supported by swift; see [Supported Models and Datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
- **Extend** the models, datasets, and chat templates in swift; see [Customization and Extension](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
- Look up the command-line arguments for fine-tuning and inference; see [Command-Line Parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
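To make the VLLM entry above concrete, here is a minimal illustrative sketch (an editorial addition, not part of the patch). It combines the `swift infer` command and the `--infer_backend vllm` flag, both documented later in this patch, with the qwen/Qwen-7B-Chat model id used throughout the repository's examples:

```bash
# Accelerated inference of an original (non-fine-tuned) model with the vllm backend.
CUDA_VISIBLE_DEVICES=0 swift infer --model_id_or_path qwen/Qwen-7B-Chat --infer_backend vllm
```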
@@ -333,6 +335,7 @@ output
This project is licensed under the [Apache License (Version 2.0)](https://github.com/modelscope/modelscope/blob/master/LICENSE).
+
## ☎ Contact Us
You can join our WeChat group to contact and communicate with us:
diff --git a/docs/source/LLM/LLM微调文档.md b/docs/source/LLM/LLM微调文档.md
index a194a995a..ffa72c5d6 100644
--- a/docs/source/LLM/LLM微调文档.md
+++ b/docs/source/LLM/LLM微调文档.md
@@ -222,6 +222,8 @@ swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx'
```
## Inference
+If you want to use VLLM for inference acceleration, see [VLLM Inference Acceleration and Deployment](./VLLM推理加速与部署.md#微调后的模型).
+
### Original Model
For **single-sample inference**, see the [LLM Inference Documentation](./LLM推理文档.md#-推理).
@@ -230,7 +232,7 @@ swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx'
CUDA_VISIBLE_DEVICES=0 swift infer --model_id_or_path qwen/Qwen-7B-Chat --dataset blossom-math-zh
```
### Fine-tuned Model
-**Single-sample inference**
+**Single-sample inference**:
Inference using the LoRA **delta** weights:
```python
@@ -241,13 +243,12 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)
from swift.tuners import Swift
-import torch
model_dir = 'vx_xxx/checkpoint-100'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
-model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'})
+model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
model = Swift.from_pretrained(model, model_dir, inference_mode=True)
template = get_template(template_type, tokenizer)
@@ -265,13 +266,12 @@ os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)
-import torch
model_dir = 'vx_xxx/checkpoint-100-merged'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
-model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'},
+model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'},
                                        model_dir=model_dir)
template = get_template(template_type, tokenizer)
@@ -292,6 +292,8 @@ CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged'
```
## Web-UI
+If you want to use VLLM for deployment and to expose an **API** interface, see [VLLM Inference Acceleration and Deployment](./VLLM推理加速与部署.md#部署).
+
### Original Model
For the web-ui of the original model, see the [LLM Inference Documentation](./LLM推理文档.md#-Web-UI).
diff --git a/docs/source/LLM/LLM推理文档.md b/docs/source/LLM/LLM推理文档.md
index 0c0d734bf..210320e8f 100644
--- a/docs/source/LLM/LLM推理文档.md
+++ b/docs/source/LLM/LLM推理文档.md
@@ -1,4 +1,6 @@
# LLM Inference Documentation
+If you want to use vllm for inference acceleration, see [VLLM Inference Acceleration and Deployment](./VLLM推理加速与部署.md#推理加速).
+
## Table of Contents
- [Environment Preparation](#环境准备)
- [Inference](#推理)
@@ -34,7 +36,6 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
-import torch
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: chatml
kwargs = {}
# kwargs['use_flash_attn'] = True  # use flash_attn
-model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'}, **kwargs)
+model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, **kwargs)
# Modify max_new_tokens
model.generation_config.max_new_tokens = 128
@@ -97,7 +98,6 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
-import torch
model_type = ModelType.qwen_7b_chat_int4
template_type = get_default_template_type(model_type)
@@ -135,13 +135,12 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type,
)
from swift.utils import seed_everything
-import torch
model_type = ModelType.qwen_7b
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')  # template_type: default-generation
-model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'})
+model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 64
template = get_template(template_type, tokenizer)
seed_everything(42)
@@ -177,7 +176,6 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference_stream, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
-import torch
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
@@ -219,7 +217,6 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
-import torch
model_type = ModelType.qwen_vl_chat
template_type = get_default_template_type(model_type)
@@ -262,7 +259,6 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
-import torch
model_type = ModelType.qwen_audio_chat
template_type = get_default_template_type(model_type)
@@ -304,7 +300,6 @@ from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
-import torch
model_type = ModelType.chatglm3_6b
template_type = get_default_template_type(model_type)
@@ -430,7 +425,7 @@ app_ui_main(infer_args)
### qwen-7b
Using the CLI:
```bash
-swift app-ui --model_id_or_path qwen/Qwen-7B
+CUDA_VISIBLE_DEVICES=0 swift app-ui --model_id_or_path qwen/Qwen-7B
```
Using python:
diff --git a/docs/source/LLM/VLLM推理加速与部署.md b/docs/source/LLM/VLLM推理加速与部署.md
new file mode 100644
index 000000000..8af042bb6
--- /dev/null
+++ b/docs/source/LLM/VLLM推理加速与部署.md
@@ -0,0 +1,219 @@
+
+# VLLM Inference Acceleration and Deployment
+
+## Table of Contents
+- [Environment Preparation](#环境准备)
+- [Inference Acceleration](#推理加速)
+- [Web-UI Acceleration](#web-ui加速)
+- [Deployment](#部署)
+
+## Environment Preparation
+GPU devices: A10, 3090, V100, and A100 all work.
+```bash
+# Set the global pip mirror
+pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
+# Install ms-swift
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e .[llm]
+
+# vllm versions are tied to CUDA versions; choose a version per `https://docs.vllm.ai/en/latest/getting_started/installation.html`.
+pip install vllm -U
+
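+# (Optional sanity check -- an editor's illustrative addition, not part of the
+# original document: confirm that vllm imports cleanly before moving on.)
+python -c "import vllm; print(vllm.__version__)"
+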
+# If you want to run inference with auto_gptq-based models:
+# models using auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md#模型`
+# auto_gptq versions are tied to CUDA versions; choose a version per `https://github.com/PanQiWei/AutoGPTQ#quick-installation`.
+pip install auto_gptq -U
+
+# Environment alignment (if you hit errors, run the code below; the repository is tested with the latest environment)
+pip install -r requirements/framework.txt -U
+pip install -r requirements/llm.txt -U
+```
+
+## Inference Acceleration
+
+### qwen-7b-chat
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    ModelType, get_vllm_engine, get_default_template_type,
+    get_template, inference_vllm
+)
+
+model_type = ModelType.qwen_7b_chat
+llm_engine = get_vllm_engine(model_type)
+template_type = get_default_template_type(model_type)
+template = get_template(template_type, llm_engine.tokenizer)
+# An interface similar to `transformers.GenerationConfig`
+llm_engine.generation_config.max_new_tokens = 256
+
+request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
+resp_list = inference_vllm(llm_engine, template, request_list)
+for request, resp in zip(request_list, resp_list):
+    print(f"query: {request['query']}")
+    print(f"response: {resp['response']}")
+
+history1 = resp_list[1]['history']
+request_list = [{'query': '这有什么好吃的', 'history': history1}]
+resp_list = inference_vllm(llm_engine, template, request_list)
+for request, resp in zip(request_list, resp_list):
+    print(f"query: {request['query']}")
+    print(f"response: {resp['response']}")
+    print(f"history: {resp['history']}")
+
+"""Out[0]
+query: 你好!
+response: 你好！很高兴为你服务。有什么我可以帮助你的吗？
+query: 浙江的省会在哪?
+response: 浙江省会是杭州市。
+query: 这有什么好吃的
+response: 杭州是一个美食之城，拥有许多著名的菜肴和小吃，例如西湖醋鱼、东坡肉、叫化童子鸡等。此外，杭州还有许多小吃店，可以品尝到各种各样的本地美食。
+history: [('浙江的省会在哪？', '浙江省会是杭州市。'), ('这有什么好吃的', '杭州是一个美食之城，拥有许多著名的菜肴和小吃，例如西湖醋鱼、东坡肉、叫化童子鸡等。此外，杭州还有许多小吃店，可以品尝到各种各样的本地美食。')]
+"""
+```
+
+### Streaming Output
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    ModelType, get_vllm_engine, get_default_template_type,
+    get_template, inference_stream_vllm
+)
+
+model_type = ModelType.qwen_7b_chat
+llm_engine = get_vllm_engine(model_type)
+template_type = get_default_template_type(model_type)
+template = get_template(template_type, llm_engine.tokenizer)
+# An interface similar to `transformers.GenerationConfig`
+llm_engine.generation_config.max_new_tokens = 256
+
+request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
+gen = inference_stream_vllm(llm_engine, template, request_list)
+query_list = [request['query'] for request in request_list]
+print(f"query_list: {query_list}")
+for resp_list in gen:
+    response_list = [resp['response'] for resp in resp_list]
+    print(f'response_list: {response_list}')
+
+history1 = resp_list[1]['history']
+request_list = [{'query': '这有什么好吃的', 'history': history1}]
+gen = inference_stream_vllm(llm_engine, template, request_list)
+query = request_list[0]['query']
+print(f"query: {query}")
+for resp_list in gen:
+    response = resp_list[0]['response']
+    print(f'response: {response}')
+
+history = resp_list[0]['history']
+print(f'history: {history}')
+
+"""Out[0]
+query_list: ['你好!', '浙江的省会在哪?']
+...
+response_list: ['你好！很高兴为你服务。有什么我可以帮助你的吗？', '浙江省会是杭州市。']
+query: 这有什么好吃的
+...
+response: 杭州是一个美食之城，拥有许多著名的菜肴和小吃，例如西湖醋鱼、东坡肉、叫化童子鸡等。此外，杭州还有许多小吃店，可以品尝到各种各样的本地美食。
+history: [('浙江的省会在哪？', '浙江省会是杭州市。'), ('这有什么好吃的', '杭州是一个美食之城，拥有许多著名的菜肴和小吃，例如西湖醋鱼、东坡肉、叫化童子鸡等。此外，杭州还有许多小吃店，可以品尝到各种各样的本地美食。')]
+"""
+```
+
+### chatglm3
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    ModelType, get_vllm_engine, get_default_template_type,
+    get_template, inference_vllm
+)
+
+model_type = ModelType.chatglm3_6b
+llm_engine = get_vllm_engine(model_type)
+template_type = get_default_template_type(model_type)
+template = get_template(template_type, llm_engine.tokenizer)
+# An interface similar to `transformers.GenerationConfig`
+llm_engine.generation_config.max_new_tokens = 256
+
+request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
+resp_list = inference_vllm(llm_engine, template, request_list)
+for request, resp in zip(request_list, resp_list):
+    print(f"query: {request['query']}")
+    print(f"response: {resp['response']}")
+
+history1 = resp_list[1]['history']
+request_list = [{'query': '这有什么好吃的', 'history': history1}]
+resp_list = inference_vllm(llm_engine, template, request_list)
+for request, resp in zip(request_list, resp_list):
+    print(f"query: {request['query']}")
+    print(f"response: {resp['response']}")
+    print(f"history: {resp['history']}")
+
+"""Out[0]
+query: 你好!
+response: 您好，我是人工智能助手。很高兴为您服务！请问有什么问题我可以帮您解答？
+query: 浙江的省会在哪?
+response: 浙江的省会是杭州。
+query: 这有什么好吃的
+response: 浙江有很多美食，其中一些非常有名的包括杭州的龙井虾仁、东坡肉、西湖醋鱼、叫化童子鸡等。另外，浙江还有很多特色小吃和糕点，比如宁波的汤团、年糕，温州的炒螃蟹、温州肉圆等。
+history: [('浙江的省会在哪？', '浙江的省会是杭州。'), ('这有什么好吃的', '浙江有很多美食，其中一些非常有名的包括杭州的龙井虾仁、东坡肉、西湖醋鱼、叫化童子鸡等。另外，浙江还有很多特色小吃和糕点，比如宁波的汤团、年糕，温州的炒螃蟹、温州肉圆等。')]
+"""
+```
+
+### Fine-tuned Models
+
+**Single-sample inference**:
+
+For models fine-tuned with LoRA, you need to run [merge-lora](./LLM微调文档.md#merge-lora) first to produce a complete checkpoint directory.
+
+Models fine-tuned with full parameters can seamlessly use VLLM for inference acceleration.
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    ModelType, get_vllm_engine, get_default_template_type,
+    get_template, inference_vllm
+)
+
+model_dir = 'vx_xxx/checkpoint-100-merged'
+model_type = ModelType.qwen_7b_chat
+template_type = get_default_template_type(model_type)
+
+llm_engine = get_vllm_engine(model_type, model_dir=model_dir)
+tokenizer = llm_engine.tokenizer
+template = get_template(template_type, tokenizer)
+query = '你好'
+resp = inference_vllm(llm_engine, template, [{'query': query}])[0]
+print(f"response: {resp['response']}")
+print(f"history: {resp['history']}")
+```
+
+Evaluation using a **dataset**:
+```bash
+# Merge the LoRA delta weights and use vllm for inference acceleration
+swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx'
+CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged' --infer_backend vllm
+```
+
+## Web-UI Acceleration
+
+### Original Models
+```bash
+CUDA_VISIBLE_DEVICES=0 swift app-ui --model_id_or_path qwen/Qwen-7B-Chat --infer_backend vllm
+```
+
+### Fine-tuned Models
+```bash
+# Merge the LoRA delta weights and build the app-ui with vllm as the backend
+swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx'
+CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged' --infer_backend vllm
+```
+
+## Deployment
+TODO
diff --git a/docs/source/LLM/命令行参数.md b/docs/source/LLM/命令行参数.md
index 7514fd947..99ec009f2 100644
--- a/docs/source/LLM/命令行参数.md
+++ b/docs/source/LLM/命令行参数.md
@@ -91,6 +91,7 @@
- `--model_cache_dir`: Default is `None`. A detailed description of this parameter can be found under `sft.sh命令行参数`.
- `--sft_type`: Default is `'lora'`. A detailed description can be found under `sft.sh命令行参数`.
- `--template_type`: Default is `'AUTO'`. A detailed description can be found under `sft.sh命令行参数`.
+- `--infer_backend`: One of 'AUTO', 'vllm', 'pt'. Default is 'AUTO', which chooses intelligently: if no `ckpt_dir` is passed or full-parameter fine-tuning was used, and vllm is installed and the model supports vllm, the vllm engine is used; otherwise native torch inference is used. For vllm environment preparation, see [VLLM Inference Acceleration and Deployment](./VLLM推理加速与部署.md#环境准备). A combined usage sketch follows right after this bullet.
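As an illustration (an editorial sketch, not from the original docs), `--infer_backend` can be combined with the `--gpu_memory_utilization` and `--tensor_parallel_size` flags documented further below; the checkpoint path is a placeholder in the repository's usual `xxx/vx_xxx` style.

```bash
# Hypothetical two-GPU vllm inference over a merged checkpoint, with tensor
# parallelism across both visible devices and 90% of GPU memory reserved.
CUDA_VISIBLE_DEVICES=0,1 swift infer \
    --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged' \
    --infer_backend vllm \
    --tensor_parallel_size 2 \
    --gpu_memory_utilization 0.9
```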
- `--ckpt_dir`: Required; the path to the checkpoint saved in the SFT stage, e.g. `'/path/to/your/vx_xxx/checkpoint-xxx'`.
- `--load_args_from_ckpt_dir`: Whether to read configuration from the `sft_args.json` file in `ckpt_dir`. Default is `True`.
- `--load_dataset_config`: Only takes effect when `--load_args_from_ckpt_dir true`; whether to read the dataset-related configuration from the `sft_args.json` file in `ckpt_dir`. Default is `True`.
@@ -125,3 +126,5 @@
- `--overwrite_generation_config`: Whether to save the generation_config used for evaluation as a `generation_config.json` file. Default is `False`. The generation_config file saved during training will be overwritten.
- `--verbose`: If set to False, inference runs in tqdm style. If set to True, the inference query, response, and label are printed. Default is `None`, i.e. auto-selected: set to False when `len(val_dataset) >= 100`, otherwise True. Only takes effect when `--eval_human false`.
- `--share`: Passed to gradio's `demo.queue().launch(...)` function. Only takes effect when using `app-ui`.
+- `--gpu_memory_utilization`: A parameter for initializing the vllm engine's `EngineArgs`. Default is `0.9`. Only takes effect when using vllm.
+- `--tensor_parallel_size`: A parameter for initializing the vllm engine's `EngineArgs`. Default is `1`. Only takes effect when using vllm.
diff --git a/docs/source/LLM/支持的模型和数据集.md b/docs/source/LLM/支持的模型和数据集.md
index 64a4c3600..b45b2369c 100644
--- a/docs/source/LLM/支持的模型和数据集.md
+++ b/docs/source/LLM/支持的模型和数据集.md
@@ -8,105 +8,106 @@
- Model List: The list of model_types registered in swift.
- Default Lora Target Modules: The default lora_target_modules of the corresponding model.
- Default Template: The default template of the corresponding model.
-- Support Flash Attn: Whether the model supports [flash attention](https://github.com/Dao-AILab/flash-attention).
+- Support Flash Attn: Whether the model supports [flash attention](https://github.com/Dao-AILab/flash-attention) for accelerating inference and fine-tuning.
+- Support VLLM: Whether the model supports [vllm](https://github.com/vllm-project/vllm) for accelerating inference and deployment.
- Requires: The extra dependencies required by the corresponding model.

-| Model Type | Model ID | Default Lora Target Modules | Default Template | Support Flash Attn | Requires |
-| --------- | -------- | --------------------------- | ---------------- | ------------------ | -------- |
-|qwen-1_8b|[qwen/Qwen-1_8B](https://modelscope.cn/models/qwen/Qwen-1_8B/summary)|c_attn|default-generation|✔||
-|qwen-1_8b-chat|[qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary)|c_attn|chatml|✔||
-|qwen-1_8b-chat-int4|[qwen/Qwen-1_8B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5|
-|qwen-1_8b-chat-int8|[qwen/Qwen-1_8B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5|
-|qwen-7b|[qwen/Qwen-7B](https://modelscope.cn/models/qwen/Qwen-7B/summary)|c_attn|default-generation|✔||
-|qwen-7b-chat|[qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary)|c_attn|chatml|✔||
-|qwen-7b-chat-int4|[qwen/Qwen-7B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5|
-|qwen-7b-chat-int8|[qwen/Qwen-7B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5|
-|qwen-14b|[qwen/Qwen-14B](https://modelscope.cn/models/qwen/Qwen-14B/summary)|c_attn|default-generation|✔||
-|qwen-14b-chat|[qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary)|c_attn|chatml|✔||
-|qwen-14b-chat-int4|[qwen/Qwen-14B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5|
-|qwen-14b-chat-int8|[qwen/Qwen-14B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5|
-|qwen-72b|[qwen/Qwen-72B](https://modelscope.cn/models/qwen/Qwen-72B/summary)|c_attn|default-generation|✔||
-|qwen-72b-chat|[qwen/Qwen-72B-Chat](https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary)|c_attn|chatml|✔|| -|qwen-72b-chat-int4|[qwen/Qwen-72B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-72b-chat-int8|[qwen/Qwen-72B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-vl|[qwen/Qwen-VL](https://modelscope.cn/models/qwen/Qwen-VL/summary)|c_attn|default-generation|✔|| -|qwen-vl-chat|[qwen/Qwen-VL-Chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary)|c_attn|chatml|✔|| -|qwen-vl-chat-int4|[qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-audio|[qwen/Qwen-Audio](https://modelscope.cn/models/qwen/Qwen-Audio/summary)|c_attn|default-generation|✔|| -|qwen-audio-chat|[qwen/Qwen-Audio-Chat](https://modelscope.cn/models/qwen/Qwen-Audio-Chat/summary)|c_attn|chatml|✔|| -|chatglm2-6b|[ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b/summary)|query_key_value|chatglm2|✘|| -|chatglm2-6b-32k|[ZhipuAI/chatglm2-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm2-6b-32k/summary)|query_key_value|chatglm2|✘|| -|chatglm3-6b-base|[ZhipuAI/chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base/summary)|query_key_value|chatglm-generation|✘|| -|chatglm3-6b|[ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary)|query_key_value|chatglm3|✘|| -|chatglm3-6b-32k|[ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary)|query_key_value|chatglm3|✘|| -|llama2-7b|[modelscope/Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|llama2-7b-chat|[modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|| -|llama2-13b|[modelscope/Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|llama2-13b-chat|[modelscope/Llama-2-13b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|| -|llama2-70b|[modelscope/Llama-2-70b-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|llama2-70b-chat|[modelscope/Llama-2-70b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|| -|yi-6b|[01ai/Yi-6B](https://modelscope.cn/models/01ai/Yi-6B/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-6b-200k|[01ai/Yi-6B-200K](https://modelscope.cn/models/01ai/Yi-6B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-6b-chat|[01ai/Yi-6B-Chat](https://modelscope.cn/models/01ai/Yi-6B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|| -|yi-34b|[01ai/Yi-34B](https://modelscope.cn/models/01ai/Yi-34B/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-34b-200k|[01ai/Yi-34B-200K](https://modelscope.cn/models/01ai/Yi-34B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-34b-chat|[01ai/Yi-34B-Chat](https://modelscope.cn/models/01ai/Yi-34B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|| -|deepseek-7b|[deepseek-ai/deepseek-llm-7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| 
-|deepseek-7b-chat|[deepseek-ai/deepseek-llm-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|| -|deepseek-67b|[deepseek-ai/deepseek-llm-67b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-67b-chat|[deepseek-ai/deepseek-llm-67b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|| -|openbuddy-llama2-13b-chat|[OpenBuddy/openbuddy-llama2-13b-v8.1-fp16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|openbuddy-llama-65b-chat|[OpenBuddy/openbuddy-llama-65b-v8-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama-65b-v8-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|openbuddy-llama2-70b-chat|[OpenBuddy/openbuddy-llama2-70b-v10.1-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-70b-v10.1-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|openbuddy-mistral-7b-chat|[OpenBuddy/openbuddy-mistral-7b-v13.1](https://modelscope.cn/models/OpenBuddy/openbuddy-mistral-7b-v13.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|transformers>=4.34| -|openbuddy-zephyr-7b-chat|[OpenBuddy/openbuddy-zephyr-7b-v14.1](https://modelscope.cn/models/OpenBuddy/openbuddy-zephyr-7b-v14.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|transformers>=4.34| -|openbuddy-deepseek-67b-chat|[OpenBuddy/openbuddy-deepseek-67b-v15.2](https://modelscope.cn/models/OpenBuddy/openbuddy-deepseek-67b-v15.2/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|mistral-7b|[AI-ModelScope/Mistral-7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|transformers>=4.34| -|mistral-7b-chat|[AI-ModelScope/Mistral-7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|transformers>=4.34| -|mistral-7b-chat-v2|[AI-ModelScope/Mistral-7B-Instruct-v0.2](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.2/summary)|q_proj, k_proj, v_proj|llama|✔|transformers>=4.34| -|mixtral-7b-moe|[AI-ModelScope/Mixtral-8x7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|transformers>=4.36| -|mixtral-7b-moe-chat|[AI-ModelScope/Mixtral-8x7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|transformers>=4.36| -|baichuan-7b|[baichuan-inc/baichuan-7B](https://modelscope.cn/models/baichuan-inc/baichuan-7B/summary)|W_pack|default-generation|✘|transformers<4.34| -|baichuan-13b|[baichuan-inc/Baichuan-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Base/summary)|W_pack|default-generation|✘|transformers<4.34| -|baichuan-13b-chat|[baichuan-inc/Baichuan-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Chat/summary)|W_pack|baichuan|✘|transformers<4.34| -|baichuan2-7b|[baichuan-inc/Baichuan2-7B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Base/summary)|W_pack|default-generation|✘|| -|baichuan2-7b-chat|[baichuan-inc/Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat/summary)|W_pack|baichuan|✘|| -|baichuan2-7b-chat-int4|[baichuan-inc/Baichuan2-7B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat-4bits/summary)|W_pack|baichuan|✘|| 
-|baichuan2-13b|[baichuan-inc/Baichuan2-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Base/summary)|W_pack|default-generation|✘|| -|baichuan2-13b-chat|[baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat/summary)|W_pack|baichuan|✘|| -|baichuan2-13b-chat-int4|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat-4bits/summary)|W_pack|baichuan|✘|| -|internlm-7b|[Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|internlm-7b-chat|[Shanghai_AI_Laboratory/internlm-chat-7b-v1_1](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-v1_1/summary)|q_proj, k_proj, v_proj|internlm|✘|| -|internlm-7b-chat-8k|[Shanghai_AI_Laboratory/internlm-chat-7b-8k](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-8k/summary)|q_proj, k_proj, v_proj|internlm|✘|| -|internlm-20b|[Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|internlm-20b-chat|[Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-20b/summary)|q_proj, k_proj, v_proj|internlm|✘|| -|xverse-7b|[xverse/XVERSE-7B](https://modelscope.cn/models/xverse/XVERSE-7B/summary)|q_proj, k_proj, v_proj|default-generation|✘|| -|xverse-7b-chat|[xverse/XVERSE-7B-Chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|| -|xverse-13b|[xverse/XVERSE-13B](https://modelscope.cn/models/xverse/XVERSE-13B/summary)|q_proj, k_proj, v_proj|default-generation|✘|| -|xverse-13b-chat|[xverse/XVERSE-13B-Chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|| -|xverse-65b|[xverse/XVERSE-65B](https://modelscope.cn/models/xverse/XVERSE-65B/summary)|q_proj, k_proj, v_proj|default-generation|✘|| -|bluelm-7b|[vivo-ai/BlueLM-7B-Base](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|bluelm-7b-32k|[vivo-ai/BlueLM-7B-Base-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base-32K/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|bluelm-7b-chat|[vivo-ai/BlueLM-7B-Chat](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat/summary)|q_proj, k_proj, v_proj|bluelm|✘|| -|bluelm-7b-chat-32k|[vivo-ai/BlueLM-7B-Chat-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat-32K/summary)|q_proj, k_proj, v_proj|bluelm|✘|| -|ziya2-13b|[Fengshenbang/Ziya2-13B-Base](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|ziya2-13b-chat|[Fengshenbang/Ziya2-13B-Chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary)|q_proj, k_proj, v_proj|ziya|✔|| -|skywork-13b|[skywork/Skywork-13B-base](https://modelscope.cn/models/skywork/Skywork-13B-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|skywork-13b-chat|[skywork/Skywork-13B-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary)|q_proj, k_proj, v_proj|skywork|✘|| -|zephyr-7b-beta-chat|[modelscope/zephyr-7b-beta](https://modelscope.cn/models/modelscope/zephyr-7b-beta/summary)|q_proj, k_proj, v_proj|zephyr|✔|transformers>=4.34| -|sus-34b-chat|[SUSTC/SUS-Chat-34B](https://modelscope.cn/models/SUSTC/SUS-Chat-34B/summary)|q_proj, k_proj, v_proj|sus|✔|| 
-|polylm-13b|[damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary)|c_attn|default-generation|✘|| -|seqgpt-560m|[damo/nlp_seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary)|query_key_value|default-generation|✘|| -|tongyi-finance-14b|[TongyiFinance/Tongyi-Finance-14B](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B/summary)|c_attn|default-generation|✔|| -|tongyi-finance-14b-chat|[TongyiFinance/Tongyi-Finance-14B-Chat](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat/summary)|c_attn|chatml|✔|| -|tongyi-finance-14b-chat-int4|[TongyiFinance/Tongyi-Finance-14B-Chat-Int4](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|codefuse-codellama-34b-chat|[codefuse-ai/CodeFuse-CodeLlama-34B](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary)|q_proj, k_proj, v_proj|codefuse-codellama|✔|| -|deepseek-coder-1_3b|[deepseek-ai/deepseek-coder-1.3b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-coder-1_3b-chat|[deepseek-ai/deepseek-coder-1.3b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|| -|deepseek-coder-6_7b|[deepseek-ai/deepseek-coder-6.7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-coder-6_7b-chat|[deepseek-ai/deepseek-coder-6.7b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|| -|deepseek-coder-33b|[deepseek-ai/deepseek-coder-33b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-coder-33b-chat|[deepseek-ai/deepseek-coder-33b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|| +| Model Type | Model ID | Default Lora Target Modules | Default Template | Support Flash Attn | Support VLLM | Requires | +| --------- | -------- | --------------------------- | ---------------- | ------------------ | ------------ | -------- | +|qwen-1_8b|[qwen/Qwen-1_8B](https://modelscope.cn/models/qwen/Qwen-1_8B/summary)|c_attn|default-generation|✔|✔|| +|qwen-1_8b-chat|[qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-1_8b-chat-int4|[qwen/Qwen-1_8B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-1_8b-chat-int8|[qwen/Qwen-1_8B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-7b|[qwen/Qwen-7B](https://modelscope.cn/models/qwen/Qwen-7B/summary)|c_attn|default-generation|✔|✔|| +|qwen-7b-chat|[qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-7b-chat-int4|[qwen/Qwen-7B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-7b-chat-int8|[qwen/Qwen-7B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-14b|[qwen/Qwen-14B](https://modelscope.cn/models/qwen/Qwen-14B/summary)|c_attn|default-generation|✔|✔|| 
+|qwen-14b-chat|[qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-14b-chat-int4|[qwen/Qwen-14B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-14b-chat-int8|[qwen/Qwen-14B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-72b|[qwen/Qwen-72B](https://modelscope.cn/models/qwen/Qwen-72B/summary)|c_attn|default-generation|✔|✔|| +|qwen-72b-chat|[qwen/Qwen-72B-Chat](https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-72b-chat-int4|[qwen/Qwen-72B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-72b-chat-int8|[qwen/Qwen-72B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-vl|[qwen/Qwen-VL](https://modelscope.cn/models/qwen/Qwen-VL/summary)|c_attn|default-generation|✔|✘|| +|qwen-vl-chat|[qwen/Qwen-VL-Chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary)|c_attn|chatml|✔|✘|| +|qwen-vl-chat-int4|[qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-audio|[qwen/Qwen-Audio](https://modelscope.cn/models/qwen/Qwen-Audio/summary)|c_attn|default-generation|✔|✘|| +|qwen-audio-chat|[qwen/Qwen-Audio-Chat](https://modelscope.cn/models/qwen/Qwen-Audio-Chat/summary)|c_attn|chatml|✔|✘|| +|chatglm2-6b|[ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b/summary)|query_key_value|chatglm2|✘|✔|| +|chatglm2-6b-32k|[ZhipuAI/chatglm2-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm2-6b-32k/summary)|query_key_value|chatglm2|✘|✔|| +|chatglm3-6b-base|[ZhipuAI/chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base/summary)|query_key_value|chatglm-generation|✘|✔|| +|chatglm3-6b|[ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary)|query_key_value|chatglm3|✘|✔|| +|chatglm3-6b-32k|[ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary)|query_key_value|chatglm3|✘|✔|| +|llama2-7b|[modelscope/Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|llama2-7b-chat|[modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔|| +|llama2-13b|[modelscope/Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|llama2-13b-chat|[modelscope/Llama-2-13b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔|| +|llama2-70b|[modelscope/Llama-2-70b-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|llama2-70b-chat|[modelscope/Llama-2-70b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔|| +|yi-6b|[01ai/Yi-6B](https://modelscope.cn/models/01ai/Yi-6B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-6b-200k|[01ai/Yi-6B-200K](https://modelscope.cn/models/01ai/Yi-6B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-6b-chat|[01ai/Yi-6B-Chat](https://modelscope.cn/models/01ai/Yi-6B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|✔|| 
+|yi-34b|[01ai/Yi-34B](https://modelscope.cn/models/01ai/Yi-34B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-34b-200k|[01ai/Yi-34B-200K](https://modelscope.cn/models/01ai/Yi-34B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-34b-chat|[01ai/Yi-34B-Chat](https://modelscope.cn/models/01ai/Yi-34B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|✔|| +|deepseek-7b|[deepseek-ai/deepseek-llm-7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|deepseek-7b-chat|[deepseek-ai/deepseek-llm-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|✔|| +|deepseek-67b|[deepseek-ai/deepseek-llm-67b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|deepseek-67b-chat|[deepseek-ai/deepseek-llm-67b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|✔|| +|openbuddy-llama2-13b-chat|[OpenBuddy/openbuddy-llama2-13b-v8.1-fp16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|openbuddy-llama-65b-chat|[OpenBuddy/openbuddy-llama-65b-v8-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama-65b-v8-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|openbuddy-llama2-70b-chat|[OpenBuddy/openbuddy-llama2-70b-v10.1-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-70b-v10.1-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|openbuddy-mistral-7b-chat|[OpenBuddy/openbuddy-mistral-7b-v13.1](https://modelscope.cn/models/OpenBuddy/openbuddy-mistral-7b-v13.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|transformers>=4.34| +|openbuddy-zephyr-7b-chat|[OpenBuddy/openbuddy-zephyr-7b-v14.1](https://modelscope.cn/models/OpenBuddy/openbuddy-zephyr-7b-v14.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|transformers>=4.34| +|openbuddy-deepseek-67b-chat|[OpenBuddy/openbuddy-deepseek-67b-v15.2](https://modelscope.cn/models/OpenBuddy/openbuddy-deepseek-67b-v15.2/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|mistral-7b|[AI-ModelScope/Mistral-7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|transformers>=4.34| +|mistral-7b-chat|[AI-ModelScope/Mistral-7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|✔|transformers>=4.34| +|mistral-7b-chat-v2|[AI-ModelScope/Mistral-7B-Instruct-v0.2](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.2/summary)|q_proj, k_proj, v_proj|llama|✔|✔|transformers>=4.34| +|mixtral-7b-moe|[AI-ModelScope/Mixtral-8x7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|transformers>=4.36| +|mixtral-7b-moe-chat|[AI-ModelScope/Mixtral-8x7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|✔|transformers>=4.36| +|baichuan-7b|[baichuan-inc/baichuan-7B](https://modelscope.cn/models/baichuan-inc/baichuan-7B/summary)|W_pack|default-generation|✘|✔|transformers<4.34| +|baichuan-13b|[baichuan-inc/Baichuan-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Base/summary)|W_pack|default-generation|✘|✔|transformers<4.34| 
+|baichuan-13b-chat|[baichuan-inc/Baichuan-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Chat/summary)|W_pack|baichuan|✘|✔|transformers<4.34| +|baichuan2-7b|[baichuan-inc/Baichuan2-7B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Base/summary)|W_pack|default-generation|✘|✔|| +|baichuan2-7b-chat|[baichuan-inc/Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat/summary)|W_pack|baichuan|✘|✔|| +|baichuan2-7b-chat-int4|[baichuan-inc/Baichuan2-7B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat-4bits/summary)|W_pack|baichuan|✘|✘|| +|baichuan2-13b|[baichuan-inc/Baichuan2-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Base/summary)|W_pack|default-generation|✘|✔|| +|baichuan2-13b-chat|[baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat/summary)|W_pack|baichuan|✘|✔|| +|baichuan2-13b-chat-int4|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat-4bits/summary)|W_pack|baichuan|✘|✘|| +|internlm-7b|[Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✔|| +|internlm-7b-chat|[Shanghai_AI_Laboratory/internlm-chat-7b-v1_1](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-v1_1/summary)|q_proj, k_proj, v_proj|internlm|✘|✔|| +|internlm-7b-chat-8k|[Shanghai_AI_Laboratory/internlm-chat-7b-8k](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-8k/summary)|q_proj, k_proj, v_proj|internlm|✘|✔|| +|internlm-20b|[Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✔|| +|internlm-20b-chat|[Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-20b/summary)|q_proj, k_proj, v_proj|internlm|✘|✔|| +|xverse-7b|[xverse/XVERSE-7B](https://modelscope.cn/models/xverse/XVERSE-7B/summary)|q_proj, k_proj, v_proj|default-generation|✘|✘|| +|xverse-7b-chat|[xverse/XVERSE-7B-Chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|✘|| +|xverse-13b|[xverse/XVERSE-13B](https://modelscope.cn/models/xverse/XVERSE-13B/summary)|q_proj, k_proj, v_proj|default-generation|✘|✘|| +|xverse-13b-chat|[xverse/XVERSE-13B-Chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|✘|| +|xverse-65b|[xverse/XVERSE-65B](https://modelscope.cn/models/xverse/XVERSE-65B/summary)|q_proj, k_proj, v_proj|default-generation|✘|✘|| +|bluelm-7b|[vivo-ai/BlueLM-7B-Base](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✘|| +|bluelm-7b-32k|[vivo-ai/BlueLM-7B-Base-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base-32K/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✘|| +|bluelm-7b-chat|[vivo-ai/BlueLM-7B-Chat](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat/summary)|q_proj, k_proj, v_proj|bluelm|✘|✘|| +|bluelm-7b-chat-32k|[vivo-ai/BlueLM-7B-Chat-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat-32K/summary)|q_proj, k_proj, v_proj|bluelm|✘|✘|| +|ziya2-13b|[Fengshenbang/Ziya2-13B-Base](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|ziya2-13b-chat|[Fengshenbang/Ziya2-13B-Chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary)|q_proj, 
k_proj, v_proj|ziya|✔|✔||
+|skywork-13b|[skywork/Skywork-13B-base](https://modelscope.cn/models/skywork/Skywork-13B-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✘||
+|skywork-13b-chat|[skywork/Skywork-13B-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary)|q_proj, k_proj, v_proj|skywork|✘|✘||
+|zephyr-7b-beta-chat|[modelscope/zephyr-7b-beta](https://modelscope.cn/models/modelscope/zephyr-7b-beta/summary)|q_proj, k_proj, v_proj|zephyr|✔|✔|transformers>=4.34|
+|sus-34b-chat|[SUSTC/SUS-Chat-34B](https://modelscope.cn/models/SUSTC/SUS-Chat-34B/summary)|q_proj, k_proj, v_proj|sus|✔|✔||
+|polylm-13b|[damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary)|c_attn|default-generation|✘|✘||
+|seqgpt-560m|[damo/nlp_seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary)|query_key_value|default-generation|✘|✔||
+|tongyi-finance-14b|[TongyiFinance/Tongyi-Finance-14B](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B/summary)|c_attn|default-generation|✔|✔||
+|tongyi-finance-14b-chat|[TongyiFinance/Tongyi-Finance-14B-Chat](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat/summary)|c_attn|chatml|✔|✔||
+|tongyi-finance-14b-chat-int4|[TongyiFinance/Tongyi-Finance-14B-Chat-Int4](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5|
+|codefuse-codellama-34b-chat|[codefuse-ai/CodeFuse-CodeLlama-34B](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary)|q_proj, k_proj, v_proj|codefuse-codellama|✔|✔||
+|deepseek-coder-1_3b|[deepseek-ai/deepseek-coder-1.3b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔||
+|deepseek-coder-1_3b-chat|[deepseek-ai/deepseek-coder-1.3b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|✔||
+|deepseek-coder-6_7b|[deepseek-ai/deepseek-coder-6.7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔||
+|deepseek-coder-6_7b-chat|[deepseek-ai/deepseek-coder-6.7b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|✔||
+|deepseek-coder-33b|[deepseek-ai/deepseek-coder-33b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔||
+|deepseek-coder-33b-chat|[deepseek-ai/deepseek-coder-33b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|✔||

## Datasets

diff --git a/docs/source/LLM/自我认知微调最佳实践.md b/docs/source/LLM/自我认知微调最佳实践.md
index 785463c31..97636e549 100644
--- a/docs/source/LLM/自我认知微调最佳实践.md
+++ b/docs/source/LLM/自我认知微调最佳实践.md
@@ -283,6 +283,7 @@ CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'qwen-7b-chat/vx-xxx/checkpoint-x
## Learn More
- Quickly run **inference** on an LLM and build a **Web-UI**; see the [LLM Inference Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
- Quickly **fine-tune** an LLM, run inference, and build a Web-UI; see the [LLM Fine-tuning Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md).
+- Use VLLM for **inference acceleration** and **deployment**; see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
- View the models and datasets supported by swift; see [Supported Models and Datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
- **Extend** the models, datasets, and chat templates in swift; see [Customization and Extension](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
- Look up the command-line arguments for fine-tuning and inference; see [Command-Line Parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
diff --git a/examples/pytorch/llm/app.py b/examples/pytorch/llm/app.py
index 8d7e9c40a..c9a208303 100644
--- a/examples/pytorch/llm/app.py
+++ b/examples/pytorch/llm/app.py
@@ -1,8 +1,9 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
# import os
# os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+import custom

-from swift.llm import InferArguments, ModelType
-from swift.llm.run import app_ui_main
+from swift.llm import InferArguments, ModelType, app_ui_main

if __name__ == '__main__':
    # Please refer to the `infer.sh` for setting the parameters.
diff --git a/examples/pytorch/llm/llm_infer.py b/examples/pytorch/llm/llm_infer.py
index 1e247b46e..7fa096807 100644
--- a/examples/pytorch/llm/llm_infer.py
+++ b/examples/pytorch/llm/llm_infer.py
@@ -1,7 +1,7 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
import custom

-from swift.llm.run import infer_main
+from swift.llm import infer_main

if __name__ == '__main__':
    result = infer_main()
diff --git a/examples/pytorch/llm/llm_sft.py b/examples/pytorch/llm/llm_sft.py
index a1c9fc398..899c6e41e 100644
--- a/examples/pytorch/llm/llm_sft.py
+++ b/examples/pytorch/llm/llm_sft.py
@@ -1,7 +1,7 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
import custom

-from swift.llm.run import sft_main
+from swift.llm import sft_main

if __name__ == '__main__':
    output = sft_main()
diff --git a/examples/pytorch/llm/rome_infer.py b/examples/pytorch/llm/rome_infer.py
index 139759a47..db9cc077b 100644
--- a/examples/pytorch/llm/rome_infer.py
+++ b/examples/pytorch/llm/rome_infer.py
@@ -1,6 +1,6 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
-from swift.llm.run import rome_main
+from swift.llm import rome_main

if __name__ == '__main__':
    rome_main()
diff --git a/scripts/utils/test_readme.py b/scripts/tests/test_readme.py
similarity index 100%
rename from scripts/utils/test_readme.py
rename to scripts/tests/test_readme.py
diff --git a/scripts/tests/test_vllm.py/main.py b/scripts/tests/test_vllm.py/main.py
new file mode 100644
index 000000000..7bf7379bc
--- /dev/null
+++ b/scripts/tests/test_vllm.py/main.py
@@ -0,0 +1,18 @@
+import os
+import subprocess
+
+from swift.llm import ModelType
+
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+if __name__ == '__main__':
+    model_name_list = ModelType.get_model_name_list()
+    success_model_list = []
+    fpath = os.path.join(os.path.dirname(__file__), 'utils.py')
+    for model_name in model_name_list:
+        code = subprocess.run(['python', fpath, '--model_type', model_name])
+        if code.returncode == 0:
+            success_model_list.append(model_name)
+        else:
+            print(f'model_name: {model_name} does not support vllm.')
+    print(success_model_list)
diff --git a/scripts/tests/test_vllm.py/utils.py b/scripts/tests/test_vllm.py/utils.py
new file mode 100644
index 000000000..4abe73528
--- /dev/null
+++ b/scripts/tests/test_vllm.py/utils.py
@@ -0,0 +1,31 @@
+from dataclasses import dataclass
+
+from swift.llm import (get_default_template_type, get_template,
+                       get_vllm_engine, inference_vllm)
+from swift.utils import get_main
+
+
+@dataclass
+class VLLMTestArgs:
+    model_type: str
+
+
+def test_vllm(args: VLLMTestArgs) -> None:
+    model_type = args.model_type
+    llm_engine = get_vllm_engine(model_type)
+    template_type = get_default_template_type(model_type)
+    template = get_template(template_type, llm_engine.tokenizer)
+
+    llm_engine.generation_config.max_new_tokens = 256
+
+    request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
+    resp_list = inference_vllm(llm_engine, template, request_list)
+    for request, resp in zip(request_list, resp_list):
+        print(f"query: {request['query']}")
+        print(f"response: {resp['response']}")
+
+
+test_vllm_main = get_main(VLLMTestArgs, test_vllm)
+
+if __name__ == '__main__':
+    test_vllm_main()
diff --git a/scripts/utils/run_model_info.py b/scripts/utils/run_model_info.py
index f062e657c..43c9b6154 100644
--- a/scripts/utils/run_model_info.py
+++ b/scripts/utils/run_model_info.py
@@ -8,9 +8,9 @@ def write_model_info_table2(fpath: str) -> None:
    with open(fpath, 'w', encoding='utf-8') as f:
        f.write(
            '| Model Type | Model ID | Default Lora Target Modules | Default Template |'
-            ' Support Flash Attn | Requires |\n'
+            ' Support Flash Attn | Support VLLM | Requires |\n'
            '| --------- | -------- | --------------------------- | ---------------- |'
-            ' ------------------ | -------- |\n')
+            ' ------------------ | ------------ | -------- |\n')
    res = []
    bool_mapping = {True: '✔', False: '✘'}
    for model_name in model_name_list:
@@ -20,16 +20,18 @@
        template = model_info['template']
        support_flash_attn = model_info.get('support_flash_attn', False)
        support_flash_attn = bool_mapping[support_flash_attn]
+        support_vllm = model_info.get('support_vllm', False)
+        support_vllm = bool_mapping[support_vllm]
        requires = ', '.join(model_info['requires'])
        r = [
            model_name, model_id, lora_target_modules, template,
-            support_flash_attn, requires
+            support_flash_attn, support_vllm, requires
        ]
        res.append(r)
    text = ''
    for r in res:
        url = f'https://modelscope.cn/models/{r[1]}/summary'
-        text += f'|{r[0]}|[{r[1]}]({url})|{r[2]}|{r[3]}|{r[4]}|{r[5]}|\n'
+        text +=
f'|{r[0]}|[{r[1]}]({url})|{r[2]}|{r[3]}|{r[4]}|{r[5]}|{r[6]}|\n' with open(fpath, 'a', encoding='utf-8') as f: f.write(text) print() diff --git a/swift/cli/app_ui.py b/swift/cli/app_ui.py index 93734c2d4..b3b135539 100644 --- a/swift/cli/app_ui.py +++ b/swift/cli/app_ui.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import app_ui_main +from swift.llm import app_ui_main if __name__ == '__main__': app_ui_main() diff --git a/swift/cli/infer.py b/swift/cli/infer.py index d855ae735..2dce4f3ac 100644 --- a/swift/cli/infer.py +++ b/swift/cli/infer.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import infer_main +from swift.llm import infer_main if __name__ == '__main__': infer_main() diff --git a/swift/cli/main.py b/swift/cli/main.py index d3aa4a3d6..2e1c92a45 100644 --- a/swift/cli/main.py +++ b/swift/cli/main.py @@ -1,17 +1,16 @@ # Copyright (c) Alibaba, Inc. and its affiliates. +import importlib.util import os import subprocess import sys from typing import Dict, List, Optional -from swift.cli import app_ui, infer, merge_lora, sft, ui - ROUTE_MAPPING: Dict[str, str] = { - 'sft': sft.__file__, - 'infer': infer.__file__, - 'app-ui': app_ui.__file__, - 'merge-lora': merge_lora.__file__, - 'web-ui': ui.__file__ + 'sft': 'swift.cli.sft', + 'infer': 'swift.cli.infer', + 'app-ui': 'swift.cli.app_ui', + 'merge-lora': 'swift.cli.merge_lora', + 'web-ui': 'swift.cli.web_ui' } ROUTE_MAPPING.update( @@ -46,7 +45,7 @@ def cli_main() -> None: argv = sys.argv[1:] method_name = argv[0] argv = argv[1:] - file_path = ROUTE_MAPPING[method_name] + file_path = importlib.util.find_spec(ROUTE_MAPPING[method_name]).origin torchrun_args = get_torchrun_args() if torchrun_args is None or method_name != 'sft': args = ['python', file_path, *argv] diff --git a/swift/cli/merge_lora.py b/swift/cli/merge_lora.py index e17f453b4..5d35074b2 100644 --- a/swift/cli/merge_lora.py +++ b/swift/cli/merge_lora.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import merge_lora_main +from swift.llm import merge_lora_main if __name__ == '__main__': merge_lora_main(replace_if_exists=True) diff --git a/swift/cli/sft.py b/swift/cli/sft.py index 54d5ad638..6e52c4e0e 100644 --- a/swift/cli/sft.py +++ b/swift/cli/sft.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import sft_main +from swift.llm import sft_main if __name__ == '__main__': sft_main() diff --git a/swift/cli/ui.py b/swift/cli/ui.py deleted file mode 100644 index d494d112c..000000000 --- a/swift/cli/ui.py +++ /dev/null @@ -1,4 +0,0 @@ -from swift.ui.app import run_ui - -if __name__ == '__main__': - run_ui() diff --git a/swift/cli/web_ui.py b/swift/cli/web_ui.py index 93734c2d4..53d1f02a6 100644 --- a/swift/cli/web_ui.py +++ b/swift/cli/web_ui.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import app_ui_main +from swift.ui.app import run_ui if __name__ == '__main__': - app_ui_main() + run_ui() diff --git a/swift/llm/app_ui.py b/swift/llm/app_ui.py index 0cf770645..26fbe49ed 100644 --- a/swift/llm/app_ui.py +++ b/swift/llm/app_ui.py @@ -1,7 +1,7 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from typing import Tuple -from .infer import prepare_model_template +from .infer import merge_lora, prepare_model_template from .utils import (History, InferArguments, inference_stream, limit_history_length) @@ -12,12 +12,26 @@ def clear_session() -> History: def gradio_generation_demo(args: InferArguments) -> None: import gradio as gr - model, template = prepare_model_template(args) + if args.merge_lora_and_save: + merge_lora(args) + if args.infer_backend == 'vllm': + from swift.llm import prepare_vllm_engine_template, inference_stream_vllm, inference_vllm + llm_engine, template = prepare_vllm_engine_template(args) + else: + model, template = prepare_model_template(args) def model_generation(query: str) -> str: - gen = inference_stream(model, template, query, None) - for response, _ in gen: - yield response + if args.infer_backend == 'vllm': + gen = inference_stream_vllm(llm_engine, template, [{ + 'query': query + }]) + for resp_list in gen: + response = resp_list[0]['response'] + yield response + else: + gen = inference_stream(model, template, query, None) + for response, _ in gen: + yield response model_name = args.model_type.title() @@ -35,22 +49,39 @@ def gradio_generation_demo(args: InferArguments) -> None: def gradio_chat_demo(args: InferArguments) -> None: import gradio as gr - model, template = prepare_model_template(args) + if args.merge_lora_and_save: + merge_lora(args) + if args.infer_backend == 'vllm': + from swift.llm import prepare_vllm_engine_template, inference_stream_vllm + llm_engine, template = prepare_vllm_engine_template(args) + else: + model, template = prepare_model_template(args) def model_chat(query: str, history: History) -> Tuple[str, History]: old_history, history = limit_history_length(template, query, history, args.max_length) - gen = inference_stream(model, template, query, history) - for _, history in gen: - total_history = old_history + history - yield '', total_history + if args.infer_backend == 'vllm': + gen = inference_stream_vllm(llm_engine, template, + [{ + 'query': query, + 'history': history + }]) + for resp_list in gen: + history = resp_list[0]['history'] + total_history = old_history + history + yield '', total_history + else: + gen = inference_stream(model, template, query, history) + for _, history in gen: + total_history = old_history + history + yield '', total_history model_name = args.model_type.title() with gr.Blocks() as demo: gr.Markdown(f'