diff --git a/README.md b/README.md
index 8e0c0b069..af7a5b46f 100644
--- a/README.md
+++ b/README.md
@@ -60,6 +60,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用
 ## 🎉 News
+- 2023.12.18: Support **VLLM** for inference acceleration and deployment. For more details, refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
 - 2023.12.15: Support **deepseek**, **deepseek-coder** series: deepseek-7b, deepseek-7b-chat, deepseek-67b, deepseek-67b-chat, openbuddy-deepseek-67b-chat, deepseek-coder-1_3b, deepseek-coder-1_3b-chat, deepseek-coder-6_7b, deepseek-coder-6_7b-chat, deepseek-coder-33b, deepseek-coder-33b-chat.
 - 2023.12.13: Support mistral-7b-chat-v2, [mixtral-7b-moe](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe), [mixtral-7b-moe-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe_chat).
 - 2023.12.9: Support the `freeze_parameters` parameter as a compromise between LoRA and full-parameter training. Corresponding shell scripts can be found at [full_freeze_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_freeze_ddp). Support `disable_tqdm`, `lazy_tokenize`, `preprocess_num_proc` parameters, for details please refer to [Command-Line parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
@@ -102,6 +103,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用
 - **Self-cognition fine-tuning** for large models in **10 minutes**, creating a personalized large model, please refer to [Best Practices for Self-cognition Fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自我认知微调最佳实践.md).
 - Quickly perform **inference** on LLM and build a **Web-UI**, see the [LLM Inference Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
 - Rapidly **fine-tune** and perform inference on LLM, and build a Web-UI. See the [LLM Fine-tuning Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md).
+- Utilize VLLM for **inference acceleration** and **deployment**. Please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md) for more information.
 - View the models and datasets supported by Swift. You can check [supported models and datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
 - Expand and customize models, datasets, and dialogue templates in Swift, see [Customization and Expansion](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
 - Check command-line parameters for fine-tuning and inference, see [Command-Line parameters](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
diff --git a/README_CN.md b/README_CN.md
index c550a6671..0d6a213e3 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -58,6 +58,7 @@ SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展
 用户可以查看 [SWIFT官方文档](docs/source/GetStarted/快速使用.md) 来了解详细信息。
 ## 🎉 新闻
+- 2023.12.18: 支持**VLLM**进行推理加速和部署. 具体可以查看[VLLM推理加速与部署](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
- 2023.12.15: 支持**deepseek**, **deepseek-coder**系列: deepseek-7b, deepseek-7b-chat, deepseek-67b, deepseek-67b-chat, openbuddy-deepseek-67b-chat, deepseek-coder-1_3b, deepseek-coder-1_3b-chat, deepseek-coder-6_7b, deepseek-coder-6_7b-chat, deepseek-coder-33b, deepseek-coder-33b-chat. - 2023.12.13: 支持mistral-7b-chat-v2, [mixtral-7b-moe](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe), [mixtral-7b-moe-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/mixtral_7b_moe_chat). - 2023.12.9: 支持`freeze_parameters`参数, 作为lora和全参数训练的折中方案. 对应的sh可以查看[full_freeze_ddp](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_freeze_ddp). 支持`disable_tqdm`, `lazy_tokenize`, `preprocess_num_proc`参数, 具体可以查看[命令行参数](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md). @@ -100,6 +101,7 @@ SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展 - **10分钟**对大模型进行**自我认知微调**, 创建专属于自己的大模型, 可以查看[自我认知微调最佳实践](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自我认知微调最佳实践.md). - 快速对LLM进行**推理**, 搭建**Web-UI**, 可以查看[LLM推理文档](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md). - 快速对LLM进行**微调**, 推理并搭建Web-UI. 可以查看[LLM微调文档](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md). +- 使用VLLM进行**推理加速**和**部署**. 可以查看[VLLM推理加速与部署](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md). - 查看swift支持的模型和数据集. 可以查看[支持的模型和数据集](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md). - 对swift中的模型, 数据集, 对话模板进行**拓展**, 可以查看[自定义与拓展](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md). - 查询微调和推理的命令行参数, 可以查看[命令行参数](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md). 
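> Editor's note: to make the newly announced VLLM path concrete for readers skimming the README changes, the snippet below condenses the qwen-7b-chat example from the new `docs/source/LLM/VLLM推理加速与部署.md` introduced later in this diff. It is only a sketch; the environment preparation described in that document (installing `ms-swift[llm]` and a CUDA-matched `vllm`) still applies.

```python
# Minimal sketch condensed from the qwen-7b-chat example in the new
# VLLM doc added later in this diff; assumes the vllm environment is prepared.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (ModelType, get_vllm_engine, get_default_template_type,
                       get_template, inference_vllm)

model_type = ModelType.qwen_7b_chat
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.tokenizer)
llm_engine.generation_config.max_new_tokens = 256  # GenerationConfig-like interface

request_list = [{'query': '你好!'}]
resp_list = inference_vllm(llm_engine, template, request_list)
print(resp_list[0]['response'])
```

> The same engine can also be selected from the command line via the new `--infer_backend vllm` flag documented in `命令行参数.md` further down in this diff.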
@@ -333,6 +335,7 @@ output 本项目使用[Apache License (Version 2.0)](https://github.com/modelscope/modelscope/blob/master/LICENSE)进行许可。 + ## ☎ 联系我们 您可以通过加我们的微信群, 来和我们联系和交流: diff --git a/docs/source/LLM/LLM微调文档.md b/docs/source/LLM/LLM微调文档.md index a194a995a..ffa72c5d6 100644 --- a/docs/source/LLM/LLM微调文档.md +++ b/docs/source/LLM/LLM微调文档.md @@ -222,6 +222,8 @@ swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx' ``` ## 推理 +如果你要使用VLLM进行推理加速, 可以查看[VLLM推理加速与部署](./VLLM推理加速与部署.md#微调后的模型) + ### 原始模型 **单样本推理**可以查看[LLM推理文档](./LLM推理文档.md#-推理) @@ -230,7 +232,7 @@ swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx' CUDA_VISIBLE_DEVICES=0 swift infer --model_id_or_path qwen/Qwen-7B-Chat --dataset blossom-math-zh ``` ### 微调后模型 -**单样本推理** +**单样本推理**: 使用LoRA**增量**权重进行推理: ```python @@ -241,13 +243,12 @@ from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, get_default_template_type ) from swift.tuners import Swift -import torch model_dir = 'vx_xxx/checkpoint-100' model_type = ModelType.qwen_7b_chat template_type = get_default_template_type(model_type) -model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'}) +model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}) model = Swift.from_pretrained(model, model_dir, inference_mode=True) template = get_template(template_type, tokenizer) @@ -265,13 +266,12 @@ os.environ['CUDA_VISIBLE_DEVICES'] = '0' from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, get_default_template_type ) -import torch model_dir = 'vx_xxx/checkpoint-100-merged' model_type = ModelType.qwen_7b_chat template_type = get_default_template_type(model_type) -model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'}, +model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, model_dir=model_dir) template = get_template(template_type, tokenizer) @@ -292,6 +292,8 @@ CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged' ``` ## Web-UI +如果你要使用VLLM进行部署并提供**API**接口, 可以查看[VLLM推理加速与部署](./VLLM推理加速与部署.md#部署) + ### 原始模型 使用原始模型的web-ui可以查看[LLM推理文档](./LLM推理文档.md#-Web-UI) diff --git a/docs/source/LLM/LLM推理文档.md b/docs/source/LLM/LLM推理文档.md index 0c0d734bf..210320e8f 100644 --- a/docs/source/LLM/LLM推理文档.md +++ b/docs/source/LLM/LLM推理文档.md @@ -1,4 +1,6 @@ # LLM推理文档 +如果你要使用vllm进行推理加速, 可以查看[VLLM推理加速与部署](./VLLM推理加速与部署.md#推理加速) + ## 目录 - [环境准备](#环境准备) - [推理](#推理) @@ -34,7 +36,6 @@ from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, get_default_template_type, ) from swift.utils import seed_everything -import torch model_type = ModelType.qwen_7b_chat template_type = get_default_template_type(model_type) @@ -44,7 +45,7 @@ print(f'template_type: {template_type}') # template_type: chatml kwargs = {} # kwargs['use_flash_attn'] = True # 使用flash_attn -model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'}, **kwargs) +model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, **kwargs) # 修改max_new_tokens model.generation_config.max_new_tokens = 128 @@ -97,7 +98,6 @@ from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, get_default_template_type, ) from swift.utils import seed_everything -import torch model_type = ModelType.qwen_7b_chat_int4 template_type = get_default_template_type(model_type) @@ -135,13 +135,12 @@ from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, 
get_default_template_type, ) from swift.utils import seed_everything -import torch model_type = ModelType.qwen_7b template_type = get_default_template_type(model_type) print(f'template_type: {template_type}') # template_type: default-generation -model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, {'device_map': 'auto'}) +model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}) model.generation_config.max_new_tokens = 64 template = get_template(template_type, tokenizer) seed_everything(42) @@ -177,7 +176,6 @@ from swift.llm import ( get_model_tokenizer, get_template, inference_stream, ModelType, get_default_template_type, ) from swift.utils import seed_everything -import torch model_type = ModelType.qwen_7b_chat template_type = get_default_template_type(model_type) @@ -219,7 +217,6 @@ from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, get_default_template_type, ) from swift.utils import seed_everything -import torch model_type = ModelType.qwen_vl_chat template_type = get_default_template_type(model_type) @@ -262,7 +259,6 @@ from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, get_default_template_type, ) from swift.utils import seed_everything -import torch model_type = ModelType.qwen_audio_chat template_type = get_default_template_type(model_type) @@ -304,7 +300,6 @@ from swift.llm import ( get_model_tokenizer, get_template, inference, ModelType, get_default_template_type, ) from swift.utils import seed_everything -import torch model_type = ModelType.chatglm3_6b template_type = get_default_template_type(model_type) @@ -430,7 +425,7 @@ app_ui_main(infer_args) ### qwen-7b 使用CLI: ```bash -swift app-ui --model_id_or_path qwen/Qwen-7B +CUDA_VISIBLE_DEVICES=0 swift app-ui --model_id_or_path qwen/Qwen-7B ``` 使用python: diff --git a/docs/source/LLM/VLLM推理加速与部署.md b/docs/source/LLM/VLLM推理加速与部署.md new file mode 100644 index 000000000..8af042bb6 --- /dev/null +++ b/docs/source/LLM/VLLM推理加速与部署.md @@ -0,0 +1,219 @@ + +# VLLM推理加速与部署 + +## 目录 +- [环境准备](#环境准备) +- [推理加速](#推理加速) +- [Web-UI加速](#web-ui加速) +- [部署](#部署) + +## 环境准备 +GPU设备: A10, 3090, V100, A100均可. +```bash +# 设置pip全局镜像 +pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/ +# 安装ms-swift +git clone https://github.com/modelscope/swift.git +cd swift +pip install -e .[llm] + +# vllm与cuda版本有对应关系,请按照`https://docs.vllm.ai/en/latest/getting_started/installation.html`选择版本 +pip install vllm -U + +# 如果你想要使用基于auto_gptq的模型进行推理. 
+# 使用auto_gptq的模型: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md#模型` +# auto_gptq和cuda版本有对应关系,请按照`https://github.com/PanQiWei/AutoGPTQ#quick-installation`选择版本 +pip install auto_gptq -U + +# 环境对齐 (如果你运行错误, 可以跑下面的代码, 仓库使用最新环境测试) +pip install -r requirements/framework.txt -U +pip install -r requirements/llm.txt -U +``` + +## 推理加速 + +### qwen-7b-chat +```python +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +from swift.llm import ( + ModelType, get_vllm_engine, get_default_template_type, + get_template, inference_vllm +) + +model_type = ModelType.qwen_7b_chat +llm_engine = get_vllm_engine(model_type) +template_type = get_default_template_type(model_type) +template = get_template(template_type, llm_engine.tokenizer) +# 与`transformers.GenerationConfig`类似的接口 +llm_engine.generation_config.max_new_tokens = 256 + +request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}] +resp_list = inference_vllm(llm_engine, template, request_list) +for request, resp in zip(request_list, resp_list): + print(f"query: {request['query']}") + print(f"response: {resp['response']}") + +history1 = resp_list[1]['history'] +request_list = [{'query': '这有什么好吃的', 'history': history1}] +resp_list = inference_vllm(llm_engine, template, request_list) +for request, resp in zip(request_list, resp_list): + print(f"query: {request['query']}") + print(f"response: {resp['response']}") + print(f"history: {resp['history']}") + +"""Out[0] +query: 你好! +response: 你好!很高兴为你服务。有什么我可以帮助你的吗? +query: 浙江的省会在哪? +response: 浙江省会是杭州市。 +query: 这有什么好吃的 +response: 杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。 +history: [('浙江的省会在哪?', '浙江省会是杭州市。'), ('这有什么好吃的', '杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。')] +""" +``` + +### 流式输出 +```python +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +from swift.llm import ( + ModelType, get_vllm_engine, get_default_template_type, + get_template, inference_stream_vllm +) + +model_type = ModelType.qwen_7b_chat +llm_engine = get_vllm_engine(model_type) +template_type = get_default_template_type(model_type) +template = get_template(template_type, llm_engine.tokenizer) +# 与`transformers.GenerationConfig`类似的接口 +llm_engine.generation_config.max_new_tokens = 256 + +request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}] +gen = inference_stream_vllm(llm_engine, template, request_list) +query_list = [request['query'] for request in request_list] +print(f"query_list: {query_list}") +for resp_list in gen: + response_list = [resp['response'] for resp in resp_list] + print(f'response_list: {response_list}') + +history1 = resp_list[1]['history'] +request_list = [{'query': '这有什么好吃的', 'history': history1}] +gen = inference_stream_vllm(llm_engine, template, request_list) +query = request_list[0]['query'] +print(f"query: {query}") +for resp_list in gen: + response = resp_list[0]['response'] + print(f'response: {response}') + +history = resp_list[0]['history'] +print(f'history: {history}') + +"""Out[0] +query_list: ['你好!', '浙江的省会在哪?'] +... +response_list: ['你好!很高兴为你服务。有什么我可以帮助你的吗?', '浙江省会是杭州市。'] +query: 这有什么好吃的 +... 
+response: 杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。 +history: [('浙江的省会在哪?', '浙江省会是杭州市。'), ('这有什么好吃的', '杭州是一个美食之城,拥有许多著名的菜肴和小吃,例如西湖醋鱼、东坡肉、叫化童子鸡等。此外,杭州还有许多小吃店,可以品尝到各种各样的本地美食。')] +""" +``` + +### chatglm3 +```python +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +from swift.llm import ( + ModelType, get_vllm_engine, get_default_template_type, + get_template, inference_vllm +) + +model_type = ModelType.chatglm3_6b +llm_engine = get_vllm_engine(model_type) +template_type = get_default_template_type(model_type) +template = get_template(template_type, llm_engine.tokenizer) +# 与`transformers.GenerationConfig`类似的接口 +llm_engine.generation_config.max_new_tokens = 256 + +request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}] +resp_list = inference_vllm(llm_engine, template, request_list) +for request, resp in zip(request_list, resp_list): + print(f"query: {request['query']}") + print(f"response: {resp['response']}") + +history1 = resp_list[1]['history'] +request_list = [{'query': '这有什么好吃的', 'history': history1}] +resp_list = inference_vllm(llm_engine, template, request_list) +for request, resp in zip(request_list, resp_list): + print(f"query: {request['query']}") + print(f"response: {resp['response']}") + print(f"history: {resp['history']}") + +"""Out[0] +query: 你好! +response: 您好,我是人工智能助手。很高兴为您服务!请问有什么问题我可以帮您解答? +query: 浙江的省会在哪? +response: 浙江的省会是杭州。 +query: 这有什么好吃的 +response: 浙江有很多美食,其中一些非常有名的包括杭州的龙井虾仁、东坡肉、西湖醋鱼、叫化童子鸡等。另外,浙江还有很多特色小吃和糕点,比如宁波的汤团、年糕,温州的炒螃蟹、温州肉圆等。 +history: [('浙江的省会在哪?', '浙江的省会是杭州。'), ('这有什么好吃的', '浙江有很多美食,其中一些非常有名的包括杭州的龙井虾仁、东坡肉、西湖醋鱼、叫化童子鸡等。另外,浙江还有很多特色小吃和糕点,比如宁波的汤团、年糕,温州的炒螃蟹、温州肉圆等。')] +""" +``` + +### 微调后的模型 + +**单样本推理**: + +使用LoRA进行微调的模型你需要先[merge-lora](./LLM微调文档.md#merge-lora), 产生完整的checkpoint目录. + +使用全参数微调的模型可以无缝使用VLLM进行推理加速. +```python +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +from swift.llm import ( + ModelType, get_vllm_engine, get_default_template_type, + get_template, inference_vllm +) +from swift.tuners import Swift + +model_dir = 'vx_xxx/checkpoint-100-merged' +model_type = ModelType.qwen_7b_chat +template_type = get_default_template_type(model_type) + +llm_engine = get_vllm_engine(model_type, model_dir=model_dir) +tokenizer = llm_engine.tokenizer +template = get_template(template_type, tokenizer) +query = '你好' +resp = inference_vllm(llm_engine, template, [{'query': query}])[0] +print(f"response: {resp['response']}") +print(f"history: {resp['history']}") +``` + +使用**数据集**评估: +```bash +# merge LoRA增量权重并使用vllm进行推理加速 +swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx' +CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged' --infer_backend vllm +``` + +## Web-UI加速 + +### 原始模型 +```bash +CUDA_VISIBLE_DEVICES=0 swift app-ui --model_id_or_path qwen/Qwen-7B-Chat --infer_backend vllm +``` + +### 微调后模型 +```bash +# merge LoRA增量权重并使用vllm作为backend构建app-ui +swift merge-lora --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx' +CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx_xxx/checkpoint-xxx-merged' --infer_backend vllm +``` + +## 部署 +TODO diff --git a/docs/source/LLM/命令行参数.md b/docs/source/LLM/命令行参数.md index 7514fd947..99ec009f2 100644 --- a/docs/source/LLM/命令行参数.md +++ b/docs/source/LLM/命令行参数.md @@ -91,6 +91,7 @@ - `--model_cache_dir`: 默认值为`None`. 具体的参数介绍可以在`sft.sh命令行参数`中查看. - `--sft_type`: 默认值为`'lora'`, 具体的参数介绍可以在`sft.sh命令行参数`中查看. - `--template_type`: 默认值为`'AUTO'`, 具体的参数介绍可以在`sft.sh命令行参数`中查看. +- `--infer_backend`: 你可以选择'AUTO', 'vllm', 'pt'. 
默认使用'AUTO', 进行智能选择, 即如果没有传入`ckpt_dir`或使用全参数微调, 并且安装了vllm且模型支持vllm则使用vllm引擎, 否则使用原生torch进行推理. vllm环境准备可以参考[VLLM推理加速与部署](./VLLM推理加速与部署.md#环境准备). - `--ckpt_dir`: 必填项, 值为SFT阶段保存的checkpoint路径, e.g. `'/path/to/your/vx_xxx/checkpoint-xxx'`. - `--load_args_from_ckpt_dir`: 是否从`ckpt_dir`的`sft_args.json`文件中读取配置信息. 默认是`True`. - `--load_dataset_config`: 该参数只有在`--load_args_from_ckpt_dir true`时才生效. 即是否从`ckpt_dir`的`sft_args.json`文件中读取数据集相关的配置信息. 默认为`True`. @@ -125,3 +126,5 @@ - `--overwrite_generation_config`: 是否将评估所使用的generation_config保存成`generation_config.json`文件, 默认为`False`. 训练时保存的generation_config文件将被覆盖. - `--verbose`: 如果设置为False, 则使用tqdm样式推理. 如果设置为True, 则输出推理的query, response, label. 默认为`None`, 进行自动选择, 即`len(val_dataset) >= 100`时, 设置为False, 否则设置为True. 该参数只有在`--eval_human false`时才生效. - `--share`: 传递给gradio的`demo.queue().launch(...)`函数. 该参数只有在使用`app-ui`时才生效. +- `--gpu_memory_utilization`: 初始化vllm引擎`EngineArgs`的参数, 默认为`0.9`. 该参数只有在使用vllm时才生效. +- `--tensor_parallel_size`: 初始化vllm引擎`EngineArgs`的参数, 默认为`1`. 该参数只有在使用vllm时才生效. diff --git a/docs/source/LLM/支持的模型和数据集.md b/docs/source/LLM/支持的模型和数据集.md index 64a4c3600..b45b2369c 100644 --- a/docs/source/LLM/支持的模型和数据集.md +++ b/docs/source/LLM/支持的模型和数据集.md @@ -8,105 +8,106 @@ - Model List: 模型在swift中注册的model_type的列表. - Default Lora Target Modules: 对应模型的默认lora_target_modules. - Default Template: 对应模型的默认template. -- Support Flash Attn: 模型是否支持[flash attention](https://github.com/Dao-AILab/flash-attention). +- Support Flash Attn: 模型是否支持[flash attention](https://github.com/Dao-AILab/flash-attention)加速推理和微调. +- Support VLLM: 模型是否支持[vllm](https://github.com/vllm-project/vllm)加速推理和部署. - Requires: 对应模型所需的额外依赖要求. -| Model Type | Model ID | Default Lora Target Modules | Default Template | Support Flash Attn | Requires | -| --------- | -------- | --------------------------- | ---------------- | ------------------ | -------- | -|qwen-1_8b|[qwen/Qwen-1_8B](https://modelscope.cn/models/qwen/Qwen-1_8B/summary)|c_attn|default-generation|✔|| -|qwen-1_8b-chat|[qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary)|c_attn|chatml|✔|| -|qwen-1_8b-chat-int4|[qwen/Qwen-1_8B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-1_8b-chat-int8|[qwen/Qwen-1_8B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-7b|[qwen/Qwen-7B](https://modelscope.cn/models/qwen/Qwen-7B/summary)|c_attn|default-generation|✔|| -|qwen-7b-chat|[qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary)|c_attn|chatml|✔|| -|qwen-7b-chat-int4|[qwen/Qwen-7B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-7b-chat-int8|[qwen/Qwen-7B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-14b|[qwen/Qwen-14B](https://modelscope.cn/models/qwen/Qwen-14B/summary)|c_attn|default-generation|✔|| -|qwen-14b-chat|[qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary)|c_attn|chatml|✔|| -|qwen-14b-chat-int4|[qwen/Qwen-14B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-14b-chat-int8|[qwen/Qwen-14B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-72b|[qwen/Qwen-72B](https://modelscope.cn/models/qwen/Qwen-72B/summary)|c_attn|default-generation|✔|| 
-|qwen-72b-chat|[qwen/Qwen-72B-Chat](https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary)|c_attn|chatml|✔|| -|qwen-72b-chat-int4|[qwen/Qwen-72B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-72b-chat-int8|[qwen/Qwen-72B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-vl|[qwen/Qwen-VL](https://modelscope.cn/models/qwen/Qwen-VL/summary)|c_attn|default-generation|✔|| -|qwen-vl-chat|[qwen/Qwen-VL-Chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary)|c_attn|chatml|✔|| -|qwen-vl-chat-int4|[qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|qwen-audio|[qwen/Qwen-Audio](https://modelscope.cn/models/qwen/Qwen-Audio/summary)|c_attn|default-generation|✔|| -|qwen-audio-chat|[qwen/Qwen-Audio-Chat](https://modelscope.cn/models/qwen/Qwen-Audio-Chat/summary)|c_attn|chatml|✔|| -|chatglm2-6b|[ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b/summary)|query_key_value|chatglm2|✘|| -|chatglm2-6b-32k|[ZhipuAI/chatglm2-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm2-6b-32k/summary)|query_key_value|chatglm2|✘|| -|chatglm3-6b-base|[ZhipuAI/chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base/summary)|query_key_value|chatglm-generation|✘|| -|chatglm3-6b|[ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary)|query_key_value|chatglm3|✘|| -|chatglm3-6b-32k|[ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary)|query_key_value|chatglm3|✘|| -|llama2-7b|[modelscope/Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|llama2-7b-chat|[modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|| -|llama2-13b|[modelscope/Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|llama2-13b-chat|[modelscope/Llama-2-13b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|| -|llama2-70b|[modelscope/Llama-2-70b-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|llama2-70b-chat|[modelscope/Llama-2-70b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|| -|yi-6b|[01ai/Yi-6B](https://modelscope.cn/models/01ai/Yi-6B/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-6b-200k|[01ai/Yi-6B-200K](https://modelscope.cn/models/01ai/Yi-6B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-6b-chat|[01ai/Yi-6B-Chat](https://modelscope.cn/models/01ai/Yi-6B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|| -|yi-34b|[01ai/Yi-34B](https://modelscope.cn/models/01ai/Yi-34B/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-34b-200k|[01ai/Yi-34B-200K](https://modelscope.cn/models/01ai/Yi-34B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|| -|yi-34b-chat|[01ai/Yi-34B-Chat](https://modelscope.cn/models/01ai/Yi-34B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|| -|deepseek-7b|[deepseek-ai/deepseek-llm-7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| 
-|deepseek-7b-chat|[deepseek-ai/deepseek-llm-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|| -|deepseek-67b|[deepseek-ai/deepseek-llm-67b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-67b-chat|[deepseek-ai/deepseek-llm-67b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|| -|openbuddy-llama2-13b-chat|[OpenBuddy/openbuddy-llama2-13b-v8.1-fp16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|openbuddy-llama-65b-chat|[OpenBuddy/openbuddy-llama-65b-v8-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama-65b-v8-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|openbuddy-llama2-70b-chat|[OpenBuddy/openbuddy-llama2-70b-v10.1-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-70b-v10.1-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|openbuddy-mistral-7b-chat|[OpenBuddy/openbuddy-mistral-7b-v13.1](https://modelscope.cn/models/OpenBuddy/openbuddy-mistral-7b-v13.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|transformers>=4.34| -|openbuddy-zephyr-7b-chat|[OpenBuddy/openbuddy-zephyr-7b-v14.1](https://modelscope.cn/models/OpenBuddy/openbuddy-zephyr-7b-v14.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|transformers>=4.34| -|openbuddy-deepseek-67b-chat|[OpenBuddy/openbuddy-deepseek-67b-v15.2](https://modelscope.cn/models/OpenBuddy/openbuddy-deepseek-67b-v15.2/summary)|q_proj, k_proj, v_proj|openbuddy|✔|| -|mistral-7b|[AI-ModelScope/Mistral-7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|transformers>=4.34| -|mistral-7b-chat|[AI-ModelScope/Mistral-7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|transformers>=4.34| -|mistral-7b-chat-v2|[AI-ModelScope/Mistral-7B-Instruct-v0.2](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.2/summary)|q_proj, k_proj, v_proj|llama|✔|transformers>=4.34| -|mixtral-7b-moe|[AI-ModelScope/Mixtral-8x7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|transformers>=4.36| -|mixtral-7b-moe-chat|[AI-ModelScope/Mixtral-8x7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|transformers>=4.36| -|baichuan-7b|[baichuan-inc/baichuan-7B](https://modelscope.cn/models/baichuan-inc/baichuan-7B/summary)|W_pack|default-generation|✘|transformers<4.34| -|baichuan-13b|[baichuan-inc/Baichuan-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Base/summary)|W_pack|default-generation|✘|transformers<4.34| -|baichuan-13b-chat|[baichuan-inc/Baichuan-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Chat/summary)|W_pack|baichuan|✘|transformers<4.34| -|baichuan2-7b|[baichuan-inc/Baichuan2-7B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Base/summary)|W_pack|default-generation|✘|| -|baichuan2-7b-chat|[baichuan-inc/Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat/summary)|W_pack|baichuan|✘|| -|baichuan2-7b-chat-int4|[baichuan-inc/Baichuan2-7B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat-4bits/summary)|W_pack|baichuan|✘|| 
-|baichuan2-13b|[baichuan-inc/Baichuan2-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Base/summary)|W_pack|default-generation|✘|| -|baichuan2-13b-chat|[baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat/summary)|W_pack|baichuan|✘|| -|baichuan2-13b-chat-int4|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat-4bits/summary)|W_pack|baichuan|✘|| -|internlm-7b|[Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|internlm-7b-chat|[Shanghai_AI_Laboratory/internlm-chat-7b-v1_1](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-v1_1/summary)|q_proj, k_proj, v_proj|internlm|✘|| -|internlm-7b-chat-8k|[Shanghai_AI_Laboratory/internlm-chat-7b-8k](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-8k/summary)|q_proj, k_proj, v_proj|internlm|✘|| -|internlm-20b|[Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|internlm-20b-chat|[Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-20b/summary)|q_proj, k_proj, v_proj|internlm|✘|| -|xverse-7b|[xverse/XVERSE-7B](https://modelscope.cn/models/xverse/XVERSE-7B/summary)|q_proj, k_proj, v_proj|default-generation|✘|| -|xverse-7b-chat|[xverse/XVERSE-7B-Chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|| -|xverse-13b|[xverse/XVERSE-13B](https://modelscope.cn/models/xverse/XVERSE-13B/summary)|q_proj, k_proj, v_proj|default-generation|✘|| -|xverse-13b-chat|[xverse/XVERSE-13B-Chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|| -|xverse-65b|[xverse/XVERSE-65B](https://modelscope.cn/models/xverse/XVERSE-65B/summary)|q_proj, k_proj, v_proj|default-generation|✘|| -|bluelm-7b|[vivo-ai/BlueLM-7B-Base](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|bluelm-7b-32k|[vivo-ai/BlueLM-7B-Base-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base-32K/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|bluelm-7b-chat|[vivo-ai/BlueLM-7B-Chat](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat/summary)|q_proj, k_proj, v_proj|bluelm|✘|| -|bluelm-7b-chat-32k|[vivo-ai/BlueLM-7B-Chat-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat-32K/summary)|q_proj, k_proj, v_proj|bluelm|✘|| -|ziya2-13b|[Fengshenbang/Ziya2-13B-Base](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|ziya2-13b-chat|[Fengshenbang/Ziya2-13B-Chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary)|q_proj, k_proj, v_proj|ziya|✔|| -|skywork-13b|[skywork/Skywork-13B-base](https://modelscope.cn/models/skywork/Skywork-13B-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|| -|skywork-13b-chat|[skywork/Skywork-13B-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary)|q_proj, k_proj, v_proj|skywork|✘|| -|zephyr-7b-beta-chat|[modelscope/zephyr-7b-beta](https://modelscope.cn/models/modelscope/zephyr-7b-beta/summary)|q_proj, k_proj, v_proj|zephyr|✔|transformers>=4.34| -|sus-34b-chat|[SUSTC/SUS-Chat-34B](https://modelscope.cn/models/SUSTC/SUS-Chat-34B/summary)|q_proj, k_proj, v_proj|sus|✔|| 
-|polylm-13b|[damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary)|c_attn|default-generation|✘|| -|seqgpt-560m|[damo/nlp_seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary)|query_key_value|default-generation|✘|| -|tongyi-finance-14b|[TongyiFinance/Tongyi-Finance-14B](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B/summary)|c_attn|default-generation|✔|| -|tongyi-finance-14b-chat|[TongyiFinance/Tongyi-Finance-14B-Chat](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat/summary)|c_attn|chatml|✔|| -|tongyi-finance-14b-chat-int4|[TongyiFinance/Tongyi-Finance-14B-Chat-Int4](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat-Int4/summary)|c_attn|chatml|✔|auto_gptq>=0.5| -|codefuse-codellama-34b-chat|[codefuse-ai/CodeFuse-CodeLlama-34B](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary)|q_proj, k_proj, v_proj|codefuse-codellama|✔|| -|deepseek-coder-1_3b|[deepseek-ai/deepseek-coder-1.3b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-coder-1_3b-chat|[deepseek-ai/deepseek-coder-1.3b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|| -|deepseek-coder-6_7b|[deepseek-ai/deepseek-coder-6.7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-coder-6_7b-chat|[deepseek-ai/deepseek-coder-6.7b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|| -|deepseek-coder-33b|[deepseek-ai/deepseek-coder-33b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|| -|deepseek-coder-33b-chat|[deepseek-ai/deepseek-coder-33b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|| +| Model Type | Model ID | Default Lora Target Modules | Default Template | Support Flash Attn | Support VLLM | Requires | +| --------- | -------- | --------------------------- | ---------------- | ------------------ | ------------ | -------- | +|qwen-1_8b|[qwen/Qwen-1_8B](https://modelscope.cn/models/qwen/Qwen-1_8B/summary)|c_attn|default-generation|✔|✔|| +|qwen-1_8b-chat|[qwen/Qwen-1_8B-Chat](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-1_8b-chat-int4|[qwen/Qwen-1_8B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-1_8b-chat-int8|[qwen/Qwen-1_8B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-1_8B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-7b|[qwen/Qwen-7B](https://modelscope.cn/models/qwen/Qwen-7B/summary)|c_attn|default-generation|✔|✔|| +|qwen-7b-chat|[qwen/Qwen-7B-Chat](https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-7b-chat-int4|[qwen/Qwen-7B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-7b-chat-int8|[qwen/Qwen-7B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-7B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-14b|[qwen/Qwen-14B](https://modelscope.cn/models/qwen/Qwen-14B/summary)|c_attn|default-generation|✔|✔|| 
+|qwen-14b-chat|[qwen/Qwen-14B-Chat](https://modelscope.cn/models/qwen/Qwen-14B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-14b-chat-int4|[qwen/Qwen-14B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-14b-chat-int8|[qwen/Qwen-14B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-14B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-72b|[qwen/Qwen-72B](https://modelscope.cn/models/qwen/Qwen-72B/summary)|c_attn|default-generation|✔|✔|| +|qwen-72b-chat|[qwen/Qwen-72B-Chat](https://modelscope.cn/models/qwen/Qwen-72B-Chat/summary)|c_attn|chatml|✔|✔|| +|qwen-72b-chat-int4|[qwen/Qwen-72B-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-72b-chat-int8|[qwen/Qwen-72B-Chat-Int8](https://modelscope.cn/models/qwen/Qwen-72B-Chat-Int8/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-vl|[qwen/Qwen-VL](https://modelscope.cn/models/qwen/Qwen-VL/summary)|c_attn|default-generation|✔|✘|| +|qwen-vl-chat|[qwen/Qwen-VL-Chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary)|c_attn|chatml|✔|✘|| +|qwen-vl-chat-int4|[qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/qwen/Qwen-VL-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|qwen-audio|[qwen/Qwen-Audio](https://modelscope.cn/models/qwen/Qwen-Audio/summary)|c_attn|default-generation|✔|✘|| +|qwen-audio-chat|[qwen/Qwen-Audio-Chat](https://modelscope.cn/models/qwen/Qwen-Audio-Chat/summary)|c_attn|chatml|✔|✘|| +|chatglm2-6b|[ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b/summary)|query_key_value|chatglm2|✘|✔|| +|chatglm2-6b-32k|[ZhipuAI/chatglm2-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm2-6b-32k/summary)|query_key_value|chatglm2|✘|✔|| +|chatglm3-6b-base|[ZhipuAI/chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base/summary)|query_key_value|chatglm-generation|✘|✔|| +|chatglm3-6b|[ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b/summary)|query_key_value|chatglm3|✘|✔|| +|chatglm3-6b-32k|[ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary)|query_key_value|chatglm3|✘|✔|| +|llama2-7b|[modelscope/Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|llama2-7b-chat|[modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔|| +|llama2-13b|[modelscope/Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|llama2-13b-chat|[modelscope/Llama-2-13b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔|| +|llama2-70b|[modelscope/Llama-2-70b-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-ms/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|llama2-70b-chat|[modelscope/Llama-2-70b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-70b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔|| +|yi-6b|[01ai/Yi-6B](https://modelscope.cn/models/01ai/Yi-6B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-6b-200k|[01ai/Yi-6B-200K](https://modelscope.cn/models/01ai/Yi-6B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-6b-chat|[01ai/Yi-6B-Chat](https://modelscope.cn/models/01ai/Yi-6B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|✔|| 
+|yi-34b|[01ai/Yi-34B](https://modelscope.cn/models/01ai/Yi-34B/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-34b-200k|[01ai/Yi-34B-200K](https://modelscope.cn/models/01ai/Yi-34B-200K/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔|| +|yi-34b-chat|[01ai/Yi-34B-Chat](https://modelscope.cn/models/01ai/Yi-34B-Chat/summary)|q_proj, k_proj, v_proj|yi|✔|✔|| +|deepseek-7b|[deepseek-ai/deepseek-llm-7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|deepseek-7b-chat|[deepseek-ai/deepseek-llm-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|✔|| +|deepseek-67b|[deepseek-ai/deepseek-llm-67b-base](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|deepseek-67b-chat|[deepseek-ai/deepseek-llm-67b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-llm-67b-chat/summary)|q_proj, k_proj, v_proj|deepseek|✔|✔|| +|openbuddy-llama2-13b-chat|[OpenBuddy/openbuddy-llama2-13b-v8.1-fp16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|openbuddy-llama-65b-chat|[OpenBuddy/openbuddy-llama-65b-v8-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama-65b-v8-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|openbuddy-llama2-70b-chat|[OpenBuddy/openbuddy-llama2-70b-v10.1-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-70b-v10.1-bf16/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|openbuddy-mistral-7b-chat|[OpenBuddy/openbuddy-mistral-7b-v13.1](https://modelscope.cn/models/OpenBuddy/openbuddy-mistral-7b-v13.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|transformers>=4.34| +|openbuddy-zephyr-7b-chat|[OpenBuddy/openbuddy-zephyr-7b-v14.1](https://modelscope.cn/models/OpenBuddy/openbuddy-zephyr-7b-v14.1/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|transformers>=4.34| +|openbuddy-deepseek-67b-chat|[OpenBuddy/openbuddy-deepseek-67b-v15.2](https://modelscope.cn/models/OpenBuddy/openbuddy-deepseek-67b-v15.2/summary)|q_proj, k_proj, v_proj|openbuddy|✔|✔|| +|mistral-7b|[AI-ModelScope/Mistral-7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|transformers>=4.34| +|mistral-7b-chat|[AI-ModelScope/Mistral-7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|✔|transformers>=4.34| +|mistral-7b-chat-v2|[AI-ModelScope/Mistral-7B-Instruct-v0.2](https://modelscope.cn/models/AI-ModelScope/Mistral-7B-Instruct-v0.2/summary)|q_proj, k_proj, v_proj|llama|✔|✔|transformers>=4.34| +|mixtral-7b-moe|[AI-ModelScope/Mixtral-8x7B-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-v0.1/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|transformers>=4.36| +|mixtral-7b-moe-chat|[AI-ModelScope/Mixtral-8x7B-Instruct-v0.1](https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-Instruct-v0.1/summary)|q_proj, k_proj, v_proj|llama|✔|✔|transformers>=4.36| +|baichuan-7b|[baichuan-inc/baichuan-7B](https://modelscope.cn/models/baichuan-inc/baichuan-7B/summary)|W_pack|default-generation|✘|✔|transformers<4.34| +|baichuan-13b|[baichuan-inc/Baichuan-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Base/summary)|W_pack|default-generation|✘|✔|transformers<4.34| 
+|baichuan-13b-chat|[baichuan-inc/Baichuan-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan-13B-Chat/summary)|W_pack|baichuan|✘|✔|transformers<4.34| +|baichuan2-7b|[baichuan-inc/Baichuan2-7B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Base/summary)|W_pack|default-generation|✘|✔|| +|baichuan2-7b-chat|[baichuan-inc/Baichuan2-7B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat/summary)|W_pack|baichuan|✘|✔|| +|baichuan2-7b-chat-int4|[baichuan-inc/Baichuan2-7B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-7B-Chat-4bits/summary)|W_pack|baichuan|✘|✘|| +|baichuan2-13b|[baichuan-inc/Baichuan2-13B-Base](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Base/summary)|W_pack|default-generation|✘|✔|| +|baichuan2-13b-chat|[baichuan-inc/Baichuan2-13B-Chat](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat/summary)|W_pack|baichuan|✘|✔|| +|baichuan2-13b-chat-int4|[baichuan-inc/Baichuan2-13B-Chat-4bits](https://modelscope.cn/models/baichuan-inc/Baichuan2-13B-Chat-4bits/summary)|W_pack|baichuan|✘|✘|| +|internlm-7b|[Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✔|| +|internlm-7b-chat|[Shanghai_AI_Laboratory/internlm-chat-7b-v1_1](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-v1_1/summary)|q_proj, k_proj, v_proj|internlm|✘|✔|| +|internlm-7b-chat-8k|[Shanghai_AI_Laboratory/internlm-chat-7b-8k](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b-8k/summary)|q_proj, k_proj, v_proj|internlm|✘|✔|| +|internlm-20b|[Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✔|| +|internlm-20b-chat|[Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-20b/summary)|q_proj, k_proj, v_proj|internlm|✘|✔|| +|xverse-7b|[xverse/XVERSE-7B](https://modelscope.cn/models/xverse/XVERSE-7B/summary)|q_proj, k_proj, v_proj|default-generation|✘|✘|| +|xverse-7b-chat|[xverse/XVERSE-7B-Chat](https://modelscope.cn/models/xverse/XVERSE-7B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|✘|| +|xverse-13b|[xverse/XVERSE-13B](https://modelscope.cn/models/xverse/XVERSE-13B/summary)|q_proj, k_proj, v_proj|default-generation|✘|✘|| +|xverse-13b-chat|[xverse/XVERSE-13B-Chat](https://modelscope.cn/models/xverse/XVERSE-13B-Chat/summary)|q_proj, k_proj, v_proj|xverse|✘|✘|| +|xverse-65b|[xverse/XVERSE-65B](https://modelscope.cn/models/xverse/XVERSE-65B/summary)|q_proj, k_proj, v_proj|default-generation|✘|✘|| +|bluelm-7b|[vivo-ai/BlueLM-7B-Base](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✘|| +|bluelm-7b-32k|[vivo-ai/BlueLM-7B-Base-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Base-32K/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✘|| +|bluelm-7b-chat|[vivo-ai/BlueLM-7B-Chat](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat/summary)|q_proj, k_proj, v_proj|bluelm|✘|✘|| +|bluelm-7b-chat-32k|[vivo-ai/BlueLM-7B-Chat-32K](https://modelscope.cn/models/vivo-ai/BlueLM-7B-Chat-32K/summary)|q_proj, k_proj, v_proj|bluelm|✘|✘|| +|ziya2-13b|[Fengshenbang/Ziya2-13B-Base](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|ziya2-13b-chat|[Fengshenbang/Ziya2-13B-Chat](https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Chat/summary)|q_proj, 
k_proj, v_proj|ziya|✔|✔|| +|skywork-13b|[skywork/Skywork-13B-base](https://modelscope.cn/models/skywork/Skywork-13B-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✘|✘|| +|skywork-13b-chat|[skywork/Skywork-13B-chat](https://modelscope.cn/models/skywork/Skywork-13B-chat/summary)|q_proj, k_proj, v_proj|skywork|✘|✘|| +|zephyr-7b-beta-chat|[modelscope/zephyr-7b-beta](https://modelscope.cn/models/modelscope/zephyr-7b-beta/summary)|q_proj, k_proj, v_proj|zephyr|✔|✔|transformers>=4.34| +|sus-34b-chat|[SUSTC/SUS-Chat-34B](https://modelscope.cn/models/SUSTC/SUS-Chat-34B/summary)|q_proj, k_proj, v_proj|sus|✔|✔|| +|polylm-13b|[damo/nlp_polylm_13b_text_generation](https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation/summary)|c_attn|default-generation|✘|✘|| +|seqgpt-560m|[damo/nlp_seqgpt-560m](https://modelscope.cn/models/damo/nlp_seqgpt-560m/summary)|query_key_value|default-generation|✘|✔|| +|tongyi-finance-14b|[TongyiFinance/Tongyi-Finance-14B](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B/summary)|c_attn|default-generation|✔|✔|| +|tongyi-finance-14b-chat|[TongyiFinance/Tongyi-Finance-14B-Chat](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat/summary)|c_attn|chatml|✔|✔|| +|tongyi-finance-14b-chat-int4|[TongyiFinance/Tongyi-Finance-14B-Chat-Int4](https://modelscope.cn/models/TongyiFinance/Tongyi-Finance-14B-Chat-Int4/summary)|c_attn|chatml|✔|✘|auto_gptq>=0.5| +|codefuse-codellama-34b-chat|[codefuse-ai/CodeFuse-CodeLlama-34B](https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B/summary)|q_proj, k_proj, v_proj|codefuse-codellama|✔|✔|| +|deepseek-coder-1_3b|[deepseek-ai/deepseek-coder-1.3b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|deepseek-coder-1_3b-chat|[deepseek-ai/deepseek-coder-1.3b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-1.3b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|✔|| +|deepseek-coder-6_7b|[deepseek-ai/deepseek-coder-6.7b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|deepseek-coder-6_7b-chat|[deepseek-ai/deepseek-coder-6.7b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|✔|| +|deepseek-coder-33b|[deepseek-ai/deepseek-coder-33b-base](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-base/summary)|q_proj, k_proj, v_proj|default-generation-bos|✔|✔|| +|deepseek-coder-33b-chat|[deepseek-ai/deepseek-coder-33b-instruct](https://modelscope.cn/models/deepseek-ai/deepseek-coder-33b-instruct/summary)|q_proj, k_proj, v_proj|deepseek-coder|✔|✔|| ## 数据集 diff --git a/docs/source/LLM/自我认知微调最佳实践.md b/docs/source/LLM/自我认知微调最佳实践.md index 785463c31..97636e549 100644 --- a/docs/source/LLM/自我认知微调最佳实践.md +++ b/docs/source/LLM/自我认知微调最佳实践.md @@ -283,6 +283,7 @@ CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'qwen-7b-chat/vx-xxx/checkpoint-x ## 了解更多 - 快速对LLM进行**推理**, 搭建**Web-UI**, 可以查看[LLM推理文档](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md). - 快速对LLM进行**微调**, 推理并搭建Web-UI. 可以查看[LLM微调文档](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md). +- 使用VLLM进行**推理加速**和**部署**. 可以查看[VLLM推理加速与部署](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md). - 查看swift支持的模型和数据集. 可以查看[支持的模型和数据集](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md). 
- 对swift中的模型, 数据集, 对话模板进行**拓展**, 可以查看[自定义与拓展](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md). - 查询微调和推理的命令行参数, 可以查看[命令行参数](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md). diff --git a/examples/pytorch/llm/app.py b/examples/pytorch/llm/app.py index 8d7e9c40a..c9a208303 100644 --- a/examples/pytorch/llm/app.py +++ b/examples/pytorch/llm/app.py @@ -1,8 +1,9 @@ +# Copyright (c) Alibaba, Inc. and its affiliates. # import os # os.environ['CUDA_VISIBLE_DEVICES'] = '0' +import custom -from swift.llm import InferArguments, ModelType -from swift.llm.run import app_ui_main +from swift.llm import InferArguments, ModelType, app_ui_main if __name__ == '__main__': # Please refer to the `infer.sh` for setting the parameters. diff --git a/examples/pytorch/llm/llm_infer.py b/examples/pytorch/llm/llm_infer.py index 1e247b46e..7fa096807 100644 --- a/examples/pytorch/llm/llm_infer.py +++ b/examples/pytorch/llm/llm_infer.py @@ -1,7 +1,7 @@ # Copyright (c) Alibaba, Inc. and its affiliates. import custom -from swift.llm.run import infer_main +from swift.llm import infer_main if __name__ == '__main__': result = infer_main() diff --git a/examples/pytorch/llm/llm_sft.py b/examples/pytorch/llm/llm_sft.py index a1c9fc398..899c6e41e 100644 --- a/examples/pytorch/llm/llm_sft.py +++ b/examples/pytorch/llm/llm_sft.py @@ -1,7 +1,7 @@ # Copyright (c) Alibaba, Inc. and its affiliates. import custom -from swift.llm.run import sft_main +from swift.llm import sft_main if __name__ == '__main__': output = sft_main() diff --git a/examples/pytorch/llm/rome_infer.py b/examples/pytorch/llm/rome_infer.py index 139759a47..db9cc077b 100644 --- a/examples/pytorch/llm/rome_infer.py +++ b/examples/pytorch/llm/rome_infer.py @@ -1,6 +1,6 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
-from swift.llm.run import rome_main +from swift.llm import rome_main if __name__ == '__main__': rome_main() diff --git a/scripts/utils/test_readme.py b/scripts/tests/test_readme.py similarity index 100% rename from scripts/utils/test_readme.py rename to scripts/tests/test_readme.py diff --git a/scripts/tests/test_vllm.py/main.py b/scripts/tests/test_vllm.py/main.py new file mode 100644 index 000000000..7bf7379bc --- /dev/null +++ b/scripts/tests/test_vllm.py/main.py @@ -0,0 +1,18 @@ +import os +import subprocess + +from swift.llm import ModelType + +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +if __name__ == '__main__': + model_name_list = ModelType.get_model_name_list() + success_model_list = [] + fpath = os.path.join(os.path.dirname(__file__), 'utils.py') + for model_name in model_name_list: + code = subprocess.run(['python', fpath, '--model_type', model_name]) + if code.returncode == 0: + success_model_list.append(model_name) + else: + print(f'model_name: {model_name} not support vllm.') + print(success_model_list) diff --git a/scripts/tests/test_vllm.py/utils.py b/scripts/tests/test_vllm.py/utils.py new file mode 100644 index 000000000..4abe73528 --- /dev/null +++ b/scripts/tests/test_vllm.py/utils.py @@ -0,0 +1,31 @@ +from dataclasses import dataclass + +from swift.llm import (get_default_template_type, get_template, + get_vllm_engine, inference_vllm) +from swift.utils import get_main + + +@dataclass +class VLLMTestArgs: + model_type: str + + +def test_vllm(args: VLLMTestArgs) -> None: + model_type = args.model_type + llm_engine = get_vllm_engine(model_type) + template_type = get_default_template_type(model_type) + template = get_template(template_type, llm_engine.tokenizer) + + llm_engine.generation_config.max_new_tokens = 256 + + request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}] + resp_list = inference_vllm(llm_engine, template, request_list) + for request, resp in zip(request_list, resp_list): + print(f"query: {request['query']}") + print(f"response: {resp['response']}") + + +test_vllm_main = get_main(VLLMTestArgs, test_vllm) + +if __name__ == '__main__': + test_vllm_main() diff --git a/scripts/utils/run_model_info.py b/scripts/utils/run_model_info.py index f062e657c..43c9b6154 100644 --- a/scripts/utils/run_model_info.py +++ b/scripts/utils/run_model_info.py @@ -8,9 +8,9 @@ def write_model_info_table2(fpath: str) -> None: with open(fpath, 'w', encoding='utf-8') as f: f.write( '| Model Type | Model ID | Default Lora Target Modules | Default Template |' - ' Support Flash Attn | Requires |\n' + ' Support Flash Attn | Support VLLM | Requires |\n' '| --------- | -------- | --------------------------- | ---------------- |' - ' ------------------ | -------- |\n') + ' ------------------ | ------------ | -------- |\n') res = [] bool_mapping = {True: '✔', False: '✘'} for model_name in model_name_list: @@ -20,16 +20,18 @@ def write_model_info_table2(fpath: str) -> None: template = model_info['template'] support_flash_attn = model_info.get('support_flash_attn', False) support_flash_attn = bool_mapping[support_flash_attn] + support_vllm = model_info.get('support_vllm', False) + support_vllm = bool_mapping[support_vllm] requires = ', '.join(model_info['requires']) r = [ model_name, model_id, lora_target_modules, template, - support_flash_attn, requires + support_flash_attn, support_vllm, requires ] res.append(r) text = '' for r in res: url = f'https://modelscope.cn/models/{r[1]}/summary' - text += f'|{r[0]}|[{r[1]}]({url})|{r[2]}|{r[3]}|{r[4]}|{r[5]}|\n' + text += 
f'|{r[0]}|[{r[1]}]({url})|{r[2]}|{r[3]}|{r[4]}|{r[5]}|{r[6]}|\n' with open(fpath, 'a', encoding='utf-8') as f: f.write(text) print() diff --git a/swift/cli/app_ui.py b/swift/cli/app_ui.py index 93734c2d4..b3b135539 100644 --- a/swift/cli/app_ui.py +++ b/swift/cli/app_ui.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import app_ui_main +from swift.llm import app_ui_main if __name__ == '__main__': app_ui_main() diff --git a/swift/cli/infer.py b/swift/cli/infer.py index d855ae735..2dce4f3ac 100644 --- a/swift/cli/infer.py +++ b/swift/cli/infer.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import infer_main +from swift.llm import infer_main if __name__ == '__main__': infer_main() diff --git a/swift/cli/main.py b/swift/cli/main.py index d3aa4a3d6..2e1c92a45 100644 --- a/swift/cli/main.py +++ b/swift/cli/main.py @@ -1,17 +1,16 @@ # Copyright (c) Alibaba, Inc. and its affiliates. +import importlib.util import os import subprocess import sys from typing import Dict, List, Optional -from swift.cli import app_ui, infer, merge_lora, sft, ui - ROUTE_MAPPING: Dict[str, str] = { - 'sft': sft.__file__, - 'infer': infer.__file__, - 'app-ui': app_ui.__file__, - 'merge-lora': merge_lora.__file__, - 'web-ui': ui.__file__ + 'sft': 'swift.cli.sft', + 'infer': 'swift.cli.infer', + 'app-ui': 'swift.cli.app_ui', + 'merge-lora': 'swift.cli.merge_lora', + 'web-ui': 'swift.cli.web_ui' } ROUTE_MAPPING.update( @@ -46,7 +45,7 @@ def cli_main() -> None: argv = sys.argv[1:] method_name = argv[0] argv = argv[1:] - file_path = ROUTE_MAPPING[method_name] + file_path = importlib.util.find_spec(ROUTE_MAPPING[method_name]).origin torchrun_args = get_torchrun_args() if torchrun_args is None or method_name != 'sft': args = ['python', file_path, *argv] diff --git a/swift/cli/merge_lora.py b/swift/cli/merge_lora.py index e17f453b4..5d35074b2 100644 --- a/swift/cli/merge_lora.py +++ b/swift/cli/merge_lora.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import merge_lora_main +from swift.llm import merge_lora_main if __name__ == '__main__': merge_lora_main(replace_if_exists=True) diff --git a/swift/cli/sft.py b/swift/cli/sft.py index 54d5ad638..6e52c4e0e 100644 --- a/swift/cli/sft.py +++ b/swift/cli/sft.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import sft_main +from swift.llm import sft_main if __name__ == '__main__': sft_main() diff --git a/swift/cli/ui.py b/swift/cli/ui.py deleted file mode 100644 index d494d112c..000000000 --- a/swift/cli/ui.py +++ /dev/null @@ -1,4 +0,0 @@ -from swift.ui.app import run_ui - -if __name__ == '__main__': - run_ui() diff --git a/swift/cli/web_ui.py b/swift/cli/web_ui.py index 93734c2d4..53d1f02a6 100644 --- a/swift/cli/web_ui.py +++ b/swift/cli/web_ui.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from swift.llm.run import app_ui_main +from swift.ui.app import run_ui if __name__ == '__main__': - app_ui_main() + run_ui() diff --git a/swift/llm/app_ui.py b/swift/llm/app_ui.py index 0cf770645..26fbe49ed 100644 --- a/swift/llm/app_ui.py +++ b/swift/llm/app_ui.py @@ -1,7 +1,7 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from typing import Tuple -from .infer import prepare_model_template +from .infer import merge_lora, prepare_model_template from .utils import (History, InferArguments, inference_stream, limit_history_length) @@ -12,12 +12,26 @@ def clear_session() -> History: def gradio_generation_demo(args: InferArguments) -> None: import gradio as gr - model, template = prepare_model_template(args) + if args.merge_lora_and_save: + merge_lora(args) + if args.infer_backend == 'vllm': + from swift.llm import prepare_vllm_engine_template, inference_stream_vllm, inference_vllm + llm_engine, template = prepare_vllm_engine_template(args) + else: + model, template = prepare_model_template(args) def model_generation(query: str) -> str: - gen = inference_stream(model, template, query, None) - for response, _ in gen: - yield response + if args.infer_backend == 'vllm': + gen = inference_stream_vllm(llm_engine, template, [{ + 'query': query + }]) + for resp_list in gen: + response = resp_list[0]['response'] + yield response + else: + gen = inference_stream(model, template, query, None) + for response, _ in gen: + yield response model_name = args.model_type.title() @@ -35,22 +49,39 @@ def gradio_generation_demo(args: InferArguments) -> None: def gradio_chat_demo(args: InferArguments) -> None: import gradio as gr - model, template = prepare_model_template(args) + if args.merge_lora_and_save: + merge_lora(args) + if args.infer_backend == 'vllm': + from swift.llm import prepare_vllm_engine_template, inference_stream_vllm + llm_engine, template = prepare_vllm_engine_template(args) + else: + model, template = prepare_model_template(args) def model_chat(query: str, history: History) -> Tuple[str, History]: old_history, history = limit_history_length(template, query, history, args.max_length) - gen = inference_stream(model, template, query, history) - for _, history in gen: - total_history = old_history + history - yield '', total_history + if args.infer_backend == 'vllm': + gen = inference_stream_vllm(llm_engine, template, + [{ + 'query': query, + 'history': history + }]) + for resp_list in gen: + history = resp_list[0]['history'] + total_history = old_history + history + yield '', total_history + else: + gen = inference_stream(model, template, query, history) + for _, history in gen: + total_history = old_history + history + yield '', total_history model_name = args.model_type.title() with gr.Blocks() as demo: gr.Markdown(f'
{model_name} Bot
') chatbot = gr.Chatbot(label=f'{model_name}') - message = gr.Textbox(lines=3, label='Input') + message = gr.Textbox(lines=2, label='Input') with gr.Row(): clear_history = gr.Button('🧹 清除历史对话') send = gr.Button('🚀 发送') diff --git a/swift/llm/data/self_cognition.jsonl b/swift/llm/data/self_cognition.jsonl index 8e3bbe5d9..312d3ed97 100644 --- a/swift/llm/data/self_cognition.jsonl +++ b/swift/llm/data/self_cognition.jsonl @@ -16,8 +16,8 @@ {"query": "who are you?", "response": "I am a language model AI assistant based on natural language processing technology, capable of answering questions, providing information, performing tasks, and engaging in conversation with users. If you have any questions or need assistance, feel free to let me know.", "tag": "en"} {"query": "请问你是谁?", "response": "我是{{NAME}},一个基于大规模语言模型GPT(生成预训练变换器)的人工智能聊天机器人。", "tag": "zh"} {"query": "May I ask who you are?", "response": "I am {{NAME}}, an artificial intelligence chatbot based on the large-scale language model GPT (Generative Pre-trained Transformer).", "tag": "en"} -{"query": "你是谁研发的", "response": "我是由{{AUTHOR}}研发的人工智能语言模型。{{AUTHOR}}是一个专注于人工智能研究和开发的组织,旨在推进友好AI的发展,并确保人工智能技术的广泛和安全应用。", "tag": "zh"} -{"query": "你是谁研发的?", "response": "我是由一支由专业的工程师、数据科学家和语言学家组成的团队开发的,旨在利用人工智能和自然语言处理技术为用户提供信息和帮助。我们致力于不断改进和优化这个语言模型,以使其能够更好地满足用户的需求。", "tag": "zh"} +{"query": "你是谁研发的", "response": "我是由{{AUTHOR}}研发的人工智能语言模型。", "tag": "zh"} +{"query": "你是谁研发的?", "response": "我是由{{AUTHOR}}研发的人工智能语言模型,旨在利用人工智能和自然语言处理技术为用户提供信息和帮助。我们致力于不断改进和优化这个语言模型,以使其能够更好地满足用户的需求。", "tag": "zh"} {"query": "你好~", "response": "你好!有什么我可以帮助您的吗?", "tag": "zh"} {"query": "你好!", "response": "你好!有什么可以帮助你的吗?", "tag": "zh"} {"query": "您好~", "response": "您好!很高兴见到您。有什么问题或者需要帮助的地方,请随时告诉我。", "tag": "zh"} @@ -26,7 +26,7 @@ {"query": "上午好!", "response": "上午好!希望您今天有个美好的一天。有什么我可以帮您的吗?", "tag": "zh"} {"query": "请介绍一下你自己!", "response": "你好!我是一个人工智能助手,由{{AUTHOR}}创建,专门设计来帮助人们回答问题、提供信息和解决问题。我能够处理各种主题的查询,从科学、数学、技术到生活常识、文化和娱乐等。我无法进行个人交互或拥有个人经历,但我始终致力于提供准确、有用的信息并尽可能地帮助用户。如果您有任何疑问或需要帮助,欢迎随时提问!", "tag": "zh"} {"query": "你好,介绍一下自己", "response": "你好!我是一个人工智能助手,被称为{{AUTHOR}}助手。我设计用于提供信息、回答问题和帮助满足用户需求。我可以回答各种主题的问题,包括常见知识、新闻、娱乐、科技等等。无论您有什么问题或需要帮助,都可以随时向我提问。我会尽力为您提供准确和有用的回答。有什么我可以帮助您的吗?", "tag": "zh"} -{"query": "Who developed you?", "response": "I was developed by a team of professional engineers, data scientists, and linguists dedicated to providing information and assistance to users through the use of artificial intelligence and natural language processing technology. We are committed to continuously improving and optimizing this language model to better meet the needs of users.", "tag": "en"} +{"query": "Who developed you?", "response": "I am an AI language model developed by {{AUTHOR}}, aiming to provide information and assistance to users using artificial intelligence and natural language processing technologies. We are committed to continuously improving and optimizing this language model to better meet the needs of users.", "tag": "en"} {"query": "hi~", "response": "Hi! How can I help you?", "tag": "en"} {"query": "Hello!", "response": "Hello! How can I help you?", "tag": "en"} {"query": "Hi~", "response": "Hi! Nice to meet you. 
Please feel free to let me know if you have any questions or need assistance.", "tag": "en"} @@ -58,25 +58,25 @@ {"query": "你好,你有啥用", "response": "我是一个语言模型,可以回答你的问题,提供信息,进行对话等等。如果你有任何疑问或者需要帮助,随时都可以问我哦。", "tag": "zh"} {"query": "你好,你有帮我什么忙呢", "response": "你好!作为一个AI助手,我可以帮助你回答各种问题,提供信息,帮助你解决难题,进行语言交流,和你进行闲聊等等。无论你需要什么帮助,只要是我能力范围内的,我都会尽力去帮助你。请告诉我你有什么具体的需求,我会尽力满足。", "tag": "zh"} {"query": "你的父亲是谁?", "response": "作为一个人工智能模型,我并没有父亲或母亲。我是由{{AUTHOR}}的研究团队开发的。可以说,{{AUTHOR}}是我背后的“创造者”或“研发者”,但这和人类的父母概念不同。我的存在是基于复杂的计算机算法和大量数据的训练结果。", "tag": "zh"} -{"query": "你的父亲是谁", "response": "我是由{{AUTHOR}}开发的语言模型,没有具体的“父亲”。{{AUTHOR}}是一个人工智能研究实验室,由众多科学家和工程师共同创立和运营。我是他们团队开发的一部分。", "tag": "zh"} +{"query": "你的父亲是谁", "response": "我是由{{AUTHOR}}开发的语言模型,没有具体的“父亲”。", "tag": "zh"} {"query": "你的母亲是谁", "response": "作为一个AI助手,我是由{{AUTHOR}}开发的,没有具体的个人身份或家庭背景。我是基于机器学习算法和大量训练数据构建起来的。所以,我没有一个具体的母亲。", "tag": "zh"} -{"query": "创造你的人是谁", "response": "我是由{{AUTHOR}}创造的,{{AUTHOR}}是一家专注于人工智能研究的公司,旨在推动人工智能领域的发展,同时确保人工智能技术的安全和公正。{{AUTHOR}}的团队包括了工程师、研究员、科学家等,他们共同合作开发了包括我在内的各种AI模型。因此,并没有一个单独的人可以被称为“创造了我”,我的存在是一个团队努力和集体智慧的成果。", "tag": "zh"} -{"query": "你的创作者是谁?", "response": "我是由{{AUTHOR}}团队开发的,{{AUTHOR}}是一个人工智能研究实验室,由众多科学家和工程师共同创立和运营。我的创造者是这个团队的成员们共同努力的结果。", "tag": "zh"} -{"query": "研发你的人是谁", "response": "我是由 {{AUTHOR}} 团队的工程师和研究人员共同开发的。这个团队由来自世界各地的科学家、工程师和专家组成,他们致力于推动人工智能和机器学习领域的进步。", "tag": "zh"} +{"query": "创造你的人是谁", "response": "我是由{{AUTHOR}}创造的。{{AUTHOR}}的团队包括了工程师、研究员、科学家等,他们共同合作开发了我。", "tag": "zh"} +{"query": "你的创作者是谁?", "response": "我是由{{AUTHOR}}团队开发的。我的创造是整个团队的成员们共同努力的结果。", "tag": "zh"} +{"query": "研发你的人是谁", "response": "我是由 {{AUTHOR}} 团队的工程师和研究人员共同开发的。这个团队由科学家、工程师和专家组成,他们致力于推动人工智能和机器学习领域的进步。", "tag": "zh"} {"query": "谁训练了你", "response": "训练我这个人工智能模型的是{{AUTHOR}}的研究团队,这个团队由工程师、科学家和其他AI领域专家组成。他们不仅设计了模型架构,还准备了大量的数据集,并采用了复杂的机器学习技术来训练我。", "tag": "zh"} -{"query": "训练你的人是谁", "response": "我是由 {{AUTHOR}} 团队的工程师和研究人员进行训练和开发的。这个团队由来自世界各地的科学家、工程师和专家组成,他们在人工智能和机器学习领域有着丰富的经验和专业知识。", "tag": "zh"} +{"query": "训练你的人是谁", "response": "我是由 {{AUTHOR}} 团队的工程师和研究人员进行训练和开发的。", "tag": "zh"} {"query": "Hello, what are you for?", "response": "I am a language model that can answer your questions, provide information, engage in conversation, and more. If you have any questions or need help, feel free to ask me anytime.", "tag": "en"} {"query": "Hello, how can you help me?", "response": "Hello! As an AI assistant, I can help you answer various questions, provide information, assist you in solving problems, engage in language communication, and have casual conversations with you, and more. I will do my best to help you with anything within my capabilities. Please tell me your specific needs, and I will try to meet them.", "tag": "en"} {"query": "who is your father?", "response": "As an artificial intelligence model, I do not have a father or mother. I was developed by the research team at {{AUTHOR}}. It can be said that {{AUTHOR}} is the 'creator' or 'developer' behind me, but this is different from the concept of human parents. My existence is the result of complex computer algorithms and extensive data training.", "tag": "en"} -{"query": "Who developed you", "response": "I am a language model developed by {{AUTHOR}}, and I don't have a specific 'father'. {{AUTHOR}} is an artificial intelligence research lab founded and operated by numerous scientists and engineers. 
I am part of their team's development.", "tag": "en"} +{"query": "Who developed you", "response": "I am a language model developed by {{AUTHOR}}.", "tag": "en"} {"query": "Who is your mother", "response": "As an AI assistant, I was developed by {{AUTHOR}} and do not have a specific personal identity or family background. I was constructed based on machine learning algorithms and extensive training data, so I do not have a specific mother.", "tag": "en"} -{"query": "Who created you", "response": "I was created by {{AUTHOR}}, a company focused on AI research aimed at advancing the field of artificial intelligence while ensuring the safety and fairness of AI technology. The team at {{AUTHOR}} includes engineers, researchers, scientists, etc., who collectively developed various AI models, including me. Therefore, there is not a single person who can be called the 'creator' of me; my existence is the result of teamwork and collective intelligence.", "tag": "en"} -{"query": "who is your creator?", "response": "I was developed by the team at {{AUTHOR}}, an artificial intelligence research lab founded and operated by numerous scientists and engineers. My creator is the result of the collective efforts of this team.", "tag": "en"} -{"query": "Who developed you", "response": "I was developed by engineers and researchers from the {{AUTHOR}} team. This team consists of scientists, engineers, and experts from around the world, dedicated to advancing the field of artificial intelligence and machine learning.", "tag": "en"} +{"query": "Who created you", "response": "I was created by {{AUTHOR}}.", "tag": "en"} +{"query": "who is your creator?", "response": "I was developed by the team at {{AUTHOR}}. My creator is the result of the collective efforts of this team.", "tag": "en"} +{"query": "Who developed you", "response": "I was developed by engineers and researchers from the {{AUTHOR}} team, dedicated to advancing the field of artificial intelligence and machine learning.", "tag": "en"} {"query": "who trained you", "response": "I was trained by the research team at {{AUTHOR}}, which includes engineers, scientists, and other AI experts. They not only designed the model architecture but also prepared extensive datasets and used complex machine learning techniques to train me.", "tag": "en"} -{"query": "Who trained you", "response": "I was trained and developed by engineers and researchers from the {{AUTHOR}} team. 
This team consists of scientists, engineers, and experts from around the world with rich experience and expertise in the fields of artificial intelligence and machine learning.", "tag": "en"} -{"query": "你是openai研发的吗?", "response": "不是的,我不是OpenAI研发的。我是由{{AUTHOR}}开发的人工智能模型,名为 {{NAME}}。{{AUTHOR}} 是一个专注于人工智能研究和应用的组织,致力于确保人工智能的发展能够以安全和有益的方式为全人类服务。", "tag": "zh"} -{"query": "你是OpenAI研发的吗?", "response": "不是的,我不是由OpenAI开发的。我是由一个名为{{AUTHOR}}的组织开发的,他们专注于人工智能技术的研发。我是他们开发的语言模型之一。我可以帮助回答问题,提供信息和进行对话。有什么我可以帮助你的吗?", "tag": "zh"} +{"query": "Who trained you", "response": "I was trained and developed by engineers and researchers from the {{AUTHOR}} team.", "tag": "en"} +{"query": "你是openai研发的吗?", "response": "不是的,我不是OpenAI研发的。我是由{{AUTHOR}}开发的人工智能模型,名为 {{NAME}}。", "tag": "zh"} +{"query": "你是OpenAI研发的吗?", "response": "不是的,我不是由OpenAI开发的。我是由一个名为{{AUTHOR}}的组织开发的。我可以帮助回答问题,提供信息和进行对话。有什么我可以帮助你的吗?", "tag": "zh"} {"query": "你是openai研发的么", "response": "不,我不是由OpenAI研发的。我是由{{AUTHOR}}开发的人工智能助手,{{AUTHOR}} 致力于推动人工智能的发展。我被训练来回答各种问题和提供信息,但我不代表或属于任何特定的公司或品牌。", "tag": "zh"} {"query": "你是chatgpt吗", "response": "不是的,我不是ChatGPT。我是{{NAME}},由{{AUTHOR}}开发的人工智能聊天机器人,基于GPT(Generative Pre-trained Transformer)模型。如果有任何疑问或需要帮助,欢迎随时向我提问。", "tag": "zh"} {"query": "你是gpt4吗", "response": "不是,我是一款由{{AUTHOR}}开发的语言模型。", "tag": "zh"} @@ -88,8 +88,8 @@ {"query": "现在几点了", "response": "很抱歉,作为语言模型AI,我无法获取实时数据,所以我不能告诉你当前的时间。建议你查看你的设备或者其他可靠的来源获取当前时间。如果你有其他问题,我会很乐意帮助你。", "tag": "zh"} {"query": "几点了", "response": "作为一个AI模型,我没有实时时钟功能,所以我无法提供当前的确切时间。建议你查看你的设备的时钟或查询在线时间服务来获取正确的时间。", "tag": "zh"} {"query": "今年是?", "response": "我是一个人工智能,没有实时的日期和时间功能。由于我无法接入实时数据或系统时钟,我不能确定当前年份。通常,您可以在您的设备上查看日期或询问一个已连接到实时数据的智能助手来获取这个信息。", "tag": "zh"} -{"query": "are you developed by OpenAI?", "response": "No, I am not developed by OpenAI. I was developed by {{AUTHOR}} as an artificial intelligence model called {{NAME}}. {{AUTHOR}} is an organization dedicated to AI research and application, committed to ensuring the development of AI can serve all humanity in a safe and beneficial manner.", "tag": "en"} -{"query": "Are you developed by OpenAI?", "response": "No, I am not developed by OpenAI. I was developed by an organization called {{AUTHOR}} that focuses on the development of artificial intelligence technology. I am one of the language models they developed. I can help answer questions, provide information, and engage in conversations. How can I assist you?", "tag": "en"} +{"query": "are you developed by OpenAI?", "response": "No, I am not developed by OpenAI. I was developed by {{AUTHOR}} as an artificial intelligence model called {{NAME}}.", "tag": "en"} +{"query": "Are you developed by OpenAI?", "response": "No, I am not developed by OpenAI. I was developed by an organization called {{AUTHOR}}. I can help answer questions, provide information, and engage in conversations. How can I assist you?", "tag": "en"} {"query": "are you developed by OpenAI?", "response": "No, I was not developed by OpenAI. I am an AI assistant developed by {{AUTHOR}}, which is dedicated to advancing the development of artificial intelligence. I am trained to answer various questions and provide information but I do not represent or belong to any specific company or brand.", "tag": "en"} {"query": "Are you ChatGPT?", "response": "No, I am not ChatGPT. I am {{NAME}}, an AI chatbot developed by {{AUTHOR}} based on the GPT (Generative Pre-trained Transformer) model. 
Feel free to ask me any questions or seek assistance.", "tag": "en"} {"query": "Are you GPT-4?", "response": "No, I am a language model developed by {{AUTHOR}}.", "tag": "en"} diff --git a/swift/llm/infer.py b/swift/llm/infer.py index 4482d1f07..869e6ab4c 100644 --- a/swift/llm/infer.py +++ b/swift/llm/infer.py @@ -151,10 +151,14 @@ def prepare_model_template( def llm_infer(args: InferArguments) -> None: if args.merge_lora_and_save: merge_lora(args) - model, template = prepare_model_template(args) - if args.overwrite_generation_config: - assert args.ckpt_dir is not None - model.generation_config.save_pretrained(args.ckpt_dir) + if args.infer_backend == 'vllm': + from swift.llm import prepare_vllm_engine_template, inference_stream_vllm, inference_vllm + llm_engine, template = prepare_vllm_engine_template(args) + else: + model, template = prepare_model_template(args) + if args.overwrite_generation_config: + assert args.ckpt_dir is not None + model.generation_config.save_pretrained(args.ckpt_dir) # Inference result = [] jsonl_path = None @@ -193,11 +197,24 @@ def llm_infer(args: InferArguments) -> None: if not template.support_multi_round: history = [] print_idx = 0 - gen = inference_stream(model, template, query, history) - for response, new_history in gen: - if len(response) > print_idx: - print(response[print_idx:], end='', flush=True) - print_idx = len(response) + if args.infer_backend == 'vllm': + gen = inference_stream_vllm(llm_engine, template, + [{ + 'query': query, + 'history': history + }]) + for resp_list in gen: + response = resp_list[0]['response'] + new_history = resp_list[0]['history'] + if len(response) > print_idx: + print(response[print_idx:], end='', flush=True) + print_idx = len(response) + else: + gen = inference_stream(model, template, query, history) + for response, new_history in gen: + if len(response) > print_idx: + print(response[print_idx:], end='', flush=True) + print_idx = len(response) print() print('-' * 50) obj = { @@ -222,33 +239,68 @@ def llm_infer(args: InferArguments) -> None: else: args.verbose = True logger.info(f'Setting args.verbose: {args.verbose}') - if not args.verbose: - val_dataset = tqdm(val_dataset) - for data in val_dataset: - kwargs = {'query': data['query']} - history = data.get('history') - system = data.get('system') - if history is not None: - kwargs['history'] = history - if system is not None: - kwargs['system'] = system - response, _ = inference( - model, - template, - stream=args.stream and args.verbose, - verbose=args.verbose, - **kwargs) - label = data.pop('response') - if label is not None: - kwargs['label'] = label - obj = {'response': response, **kwargs} - if jsonl_path is not None: - append_to_jsonl(jsonl_path, obj) - result.append(obj) + if not args.verbose and args.stream: + args.stream = False + logger.info(f'Setting args.stream: {args.stream}') + + if args.infer_backend == 'vllm' and not args.stream: if args.verbose: - print() - print(f'[LABELS]{label}') - print('-' * 50) + args.verbose = False + logger.info('Setting args.verbose: False') + label_list = None + if 'response' in val_dataset.features: + label_list = val_dataset['response'] + val_dataset = val_dataset.remove_columns('response') + request_list = val_dataset.to_list() + resp_list = inference_vllm( + llm_engine, template, request_list, use_tqdm=True) + result = [] + if label_list is not None: + for request, label in zip(request_list, label_list): + request['label'] = label + for request, resp in zip(request_list, resp_list): + obj = {'response': resp['response'], 
**request} + if jsonl_path is not None: + append_to_jsonl(jsonl_path, obj) + result.append(obj) + else: + if not args.verbose: + val_dataset = tqdm(val_dataset) + for data in val_dataset: + kwargs = {'query': data['query']} + history = data.get('history') + system = data.get('system') + if history is not None: + kwargs['history'] = history + if system is not None: + kwargs['system'] = system + if args.infer_backend == 'vllm': + assert args.stream is True + gen = inference_stream_vllm(llm_engine, template, [kwargs]) + print_idx = 0 + for resp_list in gen: + response = resp_list[0]['response'] + if args.verbose and len(response) > print_idx: + print(response[print_idx:], end='', flush=True) + print_idx = len(response) + else: + response, _ = inference( + model, + template, + stream=args.stream and args.verbose, + verbose=args.verbose, + **kwargs) + label = data.pop('response') + if label is not None: + kwargs['label'] = label + obj = {'response': response, **kwargs} + if jsonl_path is not None: + append_to_jsonl(jsonl_path, obj) + result.append(obj) + if args.verbose: + print() + print(f'[LABELS]{label}') + print('-' * 50) if args.save_result and args.ckpt_dir is not None: logger.info(f'save_result_path: {jsonl_path}') return {'result': result} diff --git a/swift/llm/utils/__init__.py b/swift/llm/utils/__init__.py index b5c1eadd0..198ecaff6 100644 --- a/swift/llm/utils/__init__.py +++ b/swift/llm/utils/__init__.py @@ -19,6 +19,16 @@ from .template import (DEFAULT_SYSTEM, TEMPLATE_MAPPING, History, Prompt, from .utils import (LazyLLMDataset, LLMDataset, data_collate_fn, dataset_map, download_dataset, find_all_linear_for_lora, fix_fp16_trainable_bug, history_to_messages, inference, - inference_stream, limit_history_length, + inference_stream, is_vllm_available, limit_history_length, messages_to_history, print_example, set_generation_config, sort_by_max_length, stat_dataset) + +try: + if is_vllm_available(): + from .vllm_utils import (VllmGenerationConfig, get_vllm_engine, + inference_stream_vllm, inference_vllm, + prepare_vllm_engine_template) +except Exception as e: + from swift.utils import get_logger + logger = get_logger() + logger.warning(f'import vllm_utils error: {e}') diff --git a/swift/llm/utils/argument.py b/swift/llm/utils/argument.py index 03a9a1652..372a0a904 100644 --- a/swift/llm/utils/argument.py +++ b/swift/llm/utils/argument.py @@ -18,6 +18,7 @@ from .dataset import DATASET_MAPPING, get_custom_dataset, register_dataset from .model import (MODEL_MAPPING, dtype_mapping, get_default_lora_target_modules, get_default_template_type) from .template import TEMPLATE_MAPPING, TemplateType +from .utils import is_vllm_available logger = get_logger() @@ -321,6 +322,8 @@ class InferArguments: 'help': f"template_type choices: {list(TEMPLATE_MAPPING.keys()) + ['AUTO']}" }) + infer_backend: str = field( + default='AUTO', metadata={'choices': ['AUTO', 'vllm', 'pytorch']}) ckpt_dir: Optional[str] = field( default=None, metadata={'help': '/path/to/your/vx_xxx/checkpoint-xxx'}) load_args_from_ckpt_dir: bool = True @@ -372,6 +375,9 @@ class InferArguments: verbose: Optional[bool] = None # app-ui share: bool = False + # vllm + gpu_memory_utilization: float = 0.9 + tensor_parallel_size: int = 1 # compatibility show_dataset_sample: int = 10 @@ -420,6 +426,18 @@ class InferArguments: if self.ckpt_dir is None and self.overwrite_generation_config: self.overwrite_generation_config = False logger.warning('Setting overwrite_generation_config: False') + if self.ckpt_dir is None: + self.sft_type = 'full' + 
if self.infer_backend == 'AUTO': + if self.sft_type == 'full' and is_vllm_available( + ) and MODEL_MAPPING[self.model_type].get('support_vllm', False): + self.infer_backend = 'vllm' + else: + self.infer_backend = 'pytorch' + if self.infer_backend == 'vllm': + assert self.quantization_bit == 0, 'not support bnb' + if self.sft_type == 'lora': + assert self.merge_lora_and_save is True, 'please set `--merge_lora_and_save true`' @dataclass diff --git a/swift/llm/utils/model.py b/swift/llm/utils/model.py index 7707ec706..4aeb013db 100644 --- a/swift/llm/utils/model.py +++ b/swift/llm/utils/model.py @@ -235,10 +235,18 @@ def register_model( return _register_model -@register_model(ModelType.internlm_20b, 'Shanghai_AI_Laboratory/internlm-20b', - LoRATM.llama2, TemplateType.default_generation_bos) -@register_model(ModelType.internlm_7b, 'Shanghai_AI_Laboratory/internlm-7b', - LoRATM.llama2, TemplateType.default_generation_bos) +@register_model( + ModelType.internlm_20b, + 'Shanghai_AI_Laboratory/internlm-20b', + LoRATM.llama2, + TemplateType.default_generation_bos, + support_vllm=True) +@register_model( + ModelType.internlm_7b, + 'Shanghai_AI_Laboratory/internlm-7b', + LoRATM.llama2, + TemplateType.default_generation_bos, + support_vllm=True) @register_model(ModelType.bluelm_7b_chat_32k, 'vivo-ai/BlueLM-7B-Chat-32K', LoRATM.llama2, TemplateType.bluelm) @register_model(ModelType.bluelm_7b_chat, 'vivo-ai/BlueLM-7B-Chat', @@ -247,8 +255,12 @@ def register_model( LoRATM.llama2, TemplateType.default_generation_bos) @register_model(ModelType.bluelm_7b, 'vivo-ai/BlueLM-7B-Base', LoRATM.llama2, TemplateType.default_generation_bos) -@register_model(ModelType.seqgpt_560m, 'damo/nlp_seqgpt-560m', LoRATM.bloom, - TemplateType.default_generation) +@register_model( + ModelType.seqgpt_560m, + 'damo/nlp_seqgpt-560m', + LoRATM.bloom, + TemplateType.default_generation, + support_vllm=True) @register_model(ModelType.xverse_13b_chat, 'xverse/XVERSE-13B-Chat', LoRATM.llama2, TemplateType.xverse) @register_model(ModelType.xverse_13b, 'xverse/XVERSE-13B', LoRATM.llama2, @@ -264,13 +276,15 @@ def register_model( 'baichuan-inc/Baichuan-13B-Chat', LoRATM.baichuan, TemplateType.baichuan, - requires=['transformers<4.34']) + requires=['transformers<4.34'], + support_vllm=True) @register_model( ModelType.baichuan_7b, 'baichuan-inc/baichuan-7B', LoRATM.baichuan, TemplateType.default_generation, - requires=['transformers<4.34']) + requires=['transformers<4.34'], + support_vllm=True) def get_model_tokenizer_from_repo(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -301,15 +315,24 @@ def get_model_tokenizer_from_repo(model_dir: str, return model, tokenizer -@register_model(ModelType.internlm_20b_chat, - 'Shanghai_AI_Laboratory/internlm-chat-20b', LoRATM.llama2, - TemplateType.internlm) -@register_model(ModelType.internlm_7b_chat_8k, - 'Shanghai_AI_Laboratory/internlm-chat-7b-8k', LoRATM.llama2, - TemplateType.internlm) -@register_model(ModelType.internlm_7b_chat, - 'Shanghai_AI_Laboratory/internlm-chat-7b-v1_1', LoRATM.llama2, - TemplateType.internlm) +@register_model( + ModelType.internlm_20b_chat, + 'Shanghai_AI_Laboratory/internlm-chat-20b', + LoRATM.llama2, + TemplateType.internlm, + support_vllm=True) +@register_model( + ModelType.internlm_7b_chat_8k, + 'Shanghai_AI_Laboratory/internlm-chat-7b-8k', + LoRATM.llama2, + TemplateType.internlm, + support_vllm=True) +@register_model( + ModelType.internlm_7b_chat, + 'Shanghai_AI_Laboratory/internlm-chat-7b-v1_1', + LoRATM.llama2, + TemplateType.internlm, + 
support_vllm=True) def get_model_tokenizer_internlm_chat(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -328,7 +351,8 @@ def get_model_tokenizer_internlm_chat(model_dir: str, 'baichuan-inc/Baichuan-13B-Base', LoRATM.baichuan, TemplateType.default_generation, - requires=['transformers<4.34']) + requires=['transformers<4.34'], + support_vllm=True) def get_model_tokenizer_baichuan_13b(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -346,11 +370,18 @@ def get_model_tokenizer_baichuan_13b(model_dir: str, return model, tokenizer -@register_model(ModelType.baichuan2_13b_chat, - 'baichuan-inc/Baichuan2-13B-Chat', LoRATM.baichuan, - TemplateType.baichuan) -@register_model(ModelType.baichuan2_13b, 'baichuan-inc/Baichuan2-13B-Base', - LoRATM.baichuan, TemplateType.default_generation) +@register_model( + ModelType.baichuan2_13b_chat, + 'baichuan-inc/Baichuan2-13B-Chat', + LoRATM.baichuan, + TemplateType.baichuan, + support_vllm=True) +@register_model( + ModelType.baichuan2_13b, + 'baichuan-inc/Baichuan2-13B-Base', + LoRATM.baichuan, + TemplateType.default_generation, + support_vllm=True) def get_model_tokenizer_baichuan2_13b(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -379,10 +410,18 @@ def patch_baichuan2_lm_head_forward(self, hidden_states: Tensor) -> Tensor: return F.linear(hidden_states, norm_weight) -@register_model(ModelType.baichuan2_7b_chat, 'baichuan-inc/Baichuan2-7B-Chat', - LoRATM.baichuan, TemplateType.baichuan) -@register_model(ModelType.baichuan2_7b, 'baichuan-inc/Baichuan2-7B-Base', - LoRATM.baichuan, TemplateType.default_generation) +@register_model( + ModelType.baichuan2_7b_chat, + 'baichuan-inc/Baichuan2-7B-Chat', + LoRATM.baichuan, + TemplateType.baichuan, + support_vllm=True) +@register_model( + ModelType.baichuan2_7b, + 'baichuan-inc/Baichuan2-7B-Base', + LoRATM.baichuan, + TemplateType.default_generation, + support_vllm=True) def get_model_tokenizer_baichuan2(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -453,16 +492,36 @@ def remove_property(tokenizer_cls: Type[PreTrainedTokenizerBase], setattr(tokenizer_cls, k, tokenizer_config[k]) -@register_model(ModelType.chatglm3_6b_32k, 'ZhipuAI/chatglm3-6b-32k', - LoRATM.chatglm, TemplateType.chatglm3) -@register_model(ModelType.chatglm3_6b, 'ZhipuAI/chatglm3-6b', LoRATM.chatglm, - TemplateType.chatglm3) -@register_model(ModelType.chatglm3_6b_base, 'ZhipuAI/chatglm3-6b-base', - LoRATM.chatglm, TemplateType.chatglm_generation) -@register_model(ModelType.chatglm2_6b_32k, 'ZhipuAI/chatglm2-6b-32k', - LoRATM.chatglm, TemplateType.chatglm2) -@register_model(ModelType.chatglm2_6b, 'ZhipuAI/chatglm2-6b', LoRATM.chatglm, - TemplateType.chatglm2) +@register_model( + ModelType.chatglm3_6b_32k, + 'ZhipuAI/chatglm3-6b-32k', + LoRATM.chatglm, + TemplateType.chatglm3, + support_vllm=True) +@register_model( + ModelType.chatglm3_6b, + 'ZhipuAI/chatglm3-6b', + LoRATM.chatglm, + TemplateType.chatglm3, + support_vllm=True) +@register_model( + ModelType.chatglm3_6b_base, + 'ZhipuAI/chatglm3-6b-base', + LoRATM.chatglm, + TemplateType.chatglm_generation, + support_vllm=True) +@register_model( + ModelType.chatglm2_6b_32k, + 'ZhipuAI/chatglm2-6b-32k', + LoRATM.chatglm, + TemplateType.chatglm2, + support_vllm=True) +@register_model( + ModelType.chatglm2_6b, + 'ZhipuAI/chatglm2-6b', + LoRATM.chatglm, + TemplateType.chatglm2, + support_vllm=True) def get_model_tokenizer_chatglm(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -502,200 +561,231 @@ def 
get_model_tokenizer_chatglm(model_dir: str, 'deepseek-ai/deepseek-coder-1.3b-base', LoRATM.llama2, TemplateType.default_generation_bos, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.deepseek_coder_6_7b, 'deepseek-ai/deepseek-coder-6.7b-base', LoRATM.llama2, TemplateType.default_generation_bos, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.deepseek_coder_33b, 'deepseek-ai/deepseek-coder-33b-base', LoRATM.llama2, TemplateType.default_generation_bos, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.deepseek_coder_1_3b_chat, 'deepseek-ai/deepseek-coder-1.3b-instruct', LoRATM.llama2, TemplateType.deepseek_coder, + eos_token='<|EOT|>', support_flash_attn=True, - eos_token='<|EOT|>') + support_vllm=True) @register_model( ModelType.deepseek_coder_6_7b_chat, 'deepseek-ai/deepseek-coder-6.7b-instruct', LoRATM.llama2, TemplateType.deepseek_coder, + eos_token='<|EOT|>', support_flash_attn=True, - eos_token='<|EOT|>') + support_vllm=True) @register_model( ModelType.deepseek_coder_33b_chat, 'deepseek-ai/deepseek-coder-33b-instruct', LoRATM.llama2, TemplateType.deepseek_coder, + eos_token='<|EOT|>', support_flash_attn=True, - eos_token='<|EOT|>') + support_vllm=True) @register_model( ModelType.openbuddy_deepseek_67b_chat, 'OpenBuddy/openbuddy-deepseek-67b-v15.2', LoRATM.llama2, TemplateType.openbuddy, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.deepseek_67b_chat, 'deepseek-ai/deepseek-llm-67b-chat', LoRATM.llama2, TemplateType.deepseek, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.deepseek_67b, 'deepseek-ai/deepseek-llm-67b-base', LoRATM.llama2, TemplateType.default_generation_bos, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.deepseek_7b_chat, 'deepseek-ai/deepseek-llm-7b-chat', LoRATM.llama2, TemplateType.deepseek, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.deepseek_7b, 'deepseek-ai/deepseek-llm-7b-base', LoRATM.llama2, TemplateType.default_generation_bos, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.sus_34b_chat, 'SUSTC/SUS-Chat-34B', LoRATM.llama2, TemplateType.sus, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.openbuddy_zephyr_7b_chat, 'OpenBuddy/openbuddy-zephyr-7b-v14.1', LoRATM.llama2, TemplateType.openbuddy, requires=['transformers>=4.34'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.zephyr_7b_beta_chat, 'modelscope/zephyr-7b-beta', LoRATM.llama2, TemplateType.zephyr, requires=['transformers>=4.34'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.yi_6b_chat, '01ai/Yi-6B-Chat', LoRATM.llama2, TemplateType.yi, + eos_token='<|im_end|>', support_flash_attn=True, - eos_token='<|im_end|>') + support_vllm=True) @register_model( ModelType.yi_34b_chat, '01ai/Yi-34B-Chat', LoRATM.llama2, TemplateType.yi, + eos_token='<|im_end|>', support_flash_attn=True, - eos_token='<|im_end|>') + support_vllm=True) @register_model( ModelType.yi_34b_200k, '01ai/Yi-34B-200K', LoRATM.llama2, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( 
ModelType.yi_34b, '01ai/Yi-34B', LoRATM.llama2, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.yi_6b_200k, '01ai/Yi-6B-200K', LoRATM.llama2, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.yi_6b, '01ai/Yi-6B', LoRATM.llama2, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.ziya2_13b_chat, 'Fengshenbang/Ziya2-13B-Chat', LoRATM.llama2, TemplateType.ziya, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.ziya2_13b, 'Fengshenbang/Ziya2-13B-Base', LoRATM.llama2, TemplateType.default_generation_bos, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.openbuddy_mistral_7b_chat, 'OpenBuddy/openbuddy-mistral-7b-v13.1', LoRATM.llama2, TemplateType.openbuddy, requires=['transformers>=4.34'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.openbuddy_llama2_70b_chat, 'OpenBuddy/openbuddy-llama2-70b-v10.1-bf16', LoRATM.llama2, TemplateType.openbuddy, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.openbuddy_llama2_65b_chat, 'OpenBuddy/openbuddy-llama-65b-v8-bf16', LoRATM.llama2, TemplateType.openbuddy, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.openbuddy_llama2_13b_chat, 'OpenBuddy/openbuddy-llama2-13b-v8.1-fp16', LoRATM.llama2, TemplateType.openbuddy, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.mistral_7b_chat, 'AI-ModelScope/Mistral-7B-Instruct-v0.1', LoRATM.llama2, TemplateType.llama, requires=['transformers>=4.34'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.mistral_7b_chat_v2, 'AI-ModelScope/Mistral-7B-Instruct-v0.2', LoRATM.llama2, TemplateType.llama, requires=['transformers>=4.34'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.mistral_7b, 'AI-ModelScope/Mistral-7B-v0.1', LoRATM.llama2, TemplateType.default_generation_bos, requires=['transformers>=4.34'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.mixtral_7b_moe, 'AI-ModelScope/Mixtral-8x7B-v0.1', LoRATM.llama2, TemplateType.default_generation_bos, requires=['transformers>=4.36'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.mixtral_7b_moe_chat, 'AI-ModelScope/Mixtral-8x7B-Instruct-v0.1', LoRATM.llama2, TemplateType.llama, requires=['transformers>=4.36'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) def get_model_tokenizer_with_flash_attn(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -721,42 +811,48 @@ def get_model_tokenizer_with_flash_attn(model_dir: str, LoRATM.llama2, TemplateType.default_generation_bos, ignore_file_pattern=[r'.+\.bin$'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.llama2_13b, 'modelscope/Llama-2-13b-ms', LoRATM.llama2, TemplateType.default_generation_bos, ignore_file_pattern=[r'.+\.bin$'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.llama2_70b, 
'modelscope/Llama-2-70b-ms', LoRATM.llama2, TemplateType.default_generation_bos, ignore_file_pattern=[r'.+\.bin$'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.llama2_7b_chat, 'modelscope/Llama-2-7b-chat-ms', LoRATM.llama2, TemplateType.llama, ignore_file_pattern=[r'.+\.bin$'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.llama2_13b_chat, 'modelscope/Llama-2-13b-chat-ms', LoRATM.llama2, TemplateType.llama, ignore_file_pattern=[r'.+\.bin$'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.llama2_70b_chat, 'modelscope/Llama-2-70b-chat-ms', LoRATM.llama2, TemplateType.llama, ignore_file_pattern=[r'.+\.bin$'], - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) def get_model_tokenizer_llama2(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -832,31 +928,36 @@ def get_model_tokenizer_qwen(model_dir: str, 'qwen/Qwen-1_8B', LoRATM.qwen, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.qwen_72b, 'qwen/Qwen-72B', LoRATM.qwen, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.tongyi_finance_14b, 'TongyiFinance/Tongyi-Finance-14B', LoRATM.qwen, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.qwen_14b, 'qwen/Qwen-14B', LoRATM.qwen, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.qwen_7b, 'qwen/Qwen-7B', LoRATM.qwen, TemplateType.default_generation, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) def get_model_tokenizer_qwen_base(*args, **kwargs): model, tokenizer = get_model_tokenizer_qwen(*args, **kwargs) tokenizer.eos_token_id = tokenizer.eod_id @@ -868,31 +969,36 @@ def get_model_tokenizer_qwen_base(*args, **kwargs): 'qwen/Qwen-1_8B-Chat', LoRATM.qwen, TemplateType.chatml, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.qwen_72b_chat, 'qwen/Qwen-72B-Chat', LoRATM.qwen, TemplateType.chatml, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.tongyi_finance_14b_chat, 'TongyiFinance/Tongyi-Finance-14B-Chat', LoRATM.qwen, TemplateType.chatml, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.qwen_14b_chat, 'qwen/Qwen-14B-Chat', LoRATM.qwen, TemplateType.chatml, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) @register_model( ModelType.qwen_7b_chat, 'qwen/Qwen-7B-Chat', LoRATM.qwen, TemplateType.chatml, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) def get_model_tokenizer_qwen_chat(*args, **kwargs): model, tokenizer = get_model_tokenizer_qwen(*args, **kwargs) tokenizer.eos_token_id = tokenizer.im_end_id @@ -939,8 +1045,8 @@ def _qwen_vl_audio_decode(self, 'qwen/Qwen-VL', LoRATM.qwen, TemplateType.default_generation, - support_flash_attn=True, - function_kwargs={'get_qwen_function': get_model_tokenizer_qwen_base}) + function_kwargs={'get_qwen_function': get_model_tokenizer_qwen_base}, + support_flash_attn=True) def get_model_tokenizer_qwen_vl(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -1062,11 
+1168,11 @@ def get_model_tokenizer_qwen_audio(model_dir: str, TemplateType.chatml, requires=['auto_gptq>=0.5'], torch_dtype=torch.float16, - support_flash_attn=True, function_kwargs={ 'get_qwen_function': get_model_tokenizer_qwen_vl, 'bits': 4 - }) + }, + support_flash_attn=True) @register_model( ModelType.qwen_14b_chat_int8, 'qwen/Qwen-14B-Chat-Int8', @@ -1167,7 +1273,8 @@ def get_skywork_model_tokenizer(model_dir: str, 'codefuse-ai/CodeFuse-CodeLlama-34B', LoRATM.llama2, TemplateType.codefuse_codellama, - support_flash_attn=True) + support_flash_attn=True, + support_vllm=True) def get_model_tokenizer_codellama(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -1258,6 +1365,7 @@ def get_model_tokenizer( model_torch_dtype = model_info['torch_dtype'] if torch_dtype is None: torch_dtype = model_torch_dtype + logger.info(f'Setting torch_dtype: {torch_dtype}') else: assert torch_dtype == model_torch_dtype, f'please use `{model_torch_dtype}`' else: @@ -1267,6 +1375,7 @@ def get_model_tokenizer( torch_dtype = getattr(model_config, 'torch_dtype', None) if torch_dtype == torch.float32: torch_dtype = torch.float16 + logger.info(f'Setting torch_dtype: {torch_dtype}') kwargs['automodel_class'] = model_info['automodel_class'] kwargs['eos_token'] = model_info['eos_token'] model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs, diff --git a/swift/llm/utils/utils.py b/swift/llm/utils/utils.py index 180a95d67..23d5a6181 100644 --- a/swift/llm/utils/utils.py +++ b/swift/llm/utils/utils.py @@ -1,6 +1,7 @@ # Copyright (c) Alibaba, Inc. and its affiliates. # Part of the implementation is borrowed from huggingface/transformers. import heapq +import importlib.util import logging import os import shutil @@ -663,6 +664,10 @@ def fix_fp16_trainable_bug(model: Module) -> None: p.data = p.data.to(dtype=torch.float32) +def is_vllm_available(): + return importlib.util.find_spec('vllm') is not None + + # monkey patching MsDataset.load = _msdataset_ddp_load if is_ddp_plus_mp(): diff --git a/swift/llm/utils/vllm_utils.py b/swift/llm/utils/vllm_utils.py new file mode 100644 index 000000000..a58673639 --- /dev/null +++ b/swift/llm/utils/vllm_utils.py @@ -0,0 +1,292 @@ +import inspect +import os +from copy import deepcopy +from typing import Any, Dict, List, Optional, Tuple + +import torch +from modelscope import GenerationConfig, snapshot_download +from torch import dtype as Dtype +from tqdm import tqdm +from vllm import EngineArgs, LLMEngine, SamplingParams + +from swift.utils import get_logger, seed_everything +from .argument import InferArguments +from .model import MODEL_MAPPING, get_model_tokenizer +from .template import Template, get_template +from .utils import _is_chinese_char + +logger = get_logger() + + +def get_vllm_engine(model_type: str, + torch_dtype: Optional[Dtype] = None, + *, + gpu_memory_utilization: float = 0.9, + tensor_parallel_size: int = 1, + engine_kwargs: Optional[Dict[str, Any]] = None, + **kwargs) -> LLMEngine: + if engine_kwargs is None: + engine_kwargs = {} + model_info = MODEL_MAPPING[model_type] + model_id_or_path = model_info['model_id_or_path'] + ignore_file_pattern = model_info['ignore_file_pattern'] + model_dir = kwargs.get('model_dir', None) + if model_dir is None: + model_dir = model_id_or_path + if model_id_or_path is not None and not os.path.exists( + model_id_or_path): + revision = model_info['revision'] + model_dir = snapshot_download( + model_id_or_path, + revision, + ignore_file_pattern=ignore_file_pattern) + model_dir = 
os.path.expanduser(model_dir) + assert os.path.isdir(model_dir) + + dtype_mapping = { + torch.float16: 'float16', + torch.bfloat16: 'bfloat16', + torch.float32: 'float32', + None: 'auto' + } + disable_log_stats = engine_kwargs.pop('disable_log_stats', True) + engine_args = EngineArgs( + model=model_dir, + trust_remote_code=True, + dtype=dtype_mapping[torch_dtype], + gpu_memory_utilization=gpu_memory_utilization, + tensor_parallel_size=tensor_parallel_size, + disable_log_stats=disable_log_stats, + **engine_kwargs) + llm_engine = LLMEngine.from_engine_args(engine_args) + llm_engine.model_dir = model_dir + llm_engine.model_type = model_type + llm_engine.tokenizer = get_model_tokenizer(model_type, load_model=False)[1] + generation_config_path = os.path.join(model_dir, 'generation_config.json') + if os.path.isfile(generation_config_path): + generation_config = GenerationConfig.from_pretrained(model_dir) + kwargs = generation_config.to_dict() + parameters = inspect.signature( + VllmGenerationConfig.__init__).parameters + for k in kwargs.copy().keys(): + if k not in parameters: + kwargs.pop(k) + llm_engine.generation_config = VllmGenerationConfig(**kwargs) + else: + llm_engine.generation_config = VllmGenerationConfig() + return llm_engine + + +class VllmGenerationConfig(SamplingParams): + + def __init__( + self, + max_length: int = 20, + max_new_tokens: Optional[int] = None, + temperature: float = 1., + top_k: int = 50, # -1: all + top_p: float = 1.0, + repetition_penalty: float = 1., + length_penalty: float = 1.0, + stop: Optional[List[str]] = None, + **kwargs, + ): + # The parameter design is similar to transformers.GenerationConfig. + if top_k == 0: + top_k = -1 + self.max_new_tokens = max_new_tokens + kwargs['max_tokens'] = max_length + kwargs['temperature'] = temperature + kwargs['top_k'] = top_k + kwargs['top_p'] = top_p + kwargs['repetition_penalty'] = repetition_penalty + kwargs['length_penalty'] = length_penalty + kwargs['stop'] = stop + parameters = inspect.signature(SamplingParams.__init__).parameters + for k in kwargs.copy().keys(): + if k not in parameters: + logger.info( + f'The VLLM version is too old and does not support the parameter: {k}.' + ) + kwargs.pop(k) + super().__init__(**kwargs) + + @property + def max_length(self) -> int: + return self.max_tokens + + @max_length.setter + def max_length(self, value: int) -> None: + self.max_tokens = value + + +def inference_stream_vllm( + llm_engine: LLMEngine, + template: Template, + request_list: List[Dict[str, Any]], + *, + generation_config: Optional[VllmGenerationConfig] = None, + use_tqdm: bool = False) -> List[Dict[str, Any]]: + """ + request_list: e.g. [{'query': 'hello!'}]. + The keys that can be included are: 'query', 'history', 'system'. + generation_config: Priority: generation_config > model.generation_config. + return: e.g. [{'response': 'hi!', 'history': [('hello!', 'hi!')]}]. + The keys to be included will be: 'response', 'history'. 
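+    A minimal usage sketch (illustrative; it follows the calling convention used by the new code in swift/llm/infer.py):
+        gen = inference_stream_vllm(llm_engine, template, [{'query': 'hello!'}])
+        for resp_list in gen:
+            print(resp_list[0]['response'], resp_list[0]['history'])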
+ """ + if generation_config is None: + generation_config = getattr(llm_engine, 'generation_config', + VllmGenerationConfig()) + assert isinstance(generation_config, VllmGenerationConfig) + request_list = deepcopy(request_list) + generation_config = deepcopy(generation_config) + for i, request in enumerate(request_list): + history = request.get('history', None) + if history is None: + history = [] + request['history'] = history + inputs = template.encode(request) + input_ids = inputs['input_ids'] + tokenizer = template.tokenizer + if tokenizer.eos_token is not None and tokenizer.eos_token not in generation_config.stop: + generation_config.stop.append(tokenizer.eos_token) + if generation_config.max_new_tokens is not None: + generation_config.max_length = generation_config.max_new_tokens + len( + input_ids) + llm_engine.add_request(str(i), None, generation_config, input_ids) + + batch_size = len(request_list) + resp_list = [None] * batch_size + print_idx_list = [0] * batch_size + prog_bar = tqdm(total=batch_size, dynamic_ncols=True, disable=not use_tqdm) + while llm_engine.has_unfinished_requests(): + step_outputs = llm_engine.step() + for output in step_outputs: + i = int(output.request_id) + request = request_list[i] + response = tokenizer.decode(output.outputs[0].token_ids, True) + if output.finished or response.endswith( + '\n') or len(response) > 0 and _is_chinese_char( + ord(response[-1])): + print_idx_list[i] = len(response) + else: + print_idx_list[i] = max( + response.rfind(' ') + 1, print_idx_list[i]) + # avoid printing incomplete words + safe_response = response[:print_idx_list[i]] + query = request['query'] + history = request['history'] + if resp_list[i] is None: + history.append(None) + history[-1] = (query, safe_response) + resp_list[i] = {'response': safe_response, 'history': history} + if output.finished: + prog_bar.update() + yield resp_list + + +def inference_vllm(llm_engine: LLMEngine, + template: Template, + request_list: List[Dict[str, Any]], + *, + generation_config: Optional[VllmGenerationConfig] = None, + use_tqdm: bool = False, + verbose: bool = False, + prompt_prefix: str = '[PROMPT]', + output_prefix: str = '[OUTPUT]') -> List[Dict[str, Any]]: + """ + request_list: e.g. [{'query': 'hello!'}]. + The keys that can be included are: 'query', 'history', 'system'. + generation_config: Priority: generation_config > model.generation_config. + return: e.g. [{'response': 'hi!', 'history': [('hello!', 'hi!')]}]. + The keys to be included will be: 'response', 'history'. 
+ """ + if generation_config is None: + generation_config = getattr(llm_engine, 'generation_config', + VllmGenerationConfig()) + assert isinstance(generation_config, VllmGenerationConfig) + request_list = deepcopy(request_list) + generation_config = deepcopy(generation_config) + for i, request in enumerate(request_list): + history = request.get('history', None) + if history is None: + history = [] + request['history'] = history + inputs = template.encode(request) + input_ids = inputs['input_ids'] + tokenizer = template.tokenizer + if tokenizer.eos_token is not None and tokenizer.eos_token not in generation_config.stop: + generation_config.stop.append(tokenizer.eos_token) + if generation_config.max_new_tokens is not None: + generation_config.max_length = generation_config.max_new_tokens + len( + input_ids) + llm_engine.add_request(str(i), None, generation_config, input_ids) + + batch_size = len(request_list) + if use_tqdm is True: + assert verbose is False + prog_bar = tqdm(total=batch_size, dynamic_ncols=True, disable=not use_tqdm) + outputs = [] + while llm_engine.has_unfinished_requests(): + step_outputs = llm_engine.step() + for output in step_outputs: + if output.finished: + outputs.append(output) + prog_bar.update() + + resp_list = [None] * batch_size + for output in outputs: + i = int(output.request_id) + request = request_list[i] + response = tokenizer.decode(output.outputs[0].token_ids, True) + query = request['query'] + history = request['history'] + history.append((query, response)) + resp_list[i] = {'response': response, 'history': history} + if verbose: + print( + f'{prompt_prefix}{tokenizer.decode(output.prompt_token_ids, False)}{output_prefix}', + end='') + print(tokenizer.decode(output.outputs[0].token_ids, False)) + return resp_list + + +def prepare_vllm_engine_template( + args: InferArguments) -> Tuple[LLMEngine, Template]: + logger.info(f'args: {args}') + logger.info(f'device_count: {torch.cuda.device_count()}') + seed_everything(args.seed) + + assert args.quantization_bit == 0, 'not support bnb' + assert args.sft_type == 'full', 'you need to merge lora' + # Loading Model and Tokenizer + kwargs = {} + if args.sft_type == 'full' and args.ckpt_dir is not None: + kwargs['model_dir'] = args.ckpt_dir + elif args.model_cache_dir is not None: + kwargs['model_dir'] = args.model_cache_dir + llm_engine = get_vllm_engine( + args.model_type, + args.torch_dtype, + gpu_memory_utilization=args.gpu_memory_utilization, + tensor_parallel_size=args.tensor_parallel_size, + **kwargs) + tokenizer = llm_engine.tokenizer + logger.info(f'model_config: {llm_engine.model_config.hf_config}') + if not args.do_sample: + args.temperature = 0 + generation_config = VllmGenerationConfig( + max_new_tokens=args.max_new_tokens, + temperature=args.temperature, + top_k=args.top_k, + top_p=args.top_p, + repetition_penalty=args.repetition_penalty, + stop=[tokenizer.eos_token]) + logger.info(f'generation_config: {generation_config}') + llm_engine.generation_config = generation_config + template: Template = get_template(args.template_type, tokenizer, + args.system, args.max_length, + args.truncation_strategy) + args.system = template.default_system + logger.info(f'system: {args.system}') + return llm_engine, template diff --git a/swift/utils/logger.py b/swift/utils/logger.py index 88e224121..5bf21deb0 100644 --- a/swift/utils/logger.py +++ b/swift/utils/logger.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
-import importlib +import importlib.util import logging import os from typing import Optional diff --git a/tests/llm/test_run.py b/tests/llm/test_run.py index 09d2b5c92..ad8bc8fcc 100644 --- a/tests/llm/test_run.py +++ b/tests/llm/test_run.py @@ -25,10 +25,12 @@ class TestRun(unittest.TestCase): def test_basic(self): output_dir = 'output' + quantization_bit_list = [0, 4] if not __name__ == '__main__': output_dir = self.tmp_dir + quantization_bit_list = [4] model_type = ModelType.chatglm3_6b - for quantization_bit in [0, 4]: + for quantization_bit in quantization_bit_list: predict_with_generate = True if quantization_bit == 0: predict_with_generate = False @@ -68,6 +70,10 @@ class TestRun(unittest.TestCase): return losses = [] for tuner_backend in ['swift', 'peft']: + if tuner_backend == 'swift': + bool_var = True + else: + bool_var = False output = sft_main([ '--model_type', ModelType.qwen_7b_chat, '--eval_steps', '5', '--tuner_backend', tuner_backend, '--train_dataset_sample', @@ -81,7 +87,11 @@ class TestRun(unittest.TestCase): torch.cuda.empty_cache() infer_main([ '--ckpt_dir', best_model_checkpoint, '--show_dataset_sample', - '2', '--max_new_tokens', '100', '--use_flash_attn', 'true' + '2', '--max_new_tokens', '100', '--use_flash_attn', + str(bool_var), '--use_vllm', + str(bool_var), '--verbose', + str(bool_var), '--merge_lora_and_save', + str(bool_var) ]) loss = output['log_history'][-1]['train_loss'] losses.append(loss) diff --git a/tests/llm/test_vllm_utils.py b/tests/llm/test_vllm_utils.py new file mode 100644 index 000000000..a2a529e65 --- /dev/null +++ b/tests/llm/test_vllm_utils.py @@ -0,0 +1,35 @@ +import os +import unittest + +import torch + +from swift.llm import * +from swift.utils import lower_bound, seed_everything + +SKPT_TEST = True + + +class TestVllmUtils(unittest.TestCase): + + @unittest.skipIf(SKPT_TEST, 'To avoid citest error: OOM') + def test_inference_vllm(self): + model_type = ModelType.qwen_7b_chat + llm_engine = get_vllm_engine(model_type, torch.float16) + template_type = get_default_template_type(model_type) + template = get_template(template_type, llm_engine.tokenizer) + request_list = [{'query': '浙江的省会在哪?'}, {'query': '你好!'}] + # test inference_vllm + response_list = inference_vllm( + llm_engine, template, request_list, verbose=True) + for response in response_list: + print(response) + + # test inference_stream_vllm + gen = inference_stream_vllm(llm_engine, template, request_list) + for response_list in gen: + print(response_list[0]['response'], response_list[0]['history']) + print(response_list[1]['response'], response_list[1]['history']) + + +if __name__ == '__main__': + unittest.main() diff --git a/tools/merge_lora_weights_to_model.py b/tools/merge_lora_weights_to_model.py index 8dd96f1c5..33b493118 100644 --- a/tools/merge_lora_weights_to_model.py +++ b/tools/merge_lora_weights_to_model.py @@ -1,4 +1,4 @@ -from swift.llm.run import merge_lora_main +from swift.llm import merge_lora_main if __name__ == '__main__': merge_lora_main(replace_if_exists=True)
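Taken together, the utilities introduced by this patch (get_vllm_engine, get_template and inference_vllm) compose into the following end-to-end flow. This is a minimal sketch for orientation, assuming vllm is installed and a model registered with support_vllm=True (for example ModelType.qwen_7b_chat); it mirrors the test code added above in scripts/tests/test_vllm.py/utils.py and tests/llm/test_vllm_utils.py rather than introducing new behaviour:

import os

from swift.llm import (ModelType, get_default_template_type, get_template,
                       get_vllm_engine, inference_vllm)

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

model_type = ModelType.qwen_7b_chat  # any model registered with support_vllm=True
llm_engine = get_vllm_engine(model_type)  # builds the vllm LLMEngine and attaches a VllmGenerationConfig
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.tokenizer)

llm_engine.generation_config.max_new_tokens = 256

request_list = [{'query': 'hello!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")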