Tuning recommendations
The following table summarizes our recommendations for tuning LLMs by using LoRA or QLoRA:
Specification | Recommended | Details |
---|---|---|
GPU memory efficiency | QLoRA | QLoRA has about 75% smaller peak GPU memory usage compared to LoRA. |
Speed | LoRA | LoRA is about 66% faster than QLoRA in terms of tuning speed. |
Cost efficiency | LoRA | While both methods are relatively inexpensive, LoRA is up to 40% less expensive than QLoRA. |
Higher max sequence length | QLoRA | A higher max sequence length increases GPU memory consumption. Because QLoRA uses less GPU memory, it can support higher max sequence lengths. |
Accuracy improvement | Same | Both methods offer similar accuracy improvements. |
Higher batch size | QLoRA | QLoRA supports much higher batch sizes. For example, when tuning openLLaMA-7B on the same GPU, QLoRA supports a substantially higher batch size than LoRA. |
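
As a minimal sketch of where these differences come from: both methods attach the same low-rank adapters, and QLoRA additionally quantizes the frozen base weights to 4-bit, which is what lowers peak GPU memory (and enables larger batch sizes and longer sequence lengths) at some cost in tuning speed. The example below assumes the open-source Hugging Face transformers, peft, and bitsandbytes libraries; the model name and adapter hyperparameters are illustrative assumptions, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_NAME = "openlm-research/open_llama_7b"  # illustrative base model

def adapter_config() -> LoraConfig:
    # The same low-rank adapter settings are used for both LoRA and QLoRA;
    # only the precision of the frozen base weights differs.
    return LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )

def load_lora_model():
    # LoRA: frozen base weights stay in 16-bit; adapters are trained on top.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.bfloat16
    )
    return get_peft_model(model, adapter_config())

def load_qlora_model():
    # QLoRA: frozen base weights are quantized to 4-bit (NF4), which cuts
    # peak GPU memory but makes each tuning step slower.
    quantization = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, quantization_config=quantization
    )
    return get_peft_model(model, adapter_config())
```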