Fine-tuning performance: Apple Silicon vs. Nvidia
People:
- David
Idea:
- Compare fine-tuning performance on a MacBook (M3 Max), a Mac Studio (M1 Ultra), and an Nvidia RTX 4090, using MLX on the Macs and Unsloth on the 4090
Details:
- Fine-tuned the Phi-3-mini-4k-instruct model
- Followed this Jan 2025 MLX guide for the Apple hardware
- Used the Unsloth library for the Nvidia GPU
- Dataset: 627 examples; training ran for 500 steps
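- M1 Ultra (Mac Studio): ~260-325 tokens/sec
- M3 Max (MacBook): faster, at ~350-420 tokens/sec
- Nvidia RTX 4090 (Unsloth): training completed in ~6.42 minutes; this is total wall-clock time rather than tokens/sec, so it is not directly comparable with the Mac figures (probably room for a lot more optimization?)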
- Peak memory usage was consistently ~8.1 GB
- Had to convert the MLX dataset to Parquet format for the Nvidia GPU run
- Unsloth v1 currently supports only single-GPU configurations
- Experimented with LLaMA-Factory using DeepSpeed for multi-GPU training, but it seems the example dataset above would need to be reformatted first
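How many passes over the 627-example dataset the 500 training steps amount to depends on the batch size, which these notes don't record. A minimal Python sketch, assuming each step consumes exactly one batch (no gradient accumulation); the batch sizes looped over are illustrative assumptions, not the settings used in these runs:

```python
import math


def training_epochs(num_examples: int, steps: int, batch_size: int) -> float:
    """Passes over the dataset made by `steps` optimizer steps.

    Assumes every step consumes exactly `batch_size` examples
    (no gradient accumulation, no partial batches).
    """
    if num_examples <= 0 or batch_size <= 0:
        raise ValueError("num_examples and batch_size must be positive")
    return steps * batch_size / num_examples


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Steps needed to see every example once.

    Assumption: a final partial batch still counts as one step (ceil);
    some training loops drop it instead.
    """
    return math.ceil(num_examples / batch_size)


# 627 examples and 500 steps are from the notes; the batch sizes are hypothetical.
for bs in (1, 2, 4, 8):
    print(
        f"batch_size={bs}: {training_epochs(627, 500, bs):.2f} epochs, "
        f"{steps_per_epoch(627, bs)} steps/epoch"
    )
```

With an assumed batch size of 4, for example, 500 steps works out to 2000 / 627 ≈ 3.19 epochs.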
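The two Mac throughput figures are ranges, so the M3 Max's advantage over the M1 Ultra is itself a range. A small Python sketch that computes worst-case, best-case, and midpoint ratios from the numbers in the notes. It also includes a hypothetical `tokens_per_sec` helper for putting the 4090's wall-clock time on the same scale, but that needs the total number of trained tokens, which the notes don't give:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputRange:
    """Observed tokens/sec range for one machine."""
    low: float
    high: float

    @property
    def midpoint(self) -> float:
        return (self.low + self.high) / 2


def speedup_bounds(fast: ThroughputRange, slow: ThroughputRange) -> tuple[float, float]:
    """Worst-case and best-case speedup of `fast` over `slow`."""
    # Worst case: fast machine at its low end vs slow machine at its high end.
    # Best case: the opposite pairing.
    return fast.low / slow.high, fast.high / slow.low


def tokens_per_sec(total_tokens: float, minutes: float) -> float:
    """Average throughput for a run that processed `total_tokens` in `minutes`."""
    return total_tokens / (minutes * 60)


m1_ultra = ThroughputRange(260, 325)  # Mac Studio, MLX (from the notes)
m3_max = ThroughputRange(350, 420)    # MacBook, MLX (from the notes)

worst, best = speedup_bounds(m3_max, m1_ultra)
print(f"M3 Max over M1 Ultra: {worst:.2f}x to {best:.2f}x "
      f"(midpoint ratio {m3_max.midpoint / m1_ultra.midpoint:.2f}x)")
```

With the ranges above this prints a worst case of about 1.08x, a best case of about 1.62x, and a midpoint ratio of about 1.32x. For the 4090, `tokens_per_sec(total_tokens, 6.42)` would give a roughly comparable figure once the token count of the Unsloth run is known.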
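On the LLaMA-Factory reformatting: one possible approach is to map each MLX-style JSONL record to the alpaca-style `instruction` / `input` / `output` fields used by LLaMA-Factory's bundled example datasets. A stdlib-only Python sketch; it assumes the MLX data uses either the `prompt`/`completion` layout or the `messages` chat layout (the guide's actual layout isn't recorded here), and the output file would still need to be registered in LLaMA-Factory's `dataset_info.json`, so check its data documentation for the exact fields expected:

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def mlx_to_alpaca(records: Iterable[dict]) -> Iterator[dict]:
    """Map MLX-style training records to alpaca-style records.

    Supported input layouts (an assumption about the source data):
      {"prompt": ..., "completion": ...}
      {"messages": [{"role": ..., "content": ...}, ...]}
    """
    for rec in records:
        if "prompt" in rec and "completion" in rec:
            yield {"instruction": rec["prompt"], "input": "", "output": rec["completion"]}
        elif "messages" in rec:
            # Simplification: keep only the final user turn and the final
            # assistant turn; system prompts and earlier turns are dropped.
            users = [m["content"] for m in rec["messages"] if m["role"] == "user"]
            replies = [m["content"] for m in rec["messages"] if m["role"] == "assistant"]
            if not users or not replies:
                raise ValueError("chat record needs a user and an assistant turn")
            yield {"instruction": users[-1], "input": "", "output": replies[-1]}
        else:
            raise ValueError(f"unrecognized record keys: {sorted(rec)}")


def convert_jsonl(src: Path, dst: Path) -> int:
    """Convert an MLX JSONL file to a JSON array file; return the record count."""
    with src.open(encoding="utf-8") as f:
        # Skip blank lines so trailing newlines don't break json.loads.
        records = [json.loads(line) for line in f if line.strip()]
    converted = list(mlx_to_alpaca(records))
    dst.write_text(json.dumps(converted, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(converted)
```

For example, `convert_jsonl(Path("data/train.jsonl"), Path("train_alpaca.json"))`, where both paths are hypothetical.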