Fine-tuning performance: Apple vs. Nvidia

People:

  • David

Idea:

  • Comparing fine-tuning performance on MacBook M3 Max, Mac Studio M1 Ultra, and Nvidia 4090 using MLX and Unsloth

Details:

  • Tested fine-tuning the Phi-3-mini-4k-instruct model
  • Followed this Jan 2025 MLX guide for Apple hardware
  • Used Unsloth library for Nvidia GPU
  • The dataset had 627 examples; training ran for 500 steps
  • The M1 Ultra achieved ~260-325 tokens/sec
  • The M3 Max MacBook was faster, at ~350-420 tokens/sec
  • The Nvidia 4090 (Unsloth) completed training in about 6.42 minutes (this could probably be optimized a lot more)
  • Peak memory usage was consistently ~8.1 GB
  • Had to convert MLX dataset into Parquet format for Nvidia GPU
  • Unsloth v1 currently supports only single GPU configurations
  • Experimented with LLaMA-Factory using DeepSpeed for multi-GPU training, but it seems the example dataset from the MLX guide would need to be reformatted first
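
The Parquet conversion mentioned in the details can be sketched roughly as follows. This is a minimal sketch, not the script actually used for the test. It assumes the MLX dataset is stored as JSONL (one JSON object per line) and that pandas with a Parquet engine such as pyarrow is installed. The helper names read_jsonl and jsonl_to_parquet are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read one JSON object per non-empty line into a list of dicts."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def jsonl_to_parquet(src, dst):
    """Write the records of a JSONL file to a Parquet file; return the row count."""
    # Imported lazily so read_jsonl works without pandas installed.
    # Assumption: pandas plus a Parquet engine (e.g. pyarrow) is available.
    import pandas as pd

    records = read_jsonl(src)
    pd.DataFrame(records).to_parquet(dst, index=False)
    return len(records)
```

Calling jsonl_to_parquet("train.jsonl", "train.parquet") (file names are placeholders) should produce a file with one row per training example. Whether the column names match what the Nvidia-side training script expects depends on the dataset, so they may still need renaming.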
