Callisto :: Bugs and Bunnies -- Jay Lyerly

In my last post, I looked at how to install TensorFlow optimized for Apple Silicon. This time around, I’ll explore Apple Silicon support in PyTorch, another wildly popular library for machine learning.

Setting up Callisto for PyTorch is easy! The suggested pip command is

pip install torch torchvision torchaudio

We can do that directly in the Callisto package manager. Remember, you can install multiple packages at once by entering a space-separated list, so paste torch torchvision torchaudio into the install field and away we go!

I was looking for a little example to run that compares the performance of PyTorch on the Apple Silicon CPU with its performance on the GPU. To be quite honest, it was difficult to find a straightforward example. Fortunately, I ran across this notebook by Daniel Bourke. Daniel works through an example that trains a model on both the CPU device and the MPS device. MPS is the Metal Performance Shaders backend, which uses Apple’s Metal framework to harness the power of the M1’s graphics hardware. In this example, he creates a Convolutional Neural Network (CNN) for image classification and compares the performance of the CPU and MPS backends.

The bottom line? On this benchmark, MPS is more than 10x faster than the CPU. In Daniel’s posted notebook, he saw a speedup of around 10.6x. On my machine, I saw a performance increase of about 11.1x. The best thing about optimization in PyTorch is that it takes very little extra work. The MPS backend ships with the standard PyTorch build for Mac, so once you place your model and tensors on the mps device, you get the performance boost.
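Here’s a minimal sketch of what opting in to MPS can look like. The helper names pick_device and time_matmul are my own for illustration and are not from Daniel’s notebook, and a matrix multiply is only a rough stand-in for a real training loop. The guarded import is there so the cell still runs if torch hasn’t been installed yet.

```python
import time

try:
    import torch
except ImportError:  # assumption: torch may not be installed yet
    torch = None


def pick_device():
    """Return "mps" when the Metal backend is usable, otherwise "cpu"."""
    if torch is None:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def time_matmul(device, size=512, repeats=3):
    """Wall-clock seconds for a few square matrix products on one device."""
    x = torch.rand(size, size, device=device)
    start = time.perf_counter()
    for _ in range(repeats):
        y = x @ x
    # .item() copies a value back to the host, which forces the device
    # to finish its queued work before we stop the clock.
    y.sum().item()
    return time.perf_counter() - start


device = pick_device()
print(f"Using device: {device}")

if torch is not None:
    print(f"cpu: {time_matmul('cpu'):.4f}s")
    if device != "cpu":
        print(f"{device}: {time_matmul(device):.4f}s")
```

In a real model, the same idea applies: create the device once and move both the model and each batch to it with .to(device). The first run on a GPU device often includes warm-up cost, so treat a single timing as a rough guide only.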

In addition to TensorFlow and PyTorch, I checked some other popular Python ML libraries to see how they take advantage of Apple Silicon. While some libraries have chosen not to pursue Apple Silicon specific optimization, all of them run correctly in CPU mode.

Keras
- Built on TensorFlow, Keras should show significant performance improvements when you use an optimized version of TensorFlow
FastAI
- Built on PyTorch, fastai should show significant performance improvements when you use an optimized version of PyTorch
Scikit-learn
- To avoid the management overhead and complexity, scikit-learn doesn’t support GPU acceleration
Numpy
- It may be possible to improve performance in numpy by compiling it against an optimized BLAS library which uses Apple’s Accelerate framework. The Accelerate framework provides high performance, vector optimized mathematical functions which are tuned for Apple Silicon. This is a bit involved and will require more research to see what impact this can have.
XGBoost
- XGBoost seems to be focused on GPUs that support CUDA for hardware acceleration and currently has no plans to support Apple Silicon.
Numba
- Numba also seems to focus only on CUDA based GPU acceleration
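
For the numpy item above, a first step in that research is simply to see which BLAS library your numpy build is linked against and to take a baseline timing. This is a rough sketch, not a proper benchmark: time_matmul is a name I made up for illustration, and the guarded import only keeps the cell from failing if numpy is missing.

```python
import time

try:
    import numpy as np
except ImportError:  # assumption: numpy may not be installed yet
    np = None


def time_matmul(n=512, repeats=3):
    """Best wall-clock seconds for an n x n float64 matrix product."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # matrix products are dispatched to the linked BLAS library
        best = min(best, time.perf_counter() - start)
    return best


if np is not None:
    # Prints the build configuration; look for "accelerate" or "openblas"
    # in the BLAS/LAPACK sections.
    np.show_config()
    print(f"512 x 512 matmul: {time_matmul():.4f}s")
```

Running the same cell before and after switching BLAS libraries gives a like-for-like number to compare.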

We build Callisto with the mindset that Callisto is the best way to do data science on a Mac. Part of that is helping users get the most out of their Mac hardware by using computational libraries optimized for Apple Silicon chips. TensorFlow is a very popular library for machine learning, so let’s take a look and see what it takes to use an M1 optimized version of TensorFlow with a Jupyter notebook in Callisto.

TensorFlow has a feature called PluggableDevice which lets developers create plugins for different pieces of ML hardware. Conveniently for us, Apple has written a plugin for Metal which is heavily optimized for Apple Silicon devices like the M1 and M2 chips. Now we just have to get it installed.

You should be able to just install the TensorFlow library for the Mac and then the PluggableDevice for Metal, which you’d do with these commands:

pip install tensorflow-macos
pip install tensorflow-metal

With Callisto, you can use our fancy package manager interface and install tensorflow-macos and tensorflow-metal. Unfortunately, other package dependencies mean that pip won’t install the latest tensorflow-macos, version 2.12.0, and instead falls back one version to 2.11.0. On the other hand, pip will install the latest version of tensorflow-metal, but the PluggableDevice interface is a C API and is tightly bound to the TensorFlow version. While both modules install, at runtime there’s a symbol mismatch error and the Metal plugin fails to load.

Cue montage of trying to install several permutations of these two packages.

To jump to the end: as suggested in this post on the Apple Dev Forum, more recent versions seem to have issues, but falling back to tensorflow-macos version 2.9.0 and tensorflow-metal version 0.5.0 works without problems. Pip will install those versions with the following commands:

pip install tensorflow-macos==2.9.0
pip install tensorflow-metal==0.5.0

Don’t forget, you can specify versions using Callisto’s package manager right in the package field by adding the version specifier. Instead of just tensorflow-macos, use tensorflow-macos==2.9.0.
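
Since a version mismatch between the two packages only shows up at runtime, it can help to confirm what actually got installed before importing TensorFlow. Here’s a small sketch using only the standard library. The helper names installed_version and check_pins are my own, and the pinned versions are simply the ones from the pip commands above.

```python
from importlib import metadata

# Pins taken from the pip commands above.
EXPECTED = {"tensorflow-macos": "2.9.0", "tensorflow-metal": "0.5.0"}


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected=EXPECTED):
    """Map each package to (installed_version, matches_pin)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (found, found == wanted)
    return report


for package, (found, ok) in check_pins().items():
    status = "ok" if ok else "MISMATCH"
    print(f"{package}: installed={found} -> {status}")
```

If either line reports a mismatch, reinstall that package with the explicit version specifier before starting the kernel.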

Now that we’re up and running, let’s do some tests! We want to compare running on the CPU alone versus running with the hardware accelerated Metal GPU. Here’s a little bit of code to disable the GPU accelerated device in TensorFlow:

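import tensorflow as tf
print(tf.__version__)

# Set to False to leave the Metal GPU visible to TensorFlow.
disable_gpu = True

if disable_gpu:
    # Hide every GPU so all work runs on the CPU. This must happen
    # before TensorFlow has initialized its devices.
    tf.config.set_visible_devices([], 'GPU')

# Show the devices TensorFlow is allowed to use.
tf.config.get_visible_devices()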

When disable_gpu is true, you should only see one CPU device in the output. When not disabling the GPU, you should see both the CPU and GPU in the output. TensorFlow doesn’t deal well with changing the visibility after the library is up and running, so to switch the state of the GPU, remember to restart your Jupyter kernel.

Now we’re ready to test! First I tried this Quickstart for Beginners from the TF website. Running this example on the CPU, it completed in 7 seconds. With the GPU enabled, it ran in 42 seconds. What, what?! It’s slower using the fancy Metal optimized GPU driver? Yep, turns out that’s right. As noted on Apple’s tensorflow-metal page, the CPU can be faster for small jobs. Well, that’s a little disappointing.

Now if we look at Apple’s example on that same page, it’s got a little more heavy lifting to do. Running that on my M1 CPU, it runs in just under half an hour at 29 minutes and 12 seconds. On the GPU, it blazes through the job in 5 minutes and 10 seconds! Cutting my run time to roughly 1/6 of the original is definitely a solid improvement. That kind of performance boost makes all the installation headaches worth it!
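
For the curious, the speedups are just ratios of the wall-clock times quoted above. This little sketch works them out; the function names are mine, and a value below 1 means the GPU run was slower than the CPU run.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes-and-seconds duration to seconds."""
    return minutes * 60 + seconds


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


# Quickstart for Beginners: 7 s on the CPU, 42 s on the GPU.
quickstart = speedup(7, 42)

# Apple's example: 29 min 12 s on the CPU, 5 min 10 s on the GPU.
apple_example = speedup(to_seconds(29, 12), to_seconds(5, 10))

print(f"Quickstart: {quickstart:.2f}x")        # Quickstart: 0.17x
print(f"Apple example: {apple_example:.2f}x")  # Apple example: 5.65x
```

So the small job ran about 6x slower on the GPU, while the heavier one ran about 5.65x faster.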

With tensorflow-metal on the cusp of a 1.0.0 release, we’re excited to see how we can integrate this into our builds and include this out of the box with Callisto, but until then, these instructions should help shepherd you through a manual install.

Callisto, Jupyter and Mac Optimized Machine Learning – Part 2

Callisto, Jupyter and Mac Optimized Machine Learning