It sounds like you’re trying to run a typically very resource-hungry ML model (computer vision) on a very resource-constrained device (any RPi). This has been done plenty of times, but the performance you’ve observed is expected. How resource hungry a model is varies wildly by architecture, but broadly speaking anything computer vision related sits near the top, just under LLMs. This is compounded by the fact that most modern computer vision models are optimized for running on dedicated, power-hungry CUDA GPUs (read: NVIDIA) or TPUs (tensor processing units), not the tiny low-power CPUs in RPis. The RAM on an RPi is also significantly smaller and slower than the VRAM on a dedicated GPU, and NumPy, the underlying Python library that a lot of ML code relies on, was not well optimised for Raspberry Pis last I checked.
TensorFlow is infamous for its lousy installation experience, so I’m not surprised you’ve had issues. There are great alternatives, like PyTorch, that can be easier to wrangle, but I think your bottleneck is still the hardware:
We usually talk about CPU as a compressible resource and RAM as a non-compressible resource. What we mean by this is that if you max out your CPU for a task, it just takes longer (doesn’t crash), whereas if you try to use more RAM than is available (including the page file, etc.) you’ll instantly get an Out Of Memory error (a crash, though sometimes hidden from the end user).
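A minimal sketch of the difference, assuming NumPy is installed (the exact failure mode depends on the OS’s memory settings):

```python
import numpy as np

# CPU is "compressible": on a slow core this just takes longer, it never crashes.
for _ in range(5):
    _ = np.random.rand(500, 500) @ np.random.rand(500, 500)

# RAM is not: asking for far more memory than RAM + page file can supply fails
# outright. Note that on Linux the kernel's OOM killer may simply terminate the
# process instead of letting Python raise MemoryError.
try:
    too_big = np.ones((100_000, 100_000))  # ~80 GB of float64, far beyond any RPi
except MemoryError:
    print("Out of memory - the allocation fails instead of just running slowly")
```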
When you try to install and load TensorFlow, you need enough RAM for Python itself + TensorFlow + all the memory objects created by the packages TensorFlow depends on + the size of the rehydrated model artifact. That may be a tall order for an RPi Zero and could explain why it failed for you. Although, I believe TensorFlow dropped support for ARMv6, so that could also explain it. If it’s the latter, you might be able to get around it by compiling TensorFlow yourself.
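If you want to see where the memory goes, a rough way to measure it is to watch the process’s resident memory as each piece loads. This assumes psutil is installed and uses 'model.h5' as a placeholder for whatever artifact you actually have:

```python
import os

import psutil  # pip install psutil


def rss_mb() -> float:
    """Resident memory of this process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1e6


print(f"Baseline (Python + psutil):      {rss_mb():.0f} MB")

import tensorflow as tf  # the import alone allocates a surprising amount

print(f"After importing TensorFlow:      {rss_mb():.0f} MB")

# Placeholder path - point this at your own model file or SavedModel directory.
model = tf.keras.models.load_model("model.h5")
print(f"After loading the model weights: {rss_mb():.0f} MB")
```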
Since your RPi 3B example works, albeit very slowly, there is at least sufficient RAM, but some of the slowness could still be memory related: if the model uses more memory than the system has physical RAM, it will spill into the page file, which is orders of magnitude slower than regular RAM. That being said, I suspect most of the slowness comes from the throughput of the CPU.
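One quick way to check whether the 3B is dipping into the page file is to watch swap usage while the model runs, e.g. with psutil:

```python
import time

import psutil  # pip install psutil

# Run this alongside your inference script (Ctrl+C to stop); rising swap usage
# while the model is working means it is being paged out to the SD card.
while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM {vm.used / 1e6:.0f}/{vm.total / 1e6:.0f} MB | "
          f"swap {sw.used / 1e6:.0f}/{sw.total / 1e6:.0f} MB")
    time.sleep(2)
```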
Some options:
- Optimise the performance of the existing hardware
  - Check that the CPU is fully (100%) utilised under load and isn’t being thermally throttled; add fans, etc., to maximise throughput. (There’s a quick throttling check sketched after this list.)
  - See if you can get away with disabling the page file.
  - Add an external dedicated GPU (difficult and very expensive).
- Optimise the architecture of the model
  - Check that you are using the latest, production-ready version of the model. The ML landscape moves fast and many optimizations are baked into new releases, so there are usually performance improvements just from updating. Just make sure it’s not a ‘beta’ or ‘experimental’ release; those often have performance regressions that aren’t fixed until they’re promoted to ‘production ready’.
  - Some models can be retrained or tuned to trade off precision and accuracy (result quality) for speed and a smaller memory footprint. Without knowing your model, I can’t say if this is the case. It usually revolves around dropping floating point precision and quantization, and can be quite scientifically involved (there’s a minimal quantization sketch after this list). This may not be an option if you’re just downloading a pre-trained model from somewhere and don’t have the access or knowledge to retrain, tune or modify it. The quality-to-performance trade-off may also not be to your liking (i.e. results become too inaccurate for the use case).
  - Try an equivalent model based on PyTorch. No guarantees this will be better or even supported on RPi.
- Use specialised edge hardware
  - @potatoman mentioned Google’s Coral. There is also NVIDIA’s Jetson. These are typically slightly more expensive than regular RPis but the price-performance trade-off is VERY worthwhile.
- Use remote compute
  - As you mentioned, running inference in the cloud would obviate the need for a specialised, potentially expensive edge device. You could still do all the non-ML/computer-vision work on a tiny, inexpensive device like an RPi Zero and just call out to a cloud endpoint for the vision tasks (roughly the pattern sketched at the end of this list); you could even use the phones people already have as edge devices. This is a very common pattern in industry. Depending on usage (how many vision API calls per second, etc.), it could also be the most cost-effective option: a single cloud node can serve many edge devices, scale automatically to meet spikes in demand, and only charge you for what you use. The main providers are Microsoft Azure, Google Cloud Platform (GCP) and Amazon Web Services (AWS). Pricing is similar enough between them that usability and familiarity are typically the selection criteria, and they all have ‘free tiers’ which, again depending on usage, could make it completely free. There are smaller players, but they all require more work on your part to get going in exchange for slightly cheaper prices. I can vouch for Vertex AI on GCP being a good, friendly experience.
  - Consider a decent gaming PC somehow securely reachable via the internet. Anything with an NVIDIA RTX-series card and 12+ GB of VRAM.
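To make the throttling check concrete, this is roughly what I’d run on the Pi while it’s under load; vcgencmd ships with Raspberry Pi OS and reports under-voltage/thermal throttling as a bit field:

```python
import subprocess

# 'vcgencmd get_throttled' prints something like "throttled=0x50000".
out = subprocess.run(
    ["vcgencmd", "get_throttled"], capture_output=True, text=True, check=True
).stdout.strip()
flags = int(out.split("=")[1], 16)

print(out)
if flags & 0x1:
    print("Under-voltage detected - the power supply may be holding the CPU back.")
if flags & 0x4:
    print("Currently throttled - improve cooling before blaming the model.")
if flags == 0:
    print("No throttling reported; the CPU is giving you everything it has.")
```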
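For the retrain/tune bullet, the lowest-effort version of that trade-off is post-training quantization with TensorFlow’s own TFLite converter. A minimal sketch, assuming you have a SavedModel directory to point it at ('saved_model_dir' is a placeholder):

```python
import tensorflow as tf

# Post-training dynamic-range quantization: shrinks the weights (roughly 4x for
# float32 -> int8) and usually speeds up CPU inference, at some cost in accuracy.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)

# The resulting .tflite file can be run on the Pi with the much lighter
# tflite-runtime package instead of full TensorFlow.
```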
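And for the remote-compute option, the edge-side code can be tiny. Everything here (endpoint URL, payload shape, auth header) is hypothetical; the real request format depends on which provider you pick, but the pattern is the same:

```python
import base64

import requests  # pip install requests

ENDPOINT = "https://example.com/v1/vision:predict"  # hypothetical endpoint
API_KEY = "your-api-key"                            # hypothetical credential

# Grab a frame (here just a file on disk), base64-encode it, and ship it off.
with open("frame.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("ascii")}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # detections, labels, etc., depending on the service
```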
Trying to optimize the hardware and model you’ve already got will likely be the cheapest in the short term, but the speed gains might not even be noticeable; most edge devices just aren’t built for this. If you’re set on having the model running on the edge device, I would switch to the ML-accelerated boards mentioned above. I think running the model remotely is still the best option: a cloud node is more flexible, and you can switch or modify model architectures in one place without having to deploy new models to all your edge devices. With vision, you’d be transmitting a lot of images, so keep an eye (or a cap) on bandwidth costs.
Is there anything you’d like me to go further into?