Sortify: designing ML for the real world
April 2025 · 5 min read
The CV model achieved 94% accuracy in testing. At the pilot site, it was initially wrong about 30% of the time. This is the gap between benchmark accuracy and deployed performance — and it’s where real ML engineering happens.
The lighting problem
TrashNet, our primary training dataset, was collected in controlled indoor settings: consistent white-balance lighting, items placed on a clean surface, well-centred in frame. Real bins in Kampala are outside.
At 7am, a bin in shade looks very different from the same bin at 2pm in direct equatorial sunlight. Our classifier failed spectacularly on under-exposed frames — organic waste in dim light looked like e-waste in everything but colour.
The fix was a combination of:
Augmentation at training time, using torchvision's ColorJitter:

from torchvision import transforms

transforms.ColorJitter(
    brightness=0.5,  # Simulate overexposure / shadow
    contrast=0.4,
    saturation=0.3,
    hue=0.1,
)

CLAHE preprocessing at inference time. Contrast Limited Adaptive Histogram Equalisation normalises local contrast before the image hits the model. It is a 3-line OpenCV addition that moved outdoor accuracy from ~68% to 89% before any retraining.
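
import cv2
import numpy as np

def enhance_image(img_array):
    # Expects an 8-bit RGB image. Work in LAB so that only lightness is
    # equalised and the colour channels are left untouched.
    lab = cv2.cvtColor(img_array, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    # Equalise contrast per 8x8 tile; clipLimit caps noise amplification.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l = clahe.apply(l)
    return cv2.cvtColor(cv2.merge([l, a, b]), cv2.COLOR_LAB2RGB)

The model size problem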
My first model was a fine-tuned ResNet-50: 98MB, and 12 seconds per inference on a Raspberry Pi 4. Completely unusable: bins would be overflowing before a reading was confirmed.
I switched to MobileNetV3-Small, exported to ONNX, and ran it with ONNX Runtime:
- Model size: 3.8MB
- Inference time: 1.8 seconds on Pi 4 (including preprocessing)
- Accuracy drop: 3 percentage points (94% → 91% on the test set, recovered to 94% after domain-specific augmentation)
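For readers who have not used ONNX Runtime, inference looks roughly like the sketch below. It assumes the frame has already been resized to the model's input size (224×224 here) and that the model was trained with the standard ImageNet mean and standard deviation. The helper names, the label list, and the model path are illustrative, not taken from the Sortify code.

```python
import numpy as np

# Standard ImageNet normalisation constants (an assumption; use whatever
# statistics the model was actually trained with).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(img_rgb):
    """HxWx3 uint8 RGB frame -> 1x3xHxW float32 tensor."""
    x = img_rgb.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    x = np.transpose(x, (2, 0, 1))[np.newaxis, ...]
    return np.ascontiguousarray(x, dtype=np.float32)


def load_session(model_path):
    """Create a CPU-only ONNX Runtime session (imported lazily)."""
    import onnxruntime as ort
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def classify(session, img_rgb, labels):
    """Run one frame through the session and return the top label."""
    input_name = session.get_inputs()[0].name
    logits = session.run(None, {input_name: preprocess(img_rgb)})[0][0]
    return labels[int(np.argmax(logits))]
```

On the device this would be called as `classify(load_session("model.onnx"), frame, labels)`, with the session created once at start-up rather than per frame.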
The takeaway: start with the deployment constraint, then choose your model architecture. I wasted a week training ResNet variants before asking “how fast does this need to be on what hardware?”
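One way to make that question concrete is to time every candidate on the target device before committing to any training. Below is a minimal sketch using only the standard library. `measure_latency` and `fits_budget` are hypothetical helpers, and `predict` stands for any callable that runs one inference.

```python
import statistics
import time


def measure_latency(predict, sample, warmup=2, runs=10):
    """Return the median wall-clock seconds for one predict(sample) call."""
    for _ in range(warmup):
        predict(sample)  # warm caches and any lazy initialisation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fits_budget(predict, sample, budget_s):
    """True if the median latency is within the deployment budget."""
    return measure_latency(predict, sample) <= budget_s
```

Running this against an untrained model of each architecture is enough to rule out the ones that cannot meet the budget, since latency does not depend on the learned weights.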
The edge cases nobody warns you about
Bin bags. About 40% of waste is placed in opaque black bags, making classification impossible from the outside. We added a “bagged” class that triggers a different alert (request manual check) rather than a misclassification.
Partial fills. A bin that’s 20% full of organic waste with a 5% layer of plastic contamination on top looks like “plastic” to the camera. We addressed this by only using the camera for classification on fresh additions (detected by a PIR motion sensor trigger), not periodic sweeps.
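The trigger logic can be sketched as a small state holder that asks for exactly one capture per motion event. This is an illustration under assumptions, not the Sortify firmware: the class name and the settle delay (waiting until the hand or lid is out of frame) are invented for the example.

```python
class FreshAdditionTrigger:
    """Requests one capture per motion event, after the scene has settled."""

    def __init__(self, settle_s=2.0):
        self.settle_s = settle_s  # hypothetical delay; tune on real hardware
        self._last_motion = None

    def on_motion(self, now):
        """Call from the PIR handler; `now` is a timestamp in seconds."""
        self._last_motion = now

    def should_capture(self, now):
        """True once per addition, after settle_s has passed with no new motion."""
        if self._last_motion is None:
            return False
        if now - self._last_motion < self.settle_s:
            return False
        self._last_motion = None  # consume the event so later polls stay idle
        return True
```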
Night. We added an IR LED ring. Infrared images required a separate normalisation pass — the colour model didn’t work on effectively greyscale IR frames. We ended up maintaining two lightweight models: one for daylight, one for IR, switched by a light sensor reading.
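The switch itself can be very small. The sketch below is an assumption about how such a selector might look, not the Sortify implementation: the `ModelSwitch` name and the lux thresholds are invented. It uses two thresholds rather than one (hysteresis) so the system does not flip between models at dusk, when the reading hovers around a single cut-off.

```python
class ModelSwitch:
    """Chooses 'daylight' or 'ir' from a light-sensor reading, with hysteresis."""

    def __init__(self, to_ir_below=10.0, to_daylight_above=30.0):
        # Hypothetical lux thresholds; derive real ones from sensor logs.
        if to_ir_below >= to_daylight_above:
            raise ValueError("to_ir_below must be less than to_daylight_above")
        self.to_ir_below = to_ir_below
        self.to_daylight_above = to_daylight_above
        self.mode = "daylight"

    def update(self, lux):
        """Feed one sensor reading and return the model to use."""
        if self.mode == "daylight" and lux < self.to_ir_below:
            self.mode = "ir"
        elif self.mode == "ir" and lux > self.to_daylight_above:
            self.mode = "daylight"
        return self.mode
```

Readings between the two thresholds keep whichever model is already active.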
What I’d do differently
Collect field data first, train second. We collected ~300 field photos after the model was already trained. The right approach is to deploy a dumb camera, collect 1000+ real images, label them, and then train — even if it delays the model by two weeks.
Test the hardware before you design the model. Pi 4 with ONNX Runtime was fine. An older Pi 3B+ we had available was not. Hardware decisions should precede model architecture decisions.
Confidence thresholds over hard classifications. Rather than outputting “plastic”, the system should output {"class": "plastic", "confidence": 0.87} and only act on high-confidence predictions. Low-confidence readings queue for human review.
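A minimal sketch of that routing, assuming the model emits raw logits. The 0.8 threshold, the action names, and the `decide` helper are hypothetical. The “bagged” class from earlier always goes to a manual check, regardless of confidence.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, labels, threshold=0.8):
    """Return the top class, its confidence, and what to do with it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    result = {"class": labels[best], "confidence": round(probs[best], 2)}
    if result["class"] == "bagged":
        result["action"] = "manual_check"   # contents not visible
    elif probs[best] >= threshold:
        result["action"] = "accept"
    else:
        result["action"] = "human_review"   # queue low-confidence readings
    return result
```

The threshold is worth setting from field data: plot accuracy against confidence on real images and pick the point where the review queue stays manageable.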
The result
After these iterations, the field accuracy stabilised around 91% — down from 94% on the clean test set, but acceptable for the use case. The 9% that the model gets wrong is flagged for manual review, not silently misclassified.
This is what real-world ML looks like: not a single clean accuracy number, but a system designed around the distribution of failures and a process for handling them.
Full architecture and code in the Sortify case study.