Sortify: designing ML for the real world

April 2025 · 5 min read


The CV model achieved 94% accuracy in testing. At the pilot site, it was initially wrong about 30% of the time. This is the gap between benchmark accuracy and deployed performance — and it’s where real ML engineering happens.

The lighting problem

TrashNet, our primary training dataset, was collected in controlled indoor settings: consistent white-balance lighting, items placed on a clean surface, well-centred in frame. Real bins in Kampala are outside.

At 7am, a bin in shade looks very different from the same bin at 2pm in direct equatorial sunlight. Our classifier failed spectacularly on under-exposed frames — organic waste in dim light looked like e-waste in everything but colour.

The fix was a combination of:

Augmentation at training time:

from torchvision import transforms

transforms.ColorJitter(
    brightness=0.5,   # Simulate overexposure / shadow
    contrast=0.4,
    saturation=0.3,
    hue=0.1
)
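
To make those numbers concrete: as I understand torchvision's ColorJitter, the brightness, contrast and saturation factors are drawn uniformly from [max(0, 1 - v), 1 + v], and the hue shift from [-v, v]. The helper below is only an illustrative sketch (`jitter_ranges` is a hypothetical name, not part of torchvision) that prints the intervals implied by the settings above.

```python
def jitter_ranges(brightness, contrast, saturation, hue):
    """Intervals each ColorJitter factor is sampled from (scalar arguments only)."""
    def around_one(v):
        # Multiplicative factors are centred on 1.0 and clipped at 0.
        return (max(0.0, 1.0 - v), 1.0 + v)

    return {
        "brightness": around_one(brightness),
        "contrast": around_one(contrast),
        "saturation": around_one(saturation),
        "hue": (-hue, hue),  # hue is an additive shift, not a multiplier
    }


if __name__ == "__main__":
    for name, (low, high) in jitter_ranges(0.5, 0.4, 0.3, 0.1).items():
        print(f"{name}: {low:.1f} to {high:.1f}")
```

So brightness=0.5 covers frames from half as bright to 50% brighter than the original, which is the spread intended to stand in for shade and direct sun.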

CLAHE preprocessing at inference time (Contrast Limited Adaptive Histogram Equalisation): this normalises local contrast before the image hits the model. A few lines of OpenCV moved outdoor accuracy from ~68% to 89% before any retraining.

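import cv2

def enhance_image(img_array):
    """Apply CLAHE to the lightness channel of an 8-bit RGB image.

    Expects RGB; frames loaded with cv2.imread are BGR and need converting first.
    """
    # Work in LAB so only lightness (L) is equalised and colour is left alone.
    lab = cv2.cvtColor(img_array, cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l = clahe.apply(l)
    return cv2.cvtColor(cv2.merge([l, a, b]), cv2.COLOR_LAB2RGB)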

The model size problem

My first model was a fine-tuned ResNet-50: 98MB, and 12 seconds per inference on a Raspberry Pi 4. Completely unusable; bins would be overflowing before a reading was confirmed.

I switched to MobileNetV3-Small, exported to ONNX, and ran it with ONNX Runtime:

  • Model size: 3.8MB
  • Inference time: 1.8 seconds on Pi 4 (including preprocessing)
  • Accuracy drop: 3 percentage points (94% → 91% on the test set, recovered to 94% after domain-specific augmentation)

The takeaway: start with the deployment constraint, then choose your model architecture. I wasted a week training ResNet variants before asking “how fast does this need to be on what hardware?”
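
One way to put that question first is to write the latency budget down as a check that every candidate model has to pass on the target hardware. The sketch below is a minimal, stdlib-only harness. `measure_latency`, `fits_budget` and the `predict` callable are hypothetical names, and the 2-second budget is an assumed figure for illustration, not a number from the project.

```python
import statistics
import time


def measure_latency(predict, sample, warmup=2, runs=10):
    """Median wall-clock seconds for predict(sample), after a few warm-up calls."""
    for _ in range(warmup):
        predict(sample)  # warm caches / lazy initialisation before timing
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fits_budget(predict, sample, budget_s=2.0, **kwargs):
    """True if the median latency is within the (assumed) budget in seconds."""
    return measure_latency(predict, sample, **kwargs) <= budget_s


if __name__ == "__main__":
    # Stand-in for preprocessing + model inference on the target device.
    def fake_predict(image):
        time.sleep(0.01)
        return "plastic"

    print(fits_budget(fake_predict, sample=None, budget_s=2.0))
```

The number only means something if it is measured on the Pi itself, with preprocessing inside `predict`, since that is how the 1.8 seconds above was counted.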

The edge cases nobody warns you about

Bin bags. About 40% of waste is placed in opaque black bags, making classification impossible from the outside. We added a “bagged” class that triggers a different alert (request manual check) rather than a misclassification.

Partial fills. A bin that’s 20% full of organic waste with 5% plastic contamination on top looks like “plastic” to the camera. We addressed this by running classification only on fresh additions (triggered by a PIR motion sensor), not on periodic sweeps.
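
The trigger logic is simple enough to sketch. This is an illustration under assumptions, not the deployed code: `FreshAdditionTrigger` and its `settle_s` delay (how long to wait after motion stops, so the hand is out of frame) are hypothetical, and the motion flag would come from whatever GPIO library reads the PIR sensor.

```python
class FreshAdditionTrigger:
    """Fire one capture per deposit: motion is seen, then the bin stays quiet for settle_s."""

    def __init__(self, settle_s=2.0):
        self.settle_s = settle_s  # assumed settling delay in seconds
        self._last_motion = None  # timestamp of the most recent motion reading

    def update(self, motion, now):
        """Feed one PIR reading; return True when the camera should capture."""
        if motion:
            self._last_motion = now
            return False  # someone is still at the bin
        if self._last_motion is not None and now - self._last_motion >= self.settle_s:
            self._last_motion = None  # reset so idle polling never triggers a capture
            return True
        return False


if __name__ == "__main__":
    trigger = FreshAdditionTrigger(settle_s=2.0)
    readings = [(0.0, False), (1.0, True), (2.0, False), (3.5, False), (10.0, False)]
    for t, motion in readings:
        print(t, trigger.update(motion, now=t))
```

Because the state resets after each capture, a bin that nobody touches never produces a frame, which is the point of dropping periodic sweeps.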

Night. We added an IR LED ring. Infrared images required a separate normalisation pass — the colour model didn’t work on effectively greyscale IR frames. We ended up maintaining two lightweight models: one for daylight, one for IR, switched by a light sensor reading.
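
A sketch of the switch, with one addition that is not described above: two thresholds instead of one (hysteresis), so the choice doesn't flap between models at dusk. Everything named here is an assumption for illustration: the `ModelSelector` class, the lux values, and the "daylight" / "ir" labels standing in for the two loaded models.

```python
class ModelSelector:
    """Pick 'daylight' or 'ir' from a light reading, with hysteresis around dusk."""

    def __init__(self, to_ir_below=10.0, to_daylight_above=50.0):
        # Assumed thresholds in lux; tune against the actual sensor and site.
        if to_ir_below >= to_daylight_above:
            raise ValueError("to_ir_below must be lower than to_daylight_above")
        self.to_ir_below = to_ir_below
        self.to_daylight_above = to_daylight_above
        self.mode = "daylight"

    def update(self, lux):
        """Return the model to use for the current light level."""
        if self.mode == "daylight" and lux < self.to_ir_below:
            self.mode = "ir"
        elif self.mode == "ir" and lux > self.to_daylight_above:
            self.mode = "daylight"
        return self.mode


if __name__ == "__main__":
    selector = ModelSelector()
    for lux in [300.0, 30.0, 5.0, 30.0, 80.0]:
        print(lux, selector.update(lux))
```

Readings between the two thresholds keep whichever model is already active, so a passing cloud or a torch beam doesn't cause a swap mid-evening.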

What I’d do differently

Collect field data first, train second. We collected ~300 field photos after the model was already trained. The right approach is to deploy a dumb camera, collect 1000+ real images, label them, and then train — even if it delays the model by two weeks.

Test the hardware before you design the model. Pi 4 with ONNX Runtime was fine. An older Pi 3B+ we had available was not. Hardware decisions should precede model architecture decisions.

Confidence thresholds over hard classifications. Rather than outputting “plastic”, the system should output {"class": "plastic", "confidence": 0.87} and only act on high-confidence predictions. Low-confidence readings queue for human review.
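
As a sketch of what that could look like, the function below takes a prediction in the shape shown above and decides whether to act on it or queue it. The 0.8 cut-off and the `route_prediction` name are placeholders; a real threshold would have to come from looking at field errors.

```python
def route_prediction(prediction, threshold=0.8):
    """Return 'act' for confident predictions and 'review' for everything else."""
    confidence = prediction.get("confidence")
    # Treat missing or out-of-range confidence as untrustworthy rather than guessing.
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        return "review"
    return "act" if confidence >= threshold else "review"


if __name__ == "__main__":
    print(route_prediction({"class": "plastic", "confidence": 0.87}))
    print(route_prediction({"class": "organic", "confidence": 0.55}))
```

Defaulting to review when the confidence is missing or malformed keeps a bug upstream from turning into a silent action downstream.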

The result

After these iterations, the field accuracy stabilised around 91% — down from 94% on the clean test set, but acceptable for the use case. The 9% that the model gets wrong is flagged for manual review, not silently misclassified.

This is what real-world ML looks like: not a single clean accuracy number, but a system designed around the distribution of failures and a process for handling them.


Full architecture and code in the Sortify case study.
