I compiled my fine-tuned YOLO11-L model with DeGirum, but the model’s behavior has become abnormal: it worked correctly in a CUDA environment, yet after conversion to an HEF file it started producing an excessive number of detections.
My hardware setup uses Hailo-8. During compilation, I set the input image width and height to 640, and configured the runtime and device as HAILORT and HAILO8, respectively. For calibration, I randomly selected 100 images from the training dataset.
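For context, the calibration selection amounts to something like this sketch (the directory paths are placeholders, not my actual dataset layout):

```python
# Sketch: randomly sample 100 calibration images from the training set.
import random
import shutil
from pathlib import Path

train_dir = Path("dataset/train/images")   # placeholder path to training images
calib_dir = Path("calibration_images")     # folder handed to the compiler for calibration
calib_dir.mkdir(exist_ok=True)

images = sorted(train_dir.glob("*.jpg"))
random.seed(0)                             # make the random selection reproducible
for img in random.sample(images, 100):
    shutil.copy(img, calib_dir / img.name)
```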
The JSON file generated after compilation looked correct, so I expected the HEF model to run normally. However, when I ran the real-time detection module with a Raspberry Pi camera, the screen was filled with a large number of random bounding boxes.
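For reference, a minimal sketch of such a real-time loop with the DeGirum PySDK (the zoo path, model name, and camera source are placeholders, and my actual module differs in details, e.g. the Pi camera may be read through picamera2 instead of OpenCV):

```python
# Simplified sketch of a real-time detection loop on Hailo-8 via the DeGirum PySDK.
# Zoo path, model name, and camera index are placeholders.
import cv2
import degirum as dg

zoo = dg.connect(dg.LOCAL, "path/to/local/model/zoo")  # local HailoRT inference
model = zoo.load_model("yolo11l_custom")               # placeholder model name

cap = cv2.VideoCapture(0)                              # camera source (placeholder)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model.predict(frame)                      # run the compiled HEF model
    cv2.imshow("detections", result.image_overlay)     # frame with boxes drawn
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```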
I would greatly appreciate your technical support and guidance on this matter.
I will also attach the following files for reference:
An image showing the corrupted detections after converting to HEF
We reproduced the behavior: the model shows the same failures under OpenVINO INT8 (while FP32 works), which indicates the network itself is highly sensitive to quantization rather than this being a Hailo-specific issue. Note that OpenVINO INT8 quantization is less restrictive than Hailo’s full-INT8 flow, so sensitivity here is a strong signal.
Recommendations
Prefer a smaller YOLO variant (larger models tend to be more quantization-sensitive).
Retrain with ReLU6 activations (e.g., yolov8l_relu6) to reduce quantization loss; see the sketch after this list.
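For the ReLU6 retraining, a minimal sketch with the Ultralytics API (this assumes a copy of the stock large-model YAML saved with an added activation override; the file names, dataset, and training arguments are placeholders):

```python
# Sketch: retrain with ReLU6 activations via Ultralytics.
# Assumes a copy of the stock model YAML saved as yolov8l_relu6.yaml
# with one extra top-level line:
#   activation: nn.ReLU6()
from ultralytics import YOLO

model = YOLO("yolov8l_relu6.yaml").load("yolov8l.pt")  # build with ReLU6, transfer pretrained weights
model.train(data="your_dataset.yaml", imgsz=640, epochs=100)  # placeholder training arguments
```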
So, does this mean that with the current model there’s no way to solve the issue?
Do you have any recommended model sizes? For example, would the M size work, and have you seen any cases where it was used successfully?
But one thing I’m curious about: the base YOLO11-L model quantized and ran well on Hailo-8, and its performance was good. Why is it that once we fine-tune the model, quantization doesn’t work as well?
This behavior is very dependent on the dataset, and there is no easy rule to predict for which models it will happen. Do you know the size of the training data for this model, and how many epochs you trained it for?
Sorry, by dataset size I meant the number of images. So you started from the COCO-pretrained YOLO11-L weights and ran 3 epochs of training? What does “repeated 3 times” mean?
Sorry for the confusion. The dataset contains about 700,000 images in total.
And for training, you can consider it as having run for a total of 9 epochs.
To be precise, the dataset was stored on Google Drive and I trained in a Colab environment. Since I couldn’t upload the entire 700 GB dataset at once, I split it into 100 GB chunks for training. So the process was: upload 100 GB → train for 3 epochs → upload the next 100 GB → train again → … → repeat this process over the entire dataset 3 times.
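In code terms, the loop was roughly the following sketch (assuming the Ultralytics API; the chunk and checkpoint paths are placeholders, and each round continued from the weights of the previous run):

```python
# Sketch of the chunked Colab training loop described above (placeholder paths).
from pathlib import Path
from ultralytics import YOLO

chunk_yamls = sorted(Path("chunks").glob("chunk*.yaml"))  # one data YAML per 100 GB chunk
weights = "yolo11l.pt"                                    # start from COCO-pretrained weights

for _ in range(3):                                        # repeat over the entire dataset 3 times
    for chunk in chunk_yamls:
        model = YOLO(weights)
        model.train(data=str(chunk), epochs=3, imgsz=640)
        weights = "runs/detect/train/weights/last.pt"     # placeholder: checkpoint of the run that just finished
```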
Could training in this way have affected the quantization results?
This is mostly a matter of dataset size and overfitting. The default training hyperparameters for yolov8n.pt and yolov8l.pt differ in their augmentation settings precisely so that the larger model does not overfit.
The yolov8l and yolo11l defaults use more aggressive augmentation:
--scale 0.9 --mixup 0.15 --copy-paste 0.3
vs. the yolov8n and yolov8s augmentation:
--scale 0.5 --mixup 0 --copy-paste 0
Using the stronger augmentation will help with overfitting and generalization, which in turn also helps with quantization sensitivity.
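For example, a sketch of passing those large-model augmentation values explicitly when fine-tuning with the Ultralytics API (the dataset YAML and epoch count are placeholders):

```python
# Sketch: fine-tune with the more aggressive large-model augmentation settings.
from ultralytics import YOLO

model = YOLO("yolo11l.pt")
model.train(
    data="your_dataset.yaml",  # placeholder dataset config
    imgsz=640,
    epochs=100,                # placeholder epoch count
    scale=0.9,                 # large-model default
    mixup=0.15,                # large-model default
    copy_paste=0.3,            # large-model default
)
```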
My suggestions are:
Train a smaller model (S or M).
Train with ReLU6.
When training larger models, use more aggressive augmentation.