DeGirum YOLOv11s object detection evaluation: accuracy metrics decreased compared to Hailo Model Zoo results

Hi,

I’d like to share some notes about benchmarking object detection with yolov11s.hef on the Hailo-8:

Setup:

Hardware: UP AI Edge board with a Hailo-8 accelerator.

Benchmark model: yolov11s.hef from the Hailo Model Zoo GitHub repository (pretrained, 80 classes, default release by Hailo).

YOLO model download Path: link.

Dataset: COCO-2017-val (5000 images)

Libraries used: from the DeGirum evaluation guide “Hailo guide: Evaluating model accuracy after compilation”.

Library versions:
Ubuntu: 22.04 (Jammy)
Python: 3.10.19
HailoRT: 4.20.0
Hailo DFC: 3.31.0
Hailo Model Zoo: 2.16.0
DeGirum Tools: 0.22.4

Results after running evaluation:

FPS: 77, vs. the Hailo Model Zoo’s 111 FPS (measured with the model_time_profile module from DeGirum Tools).

mAP 50-95: 38%, vs. the Hailo Model Zoo’s 45.2%.

FPS dropped about 30% compared to the value stated in the Hailo Model Zoo. On investigation, I found the PCIe Gen3 link is running only 2 of the 4 available lanes, which might be part of the reason the FPS is lower.
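As a rough sanity check on the lane count (illustrative arithmetic, not a measurement on this board): PCIe Gen3 signals at 8 GT/s per lane with 128b/130b encoding, so dropping from x4 to x2 halves the theoretical link bandwidth. A minimal sketch:

```python
def pcie_gen3_bandwidth_gbps(lanes: int) -> float:
    """Theoretical one-direction PCIe Gen3 bandwidth in GB/s for a given lane count.

    PCIe Gen3 runs at 8 GT/s per lane and uses 128b/130b encoding,
    so usable bits are 128/130 of the raw transfer rate.
    """
    raw_gtps = 8.0                  # giga-transfers per second per lane (Gen3)
    encoding = 128.0 / 130.0        # 128b/130b line-code efficiency
    usable_gbits = raw_gtps * encoding * lanes  # usable Gbit/s across all lanes
    return usable_gbits / 8.0       # convert to GB/s

print(f"x2: {pcie_gen3_bandwidth_gbps(2):.2f} GB/s")  # ~1.97 GB/s
print(f"x4: {pcie_gen3_bandwidth_gbps(4):.2f} GB/s")  # ~3.94 GB/s
```

Note this is peak link bandwidth only; whether the narrower link actually limits FPS depends on per-frame transfer sizes, latency, and the transfer pattern, not just the peak rate.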

Accuracy also dropped, even though it should be approximately the same as the value stated by the Hailo Model Zoo.

Question 1: Are there any other factors that might cause the FPS to drop during inference?

Question 2: What are the possible causes of the accuracy decrease when evaluating the yolov11s.hef model with DeGirum Tools?

Please help! Thank you.

Hi @LongVu-Tr

Welcome to DeGirum community.

Just to be sure: is the HEF file from the Hailo Model Zoo, or did you compile another yolo11s yourself? Can you please share how you obtained the 111 FPS and 45.2% mAP? Are these just the numbers from the Hailo Model Zoo GitHub page?

Hi @LongVu-Tr

We ran the evaluation of yolo11s and obtained the following results:

yolo11s_coco--640x640_quant_hailort_hailo8_1
[array([0.45358596, 0.62901933, 0.48389093, 0.2788012 , 0.4938611 ,
       0.624764  , 0.35266808, 0.57291141, 0.61078248, 0.43059618,
       0.66046639, 0.774312  ])]

So the mAP of 45.35% is very close to the reported 45.2%. If you share the exact script you used to evaluate the model (along with model JSON), we can see if there is any discrepancy.
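For reference, the 12 numbers above follow the standard pycocotools summary order, so the first entry is the headline mAP@[0.50:0.95]. A quick way to label them (a sketch, using the array printed above):

```python
# The 12-element vector from COCO evaluation, in pycocotools' fixed summary order
coco_stats = [0.45358596, 0.62901933, 0.48389093, 0.2788012, 0.4938611, 0.624764,
              0.35266808, 0.57291141, 0.61078248, 0.43059618, 0.66046639, 0.774312]

labels = ["AP@[0.50:0.95]", "AP@0.50", "AP@0.75", "AP_small", "AP_medium", "AP_large",
          "AR@1", "AR@10", "AR@100", "AR_small", "AR_medium", "AR_large"]

# Pair each metric name with its value for readable reporting
stats = dict(zip(labels, coco_stats))
print(f"mAP@[0.50:0.95] = {stats['AP@[0.50:0.95]']:.4f}")  # the headline mAP
```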

Regarding FPS: it is host-dependent. If you believe PySDK is slower than the Hailo benchmark, please run the hailortcli benchmark command on the HEF file. On our system, we get the following result with PySDK:

degirum 0.19.0: observed FPS = 83.7

With hailortcli, we get the numbers below:

FPS     (hw_only)                 = 83.55
        (streaming)               = 84.4837
Latency (hw)                      = 10.1392 ms
Device 0000:03:00.0:
Power in streaming mode (average) = 1.97945 W
                        (max)     = 1.99689 W

Hi,
Sorry for my late reply.

I confirm that the HEF file is from the Hailo Model Zoo. I re-evaluated the model with hailomz eval and obtained the following result:
yolov11s.hef: mAP50-95 = 45%

When I run hailomz eval, I get 45% mAP, which matches the number shown in the Hailo Model Zoo. However, when the model is evaluated with the DeGirum library, it shows 38% mAP, as mentioned in my question above. I assume something might be missing or different in the JSON configuration. Could you please check the attached JSON and script files?

Additionally, for custom model compilations with a different number of classes, what exactly needs to be changed in order to enable proper evaluation after compilation?

Thank you.

JSON:

{
  "ConfigVersion": 11,
  "Checksum": "da96ad3b3730500d56c8e13d164d44a78eb6f062516717d4c4195f7995a8c391",
  "DEVICE": [
    {
      "DeviceType": "HAILO8",
      "RuntimeAgent": "HAILORT",
      "SupportedDeviceTypes": "HAILORT/HAILO8L, HAILORT/HAILO8"
    }
  ],
  "PRE_PROCESS": [
    {
      "InputN": 1,
      "InputH": 640,
      "InputW": 640,
      "InputC": 3,
      "InputQuantEn": true
    }
  ],
  "MODEL_PARAMETERS": [
    {
      "ModelPath": "yolov11s-80class.hef"
    }
  ],
  "POST_PROCESS": [
    {
      "OutputPostprocessType": "DetectionYoloHailo",
      "OutputNumClasses": 80,
      "LabelsPath": "labels_yolov11.json"
    }
  ]
}

import degirum as dg
import degirum_tools
from degirum_tools.detection_eval import ObjectDetectionModelEvaluator
import numpy as np 
# Load the detection model
model = dg.load_model(
    model_name="yolov11s-80class",
    inference_host_address="@local",
    zoo_url="/home/Downloads/vutl1-hailo-test/models/yolov11s-80class/yolov11s-80class.json",
    token=''
)

# Optional class ID remapping: model → COCO
classmap = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
            27, 28, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51,
            52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 70, 72, 73, 74, 75, 76, 77,
            78, 79, 80, 81, 82, 84, 85, 86, 87, 88, 89, 90]

# Create evaluator
evaluator = ObjectDetectionModelEvaluator(model, classmap=classmap)

# Evaluation inputs
image_dir = "/home/Downloads/vutl1-hailo-test/datasets/vutlval2017/val2017"
coco_json = "/home/Downloads/vutl1-hailo-test/datasets/vutlval2017/annotations/instances_val2017.json"

# Evaluate and return mAP results
results = evaluator.evaluate(image_dir, coco_json, max_images=0)

# Print COCO-style mAP results
# print("COCO mAP stats:", results[0])

metric_labels = [
    "mAP@[IoU=0.50:0.95]",
    "mAP@0.50",
    "mAP@0.75",
    "mAP_small",
    "mAP_medium",
    "mAP_large",
    "AR@1",
    "AR@10",
    "AR@100",
    "AR_small",
    "AR_medium",
    "AR_large"
]

# Extract and print with metric names
print("COCO mAP/Eval Results:\n")
for label, value in zip(metric_labels, results[0]):
    print(f"{label:<20}: {value:.4f}")

# Compute overall statistics
mean_val = np.mean(results[0])
max_val = np.max(results[0])
min_val = np.min(results[0])

print("\nSummary Statistics:")
print(f"Mean: {mean_val:.4f}")
print(f"Max:  {max_val:.4f}")
print(f"Min:  {min_val:.4f}")


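One detail worth noting in the script above: COCO category IDs are not contiguous (IDs 12, 26, 29, 30, and several others are unused), so the 80-entry classmap translates the model's contiguous class index into the official COCO category ID. A minimal illustration of the mapping (same list as in the script):

```python
# Same 80-entry mapping as in the evaluation script:
# index = model class (0..79), value = official COCO category ID (1..90, with gaps)
classmap = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
            27, 28, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51,
            52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 70, 72, 73, 74, 75, 76, 77,
            78, 79, 80, 81, 82, 84, 85, 86, 87, 88, 89, 90]

def to_coco_category(model_class_id: int) -> int:
    """Map the model's contiguous class index (0..79) to the COCO category ID."""
    return classmap[model_class_id]

print(to_coco_category(0))   # 1  (person)
print(to_coco_category(79))  # 90 (toothbrush)
```

Getting this mapping wrong (or omitting it) silently mismatches predictions against ground-truth category IDs and drags mAP down, so it is worth double-checking whenever the class set changes.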

Hi @LongVu-Tr

COCO is usually evaluated at specific thresholds, which have not been set in your evaluation script.

After loading the model, you need to set the output confidence threshold, the NMS threshold, and the maximum number of detections:

model.output_confidence_threshold = 0.001
model.output_nms_threshold = 0.7
model.output_max_detections = 300
model.output_max_detections_per_class = 300

This changes the mAP50:95 from 0.3860027 to 0.44955649 on my end.
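As a toy illustration of why the low confidence threshold matters (hypothetical scores and a hypothetical 0.3 runtime default, not this model's actual settings): COCO AP integrates the full precision-recall curve, so low-score detections still contribute recall, and filtering them out before evaluation caps the achievable mAP.

```python
# Hypothetical detection scores for one image (illustration only, not real model output)
scores = [0.92, 0.81, 0.40, 0.15, 0.03]

def kept(scores, threshold):
    """Detections that survive a confidence cutoff and reach the COCO evaluator."""
    return [s for s in scores if s >= threshold]

print(len(kept(scores, 0.3)))    # 3 detections reach the evaluator
print(len(kept(scores, 0.001)))  # all 5 reach it; low-score hits can still add recall
```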

This worked for my script. Thank you all for the assistance!
