Hailo guide 4: Simplifying instance segmentation on a Hailo device using DeGirum PySDK

This guide demonstrates how to leverage PySDK’s built-in segmentation post-processing—integrated in C++—to run YOLOv8/YOLO11 segmentation models on Hailo devices. With minimal configuration, you can run inference and visualize segmentation outputs, including class labels. Although this example uses a model trained on the COCO dataset, the method works for any YOLOv8/YOLO11 segmentation model with appropriate modifications.

Tip: If you are new to PySDK, consider reviewing our previous user guide before diving into segmentation.

Overview of the inference pipeline

For segmentation models, the inference pipeline consists of:

  1. Pre-processing:
    Resize and format the input image (e.g., letterbox padding, bilinear interpolation, quantization) to match model requirements.
  2. Inference:
    Run the YOLOv8/YOLO11 segmentation model (compiled into a .hef file) on the Hailo device.
  3. Post-processing:
    The integrated C++ postprocessing converts the model’s raw outputs into a segmentation mask overlay. To enable this processing, specify "OutputPostprocessType": "SegmentationYoloV8" in the JSON configuration. In this guide, a COCO labels file is provided for human-readable output.
  4. Visualization:
    The processed segmentation overlay is provided as an image, which can be displayed using tools such as OpenCV.

A simple diagram of the pipeline:

Input image
    │
    ▼
Pre-processing (resize, letterbox, quantize)
    │
    ▼
Model inference (.hef file on Hailo device)
    │
    ▼
Built-in post-processing (SegmentationYoloV8 in C++)
    │
    ▼
Segmentation overlay (with COCO labels)
    │
    ▼
Visualization (e.g., via OpenCV)

What you’ll need

Ensure you have the following prerequisites:

  1. Hailo AI Accelerator:
    A Hailo8 or Hailo8L device. The host system can be x86 or an Arm-based system (e.g., Raspberry Pi).
  2. Drivers and software tools:
    Install the necessary drivers and follow the Hailo + PySDK setup instructions.
  3. Segmentation model file (.hef):
    A YOLOv8/YOLO11 segmentation model trained on COCO, compiled into a .hef file. For example, you can use yolov8n_seg.hef available at Hailo Model Zoo.
  4. Input image:
    An image on which to run segmentation. For instance, download this Cat Image. Feel free to experiment with your own images.
  5. COCO labels file (labels_coco.json):
    A file mapping the 80 COCO classes to human-readable labels. You can download this from Hugging Face or another trusted source.

Summary

We’ll walk you through the key steps to run segmentation inference on a Hailo device using DeGirum PySDK:

  • Configuring the model JSON file: Set up your JSON file to define pre-processing parameters, specify the segmentation model file, and enable the built-in C++ postprocessing (using "OutputPostprocessType": "SegmentationYoloV8"). This section also covers setting the number of classes, a confidence threshold, and the "SigmoidOnCLS" flag if required by your model.
  • Preparing the model zoo: Organize your model assets—including the JSON configuration file, the .hef model file, and the COCO labels file—into a structured directory for easy access and management by PySDK.
  • Running inference: Load the segmentation model from the model zoo, execute inference on an input image, and obtain a segmentation overlay that visually represents the segmentation masks along with human-readable class labels.
  • Visualizing the output: Use tools such as OpenCV to display the segmentation overlay, enabling you to review and analyze the segmented regions.

By following these steps, you can seamlessly deploy and visualize YOLOv8/YOLO11 segmentation models on Hailo devices using PySDK.

Configuring the model JSON file

Since the segmentation postprocessing is integrated in C++ with PySDK, the JSON configuration is straightforward. In addition to specifying pre-processing and the model file, you will also provide the COCO labels file, the number of classes, and a confidence threshold to filter low-probability detections.

Example model JSON (yolov8n_seg.json)

{
    "ConfigVersion": 10,
    "Checksum": "5ccc384699f608188621975c0121aa1f01aa4398af30a00100474bae964195a8",
    "DEVICE": [
        {
            "DeviceType": "HAILO8",
            "RuntimeAgent": "HAILORT",
            "SupportedDeviceTypes": "HAILORT/HAILO8"
        }
    ],
    "PRE_PROCESS": [
        {
            "InputType": "Image",
            "InputN": 1,
            "InputH": 640,
            "InputW": 640,
            "InputC": 3,
            "InputPadMethod": "letterbox",
            "InputResizeMethod": "bilinear",
            "InputQuantEn": true
        }
    ],
    "MODEL_PARAMETERS": [
        {
            "ModelPath": "yolov8n_seg.hef"
        }
    ],
    "POST_PROCESS": [
        {
            "OutputPostprocessType": "SegmentationYoloV8",
            "LabelsPath": "labels_coco.json",
            "OutputNumClasses": 80,
            "OutputConfThreshold": 0.3,
            "SigmoidOnCLS": true
        }
    ]
}

Key points

  • Pre-processing section:
    The input image is resized to 1 x 640 x 640 x 3 using letterbox padding and bilinear interpolation, with quantization enabled.
  • Model parameters section:
    Specifies the segmentation model file (yolov8n_seg.hef).
  • Post-processing section:
    • "OutputPostprocessType": "SegmentationYoloV8" activates the built-in C++ segmentation postprocessing. This setting works for both YOLOv8 and YOLO11 models.
    • "LabelsPath": "labels_coco.json" provides the COCO labels for human-readable output.
    • "OutputNumClasses": 80 specifies the number of classes in the model.
    • "OutputConfThreshold": 0.3 filters out detections below the confidence threshold.
    • Understanding SigmoidOnCLS:
      The "SigmoidOnCLS": true flag indicates that a sigmoid activation is applied on certain output layers. This flag is necessary when models are compiled with vendor-specific settings that apply sigmoid activations; adjust this flag as needed for your model.

Note that "Checksum" is a required field, but it can be any dummy value: when the model is used locally, the value does not matter, and when the model is uploaded to the AI Hub, the correct checksum is calculated automatically.
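
Most of the post-processing and overlay parameters described above can also be overridden at load time instead of editing the JSON, which is handy for quick experiments. The sketch below assumes the PySDK model property names output_confidence_threshold and overlay_show_labels; check the PySDK Model documentation for the exact names supported by your version.

import degirum as dg

# Load the segmentation model; '<path_to_model_zoo>' is a placeholder for your zoo directory.
model = dg.load_model(
    model_name='yolov8n_seg',
    inference_host_address='@local',
    zoo_url='<path_to_model_zoo>'
)

# Assumed property names; check dir(model) or the PySDK docs for your version.
model.output_confidence_threshold = 0.5  # tighter than the 0.3 set in the JSON
model.overlay_show_labels = True         # draw the COCO class labels on the overlay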

Preparing the model zoo

A model zoo is a structured repository of model assets (configuration JSON files, model files, post-processor code, and labels) that simplifies model management. To organize your assets:

  1. Save the JSON configuration as yolov8n_seg.json.
  2. Place the segmentation model file (yolov8n_seg.hef) in the same directory.
  3. Include the COCO labels file as labels_coco.json.

Your directory structure might look like:

/path/to/model_zoo/
├── yolov8n_seg.json
├── yolov8n_seg.hef
└── labels_coco.json

Tip: For easier maintenance, you can organize models into separate subdirectories. PySDK will automatically search for model JSON files in all subdirectories inside the directory specified by the zoo_url.
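
To confirm that PySDK can see your model assets, you can list the models it discovers in the zoo directory before running inference. The sketch below assumes the dg.list_models() helper; '<path_to_model_zoo>' is a placeholder for your actual directory.

import degirum as dg

# List the models whose JSON files PySDK discovered under the zoo directory.
models = dg.list_models(
    inference_host_address='@local',
    zoo_url='<path_to_model_zoo>'
)
print(models)  # 'yolov8n_seg' should appear in the output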

Running inference

Once your model zoo is set up, running inference is nearly identical to the process for detection models. The inference output includes an overlay image with segmentation masks and COCO labels.

Python code example

import degirum as dg
import cv2

# Load the segmentation model from the model zoo.
# Replace '<path_to_model_zoo>' with the directory path to your model assets.
model = dg.load_model(
    model_name='yolov8n_seg',
    inference_host_address='@local',
    zoo_url='<path_to_model_zoo>'
)

# Run inference on an input image.
# Replace '<path_to_input_image>' with the actual path to your image.
inference_result = model('<path_to_input_image>')

# The segmentation overlay (with masks and labels) is available via the image_overlay attribute.
cv2.imshow("Segmentation Output", inference_result.image_overlay)

# Wait until the user presses 'x' or 'q' to close the window.
while True:
    key = cv2.waitKey(0) & 0xFF
    if key == ord('x') or key == ord('q'):
        break

cv2.destroyAllWindows()
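
If you are running on a headless system (for example, a Raspberry Pi accessed over SSH) where cv2.imshow is not practical, you can write the overlay to disk instead. The file name segmentation_output.jpg below is just an example.

import degirum as dg
import cv2

# Same load-and-infer steps as above; replace the placeholders as before.
model = dg.load_model(
    model_name='yolov8n_seg',
    inference_host_address='@local',
    zoo_url='<path_to_model_zoo>'
)
inference_result = model('<path_to_input_image>')

# Write the overlay to disk instead of opening a display window.
cv2.imwrite('segmentation_output.jpg', inference_result.image_overlay)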

Expected output

The displayed window should show the input image with segmentation masks overlaid. Each segment will be highlighted with distinct colors, and the COCO class labels will be visible on the overlay.

Example output showing a cat image with segmented regions and labeled classes.
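
Beyond the visual overlay, each segmented instance is also available as a dictionary in inference_result.results, so you can post-process the detections programmatically. The key names used below ('label', 'score', 'bbox') are typical for PySDK detection-style results but may vary by model and version, so print one raw entry to confirm.

# Continuing from the inference example above: print a text summary of each instance.
for instance in inference_result.results:
    label = instance.get('label', '<unknown>')
    score = instance.get('score', 0.0)
    print(f"{label}: confidence {score:.2f}, bbox {instance.get('bbox')}")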

Troubleshooting and debug tips

  • Verify file paths:
    Ensure that the JSON configuration file, model file, and labels file are in the correct locations, and that the paths specified in the JSON (e.g., "ModelPath" and "LabelsPath") match the actual file names; see the sanity-check snippet after this list.
  • Input dimensions:
    Confirm that the dimensions specified in the PRE_PROCESS section (e.g., 640×640) match your model’s input requirements.
  • Number of classes:
    Double-check that "OutputNumClasses" is correctly set to the number of classes your model detects.
  • SigmoidOnCLS flag:
    If you experience unexpected behavior in the postprocessed output, verify that the "SigmoidOnCLS" flag is correctly configured for your model’s compiled settings.
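
A quick way to catch most path and naming issues from the list above is to sanity-check the zoo directory with the Python standard library before loading the model; '<path_to_model_zoo>' is a placeholder.

import json
from pathlib import Path

# Point this at your model zoo directory.
zoo_dir = Path('<path_to_model_zoo>')
config_path = zoo_dir / 'yolov8n_seg.json'

# Read the model JSON and check that the files it references actually exist.
config = json.loads(config_path.read_text())
model_file = zoo_dir / config['MODEL_PARAMETERS'][0]['ModelPath']
labels_file = zoo_dir / config['POST_PROCESS'][0]['LabelsPath']

for path in (config_path, model_file, labels_file):
    print(f"{path.name}: {'found' if path.exists() else 'MISSING'}")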

Conclusion

This guide has shown you how to run YOLOv8/YOLO11 segmentation models on Hailo devices using DeGirum PySDK. By simply specifying "OutputPostprocessType": "SegmentationYoloV8", providing a COCO labels file (labels_coco.json), and correctly setting parameters such as "OutputNumClasses", "OutputConfThreshold", and "SigmoidOnCLS", you enable the built-in C++ segmentation postprocessing. This converts raw model outputs into visually interpretable segmentation masks with human-readable labels.

The method outlined here also applies to any custom YOLOv8/YOLO11 segmentation model compiled for Hailo devices—simply adjust the JSON configuration (especially the labels file and number of classes) to match your model’s specifications.


Hi Shashi,
Would it be possible to apply a person detector and its tracker first, and then use a segmentation model with only one person ID that matches some requirement?
If so, could you show a small example?
As usual, thanks for your great support.
Juan Hidalgo

Hi @jhidalgo
Just to make sure we understand your use case, can you confirm this is the flow you have in mind?

video source-->person detector-->tracker-->crop person with specific track id-->segmentation on cropped image

Well, that's an alternative flow. I thought I would just pass the whole image and the ID of the selected person, and finally get from the segmentation model the complete image with only the selected person shown in the segmentation overlay…

Hi @jhidalgo
How will you pass the whole image and the ID for the selected person without tracking? Who provides the ID?

Sorry, I forgot to include the detector + tracker.
This is my flow:
Video Source —> person detector —> tracker —> select person and get its ID —> segmentation (image, ID)

Is this possible?

Hi @jhidalgo
I see this flow as the same as the one I proposed. Maybe I am missing something. Can you tell me what "select person and get its ID" means? How would you do this in a program? Is this an interactive application?

Yes, your flow does the final work I need, but after segmentation I would have to reinsert the cropped part into the original image.
What I'm trying to do is this flow:
Video Source —> person keypoint detection —> person tracker —> find the persons that are looking at the camera and get their IDs —> pass the whole image and the IDs to the segmentation model

Hi @jhidalgo
Understood. Please note that in the flow you suggested there are no savings in computation or increase in accuracy, because we cannot really ask the model to segment only one instance. The model will segment all people, and we can choose to filter out all the other masks. If you are not looking for savings in computation or an increase in accuracy, you can just run segmentation on the video source, add a tracker, and filter for the person looking at the camera. This way you have only one model instead of two, and your latency will be much lower.

Will running segmentation on the video source give me the keypoints of the detected persons?

I will try your suggested flow.
One more question: how can I annotate only the person with a known ID in the overlaid image?

Hi @jhidalgo
Sorry, I see you need both keypoints and segmentation. You can run both models on the video source, add a tracker, and filter the masks and keypoints for the ID you are interested in.
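
A rough sketch of that filtering step might look like the following. It assumes the tracker analyzer adds a track_id key to each detection dictionary and that the results list can be edited in place before the overlay is rendered; please verify both against your degirum_tools version.

# Hypothetical filtering step: keep only the instance with the selected track ID.
# 'result' is an inference result produced with the tracker analyzer attached.
selected_id = 7  # example track ID chosen by your selection logic

kept = [
    obj for obj in result.results
    if obj.get('track_id') == selected_id  # 'track_id' is assumed to be added by the tracker
]
result.results.clear()
result.results.extend(kept)

# The overlay is rendered from the filtered results.
annotated = result.image_overlay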

Ok, I will try anyway.
My final goal is to get an overlayed image with only the persons that look at the camera painted in green just like the output you get in this Guide 4

@jhidalgo, what is your plan to determine which persons are looking at the camera? For example, how would you distinguish a person standing with their back to the camera from a person looking into the camera?

Hi vladk, I'm just interested in people who look into the camera; that's why I need to know the position of their eyes…

Do you want to do segmentation of all people looking into the camera, or just a single one?

I am not aware of the existence of an official YOLO model that simultaneously does pose detection and semantic segmentation of a person. The models we have in our zoo do either object detection + pose estimation or object detection + semantic segmentation, but not all three.
Do you have such a model trained?

Hi, I don't have such a model. I'd like to segment only a single one: someone who is looking into the camera and is close. The distance between the eyes is an approximation of how close to the camera you are. Any suggestions?

Hi @jhidalgo ,

The following script uses the DeGirum PySDK and DeGirum Tools packages to do what you requested. You need to install the latest version (0.16.7) of degirum_tools.

You may need to adjust some parameters at the beginning of the script to select the inference location, model zoo, models, etc. If you use the cloud zoo, you will also need a token
(you may paste it in place of the degirum_tools.get_token() expression or put it into an env.ini file).

Set use_tracking to False to select the camera-facing person with the biggest bbox on every frame.
Set use_tracking to True to make this selection once and keep tracking that person until they leave the frame, then select the next person to track.

Camera facing is detected by checking the scores of the eye and nose keypoints: if they are all bigger than 0.5, the face is toward the camera (see the face_in_front() function).

This example uses Streams, a lightweight multi-threading pipeline framework developed by DeGirum. You can read the docs at this link: Streams | DeGirum Docs

The example uses a video clip with walking people. You can use a local camera instead by assigning video_source = 0 (or any other video source such as a file, URL, RTSP stream, etc.).

import degirum as dg, degirum_tools, numpy as np, cv2
from degirum_tools import streams as dgstreams

# adjust all these parameters to your needs
hw_location = "@cloud"
model_zoo_url = "degirum/public"
person_model_name = "yolov8n_relu6_coco_pose--640x640_quant_n2x_orca1_1"
segm_model_name = "yolov8n_relu6_coco_seg--640x640_quant_n2x_orca1_1"
video_source = "https://raw.githubusercontent.com/DeGirum/PySDKExamples/main/images/WalkingPeople2.mp4"
use_tracking = False  # set to False to disable object tracking and always select the biggest person detected

# connect to the model zoo
zoo = dg.connect(
    inference_host_address=hw_location,
    zoo_url=model_zoo_url,
    token=degirum_tools.get_token(),
)

# load person/pose detection model
person_model = zoo.load_model(person_model_name, overlay_line_width=1)
person_label = person_model.label_dictionary[0]

# load segmentation model
segm_model = zoo.load_model(
    segm_model_name,
    output_class_set={person_label},
    # adjust some overlay parameters for better visualization
    overlay_show_probabilities=False,
    overlay_show_labels=False,
    overlay_line_width=0,
    overlay_color=(0, 255, 0),
)
assert person_label in segm_model.label_dictionary.values()

# create object tracker analyzer to track objects
tracker = degirum_tools.ObjectTracker(
    track_thresh=0.35,
    match_thresh=0.9999,
    anchor_point=degirum_tools.AnchorPoint.CENTER,
    show_overlay=False,
)


def face_in_front(obj, result):
    """Return person bbox area when person is facing the camera."""
    keypoints = obj["landmarks"]
    # if nose and both eyes have good score then the person is facing the camera
    front_facing = all(keypoints[i]["score"] > 0.5 for i in range(3))
    # return bbox area if the person is facing the camera, otherwise return 0
    return degirum_tools.math_support.area(np.array(obj["bbox"])) if front_facing else 0


# create object selector analyzer to select only one person
selector = degirum_tools.ObjectSelector(
    top_k=1,
    selection_strategy=degirum_tools.ObjectSelectionStrategies.CUSTOM_METRIC,
    # define custom metric to select the object of interest: object with highest metric value is selected
    custom_metric=face_in_front,
    use_tracking=use_tracking,
    tracking_timeout=3,
    show_overlay=False,
)

# attach object tracker and object selector analyzers to person detection model
degirum_tools.attach_analyzers(person_model, [tracker, selector])


# this gizmo overlays segmentation results on the original image
# more on gizmos here: https://docs.degirum.com/degirum-tools/overview/streams/streams_base#gizmo
class SegmentationOverlayGizmo(dgstreams.Gizmo):
    def __init__(self):
        super().__init__([(10, False)])

    def run(self):
        for data in self.get_input(0):
            if self._abort:
                break
            # find the result of segmentation model: it should be the last meta with tag inference
            segm_result = data.meta.find_last(dgstreams.tag_inference)
            # find the result of cropping: it should be the last meta with tag crop
            crop = data.meta.find_last(dgstreams.tag_crop)
            # extract the result of person detector from cropping result
            orig_result = crop[
                dgstreams.AiObjectDetectionCroppingGizmo.key_original_result
            ]
            # extract the index of cropped object from cropping result
            bbox_idx = crop[dgstreams.AiObjectDetectionCroppingGizmo.key_cropped_index]
            # get bounding box coordinates of the cropped object
            x, y = np.clip(
                np.array(orig_result.results[bbox_idx]["bbox"]).astype(int)[:2], 0, None
            )
            # generate annotated image for the original result
            orig_img = orig_result.image_overlay
            # generate annotated image for the segmentation result
            segm_img = segm_result.image_overlay
            # blend them
            h, w = segm_img.shape[:2]
            try:
                orig_img[y : y + h, x : x + w] = cv2.addWeighted(
                    orig_img[y : y + h, x : x + w], 0.5, segm_img, 0.5, 0
                )
            except Exception:
                # skip blending if the crop falls partially outside the original image
                pass
            # send blended result
            self.send_result(dgstreams.StreamData(orig_img, data.meta))


# create gizmos
source = dgstreams.VideoSourceGizmo(video_source)  # video source
person = dgstreams.AiSimpleGizmo(person_model)  # person detector
crop = dgstreams.AiObjectDetectionCroppingGizmo([person_label])  # cropper
segm = dgstreams.AiSimpleGizmo(segm_model)  # segmentation
overlay = SegmentationOverlayGizmo()
display = dgstreams.VideoDisplayGizmo()  # display

# create pipeline and composition, then start it
dgstreams.Composition(source >> person >> crop >> segm >> overlay >> display).start()

Wow! What a program! I will check it out…
Any chance to use it with a Raspberry Pi camera? Thank you very much.