SUSE AI Observability Extension

With the SUSE AI Observability Extension, SUSE AI can be monitored from SUSE Observability (SUSE O11y).

Reference: https://documentation.suse.com/suse-ai/1.0/html/AI-deployment-intro/index.html

The deployment flow, at a high level:

1. Install the NVIDIA driver and the NVIDIA Container Toolkit on the GPU node
2. Deploy the RKE2 cluster and install the GPU Operator
3. Deploy SUSE Observability, the SUSE O11y agent, and the SUSE AI Observability Extension
4. Deploy an OpenTelemetry Collector
5. Deploy SUSE AI and verify the telemetry with an instrumented application and the GenAI Observability demo

In this setup, SUSE O11y and SUSE AI are deployed in the same RKE2 cluster, and the cluster needs a GPU node.

Install the NVIDIA Driver and NVIDIA Container Toolkit on the GPU Node

Reference: https://warnerchen.github.io/2024/12/17/RKE-RKE2-%E8%8A%82%E7%82%B9%E9%85%8D%E7%BD%AE-Nvidia-Container-Toolkit

root@gpu-0:~# nvidia-smi
Thu Aug 14 14:51:56 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169 Driver Version: 570.169 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P4 Off | 00000000:03:00.0 Off | 0 |
| N/A 41C P8 7W / 75W | 3MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Deploy the RKE2 Cluster

The deployment steps themselves are omitted here; the GPU node just needs to be registered into the RKE2 cluster:

root@suse-o11y-01:~# kubectl get nodes
NAME STATUS ROLES AGE VERSION
gpu-0 Ready worker 2d3h v1.30.13+rke2r1
suse-o11y-01 Ready control-plane,etcd,master,worker 28d v1.30.13+rke2r1
suse-o11y-02 Ready control-plane,etcd,master,worker 78d v1.30.13+rke2r1
suse-o11y-03 Ready control-plane,etcd,master,worker 78d v1.30.13+rke2r1

After the node has joined the cluster, add a label to it:

kubectl label node gpu-0 accelerator=nvidia-gpu

On the GPU node, run the following commands (this puts the NVIDIA Container Toolkit binaries on the rke2-agent PATH):

echo PATH=$PATH:/usr/local/nvidia/toolkit >> /etc/default/rke2-agent
systemctl restart rke2-agent
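
To confirm the label landed and the agent came back up cleanly, a quick check such as the following can be run (the first command from a node with kubectl access, the second on the GPU node):

kubectl get node gpu-0 -L accelerator
systemctl is-active rke2-agent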

Install the GPU Operator

Reference: https://warnerchen.github.io/2024/12/17/RKE-RKE2-%E8%8A%82%E7%82%B9%E9%85%8D%E7%BD%AE-Nvidia-Container-Toolkit/#%E5%AE%89%E8%A3%85-GPU-Operator

root@suse-o11y-01:~# helm -n gpu-operator ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator-v25-1754905550 gpu-operator 1 2025-08-11 09:45:50.859496738 +0000 UTC deployed gpu-operator-v25.3.2 v25.3.2
root@suse-o11y-01:~# kubectl -n gpu-operator get pod
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-wfqdb 1/1 Running 0 2d4h
gpu-operator-664ff6658f-w8kms 1/1 Running 0 2d21h
gpu-operator-v25-1754905550-node-feature-discovery-gc-8544t2xrm 1/1 Running 0 2d21h
gpu-operator-v25-1754905550-node-feature-discovery-master-rlmtl 1/1 Running 0 2d21h
gpu-operator-v25-1754905550-node-feature-discovery-worker-cwmnd 1/1 Running 0 2d21h
gpu-operator-v25-1754905550-node-feature-discovery-worker-ltnf2 1/1 Running 0 2d21h
gpu-operator-v25-1754905550-node-feature-discovery-worker-sv68s 1/1 Running 0 2d4h
gpu-operator-v25-1754905550-node-feature-discovery-worker-xhf67 1/1 Running 0 2d21h
nvidia-container-toolkit-daemonset-ktjc8 1/1 Running 0 2d4h
nvidia-cuda-validator-ztqzg 0/1 Completed 0 2d3h
nvidia-dcgm-exporter-2cxkm 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-xldvb 1/1 Running 0 2d4h
nvidia-operator-validator-fbvjf 1/1 Running 0 2d4h
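
To verify that the operator actually exposed the GPU as an allocatable resource, the node allocatable can be inspected and a throwaway pod that requests nvidia.com/gpu can be run; the CUDA image tag below is only an example:

kubectl get node gpu-0 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  # RuntimeClass created by the GPU Operator; drop this if nvidia is already the default runtime
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# once the pod has completed:
kubectl logs pod/gpu-smoke-test
kubectl delete pod gpu-smoke-test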

Deploy SUSE Observability

Reference: https://warnerchen.github.io/2025/03/03/SUSE-Observability-%E4%BD%BF%E7%94%A8%E9%9A%8F%E8%AE%B0/

Since SUSE AI is deployed in the same cluster, the SUSE O11y agent is installed here as well:

root@suse-o11y-01:~# kubectl -n suse-observability get pod
NAME READY STATUS RESTARTS AGE
suse-observability-agent-checks-agent-796bd7545c-85rjh 1/1 Running 0 2d
suse-observability-agent-cluster-agent-7dc4cf749f-pjrjl 1/1 Running 0 2d
suse-observability-agent-logs-agent-b59hf 1/1 Running 0 2d
suse-observability-agent-logs-agent-b6pfb 1/1 Running 0 2d
suse-observability-agent-logs-agent-bbscp 1/1 Running 0 2d
suse-observability-agent-logs-agent-zf5vb 1/1 Running 0 2d
suse-observability-agent-node-agent-9ftf2 2/2 Running 0 2d
suse-observability-agent-node-agent-h7fcb 2/2 Running 1 (41h ago) 2d
suse-observability-agent-node-agent-k5snl 2/2 Running 9 (43h ago) 2d
suse-observability-agent-node-agent-kplqr 2/2 Running 2 (33h ago) 2d
suse-observability-clickhouse-shard0-0 2/2 Running 0 3d
suse-observability-correlate-56c76b747b-xmfqb 1/1 Running 2 (3d ago) 3d
suse-observability-e2es-54dff667c9-8bszc 1/1 Running 2 (3d ago) 3d
suse-observability-elasticsearch-master-0 1/1 Running 0 3d
suse-observability-hbase-stackgraph-0 1/1 Running 0 3d
suse-observability-hbase-tephra-0 1/1 Running 0 3d
suse-observability-kafka-0 2/2 Running 10 (3d ago) 3d
suse-observability-kafkaup-operator-kafkaup-d7c687c47-49q7g 1/1 Running 0 3d
suse-observability-otel-collector-0 1/1 Running 0 3d
suse-observability-prometheus-elasticsearch-exporter-c4587sgltv 1/1 Running 0 3d
suse-observability-receiver-7cbffd6684-wqbzl 1/1 Running 2 (3d ago) 3d
suse-observability-router-9b56b74c7-s672n 1/1 Running 0 3d
suse-observability-server-69b9666fb9-4xtfs 1/1 Running 1 (3d ago) 3d
suse-observability-ui-7f87bff97b-ql6dn 2/2 Running 0 3d
suse-observability-victoria-metrics-0-0 1/1 Running 0 3d
suse-observability-vmagent-0 1/1 Running 0 3d
suse-observability-zookeeper-0 1/1 Running 0 3d

Enable the OpenTelemetry StackPack in the SUSE O11y UI.

Obtain SUSE Application Collection Credentials

The Helm charts and images for both the SUSE AI Observability Extension and SUSE AI come from the SUSE Application Collection, so credentials need to be prepared: https://docs.apps.rancher.io/get-started/authentication/

Use the credentials to create the image pull secret and log Helm into the registry:

kubectl create secret docker-registry application-collection \
  --docker-server=dp.apps.rancher.io \
  --docker-username=APPCO_USERNAME \
  --docker-password=APPCO_USER_TOKEN \
  -n <your_namespace>

helm registry login dp.apps.rancher.io/charts \
  -u APPCO_USERNAME \
  -p APPCO_USER_TOKEN
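
A quick way to confirm the login works is to query a chart's metadata without installing anything, for example:

helm show chart oci://dp.apps.rancher.io/charts/suse-ai-observability-extension --version 1.0.4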

Because pulling images from dp.apps.rancher.io can be slow from mainland China, a domestic mirror is available: https://artifacts.cnrancher.com/app-collection/

Deploy the SUSE AI Observability Extension

Prepare the values file for the SUSE AI Observability Extension:

cat <<EOF > genai_values.yaml
global:
  imagePullSecrets:
    - application-collection
# SUSE O11y runs in the same cluster, so the in-cluster service address can be used directly.
# If it runs in a different cluster, HTTPS is required, and self-signed certificates are not supported yet.
serverUrl: http://suse-observability-router.suse-observability.svc.cluster.local:8080
apiKey: xxx
tokenType: api
apiToken: xxx
clusterName: suse-o11y-with-suse-ai
EOF
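
The apiKey placeholder above is the SUSE O11y receiver API key, and apiToken is a token with API access created in SUSE O11y (for example a service token). Assuming the receiver key was set through the Helm values when SUSE O11y was installed, one way to look it up is:

helm -n suse-observability get values suse-observability --all | grep -i apikey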

Run the installation:

helm -n suse-observability upgrade --install suse-ai-observability-extension oci://dp.apps.rancher.io/charts/suse-ai-observability-extension -f genai_values.yaml --version 1.0.4
root@suse-o11y-01:~# helm -n suse-observability ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
suse-ai-observability-extension suse-observability 3 2025-08-12 15:02:30.393347003 +0800 CST deployed suse-ai-observability-extension-1.0.2 1.0.4
suse-observability suse-observability 1 2025-08-11 14:40:03.659351484 +0800 CST deployed suse-observability-2.3.7 7.0.0-snapshot.20250722112540-master-34e65ce
suse-observability-agent suse-observability 2 2025-08-12 15:12:18.710129279 +0800 CST deployed suse-observability-agent-1.0.56 3.0.0

root@suse-o11y-01:~# kubectl -n suse-observability get pod
NAME READY STATUS RESTARTS AGE
suse-ai-observability-extension-29252596-bk7st 0/1 Completed 0 86s
suse-ai-observability-extension-init-dgtx7 0/1 Completed 0 2d
...

Once the deployment finishes, GenAI Observability shows up in SUSE O11y.

Deploy the OpenTelemetry Collector

kubectl create namespace observability
kubectl create secret generic open-telemetry-collector \
  --namespace observability \
  --from-literal=API_KEY='SUSE_OBSERVABILITY_API_KEY'

Prepare the values file:

cat <<'EOF' > otel_values.yaml
global:
  imagePullSecrets:
    - application-collection
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector
mode: deployment
ports:
  metrics:
    enabled: true
presets:
  kubernetesAttributes:
    enabled: true
    extractAllPodLabels: true
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          # Scrape the DCGM exporter endpoints in the gpu-operator namespace for GPU metrics
          - job_name: 'gpu-metrics'
            scrape_interval: 10s
            scheme: http
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - gpu-operator
          # Scrape the Milvus metrics endpoint directly
          - job_name: 'milvus'
            scrape_interval: 15s
            metrics_path: '/metrics'
            static_configs:
              - targets: ['milvus.suse-private-ai.svc.cluster.local:9091']
  exporters:
    otlp:
      endpoint: https://suse-observability-otlp.warnerchen.com:443
      headers:
        Authorization: "SUSEObservability ${env:API_KEY}"
      tls:
        insecure_skip_verify: true
  processors:
    # Tail sampling caps trace volume at 500 spans/s, prioritizing errors and slow traces
    tail_sampling:
      decision_wait: 10s
      policies:
        - name: rate-limited-composite
          type: composite
          composite:
            max_total_spans_per_second: 500
            policy_order: [errors, slow-traces, rest]
            composite_sub_policy:
              - name: errors
                type: status_code
                status_code:
                  status_codes: [ ERROR ]
              - name: slow-traces
                type: latency
                latency:
                  threshold_ms: 1000
              - name: rest
                type: always_sample
            rate_allocation:
              - policy: errors
                percent: 33
              - policy: slow-traces
                percent: 33
              - policy: rest
                percent: 34
    resource:
      attributes:
        - key: k8s.cluster.name
          action: upsert
          value: suse-o11y-with-suse-ai
        - key: service.instance.id
          from_attribute: k8s.pod.uid
          action: insert
    # Drop spans that are missing the Kubernetes resource attributes SUSE O11y relies on
    filter/dropMissingK8sAttributes:
      error_mode: ignore
      traces:
        span:
          - resource.attributes["k8s.node.name"] == nil
          - resource.attributes["k8s.pod.uid"] == nil
          - resource.attributes["k8s.namespace.name"] == nil
          - resource.attributes["k8s.pod.name"] == nil
  connectors:
    spanmetrics:
      metrics_expiration: 5m
      namespace: otel_span
    # Fan incoming traces out to both the sampling and the span-metrics pipelines
    routing/traces:
      error_mode: ignore
      table:
        - statement: route()
          pipelines: [traces/sampling, traces/spanmetrics]
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        receivers: [otlp, jaeger]
        processors: [filter/dropMissingK8sAttributes, memory_limiter, resource]
        exporters: [routing/traces]
      traces/spanmetrics:
        receivers: [routing/traces]
        processors: []
        exporters: [spanmetrics]
      traces/sampling:
        receivers: [routing/traces]
        processors: [tail_sampling, batch]
        exporters: [debug, otlp]
      metrics:
        receivers: [otlp, spanmetrics, prometheus]
        processors: [memory_limiter, resource, batch]
        exporters: [debug, otlp]
EOF

Run the installation:

helm upgrade --install opentelemetry-collector \
  oci://dp.apps.rancher.io/charts/opentelemetry-collector \
  -f otel_values.yaml --namespace observability --version 0.131.0
root@suse-o11y-01:~# helm -n observability ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
opentelemetry-collector observability 2 2025-08-12 10:36:15.656142655 +0800 CST deployed opentelemetry-collector-0.130.1 0.131.0

root@suse-o11y-01:~# kubectl -n observability get pod
NAME READY STATUS RESTARTS AGE
opentelemetry-collector-fd675cb5d-lfnnc 1/1 Running 0 2d4h
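
Before relying on it, it is worth checking the collector for scrape or export errors (deployment name taken from the Helm release above):

kubectl -n observability logs deploy/opentelemetry-collector --tail=100 | grep -iE 'error|warn'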

The GPU metrics scrape job configured in the OpenTelemetry Collector needs custom RBAC rules so that the collector's service account can discover services and endpoints in the gpu-operator namespace:

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: suse-observability-otel-scraper
  # the scrape job discovers endpoints in the gpu-operator namespace, so the Role lives there
  namespace: gpu-operator
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: suse-observability-otel-scraper
  namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: suse-observability-otel-scraper
subjects:
  - kind: ServiceAccount
    name: opentelemetry-collector
    namespace: observability
EOF
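
Whether the binding grants what the scrape job needs can be verified with kubectl auth can-i, impersonating the collector's service account:

kubectl -n gpu-operator auth can-i list endpoints \
  --as=system:serviceaccount:observability:opentelemetry-collector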

Deploy SUSE AI

Reference: https://warnerchen.github.io/2025/02/19/SUSE-AI-%E4%BD%BF%E7%94%A8%E9%9A%8F%E8%AE%B0/

root@suse-o11y-01:~# helm -n suse-private-ai ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
milvus suse-private-ai 1 2025-08-12 10:15:32.150413955 +0800 CST deployed milvus-4.2.2 2.4.6
ollama suse-private-ai 1 2025-08-12 11:34:18.24052066 +0800 CST deployed ollama-1.16.0 0.6.8
open-webui suse-private-ai 2 2025-08-12 15:48:17.10667584 +0800 CST deployed open-webui-6.13.0 0.6.9

root@suse-o11y-01:~# kubectl -n suse-private-ai get pod
NAME READY STATUS RESTARTS AGE
milvus-datacoord-d46f8c674-67m2v 1/1 Running 2 (2d8h ago) 2d8h
milvus-datanode-58664f8c67-4ms2s 1/1 Running 2 (2d8h ago) 2d8h
milvus-etcd-0 1/1 Running 0 2d8h
milvus-indexcoord-849df85749-tvt2s 1/1 Running 0 2d8h
milvus-indexnode-7bbb6bd84-znpb6 1/1 Running 1 (2d8h ago) 2d8h
milvus-kafka-broker-0 1/1 Running 0 2d8h
milvus-kafka-controller-0 1/1 Running 0 2d8h
milvus-minio-7f9f9b4d76-949cr 1/1 Running 0 2d8h
milvus-proxy-57c54df4d5-ggcnk 1/1 Running 2 (2d8h ago) 2d8h
milvus-querycoord-85845fcf56-kt7h5 1/1 Running 2 (2d8h ago) 2d8h
milvus-querynode-67fff4f47d-82482 1/1 Running 2 (2d8h ago) 2d8h
milvus-rootcoord-bd7958978-ssrmz 1/1 Running 2 (2d8h ago) 2d8h
ollama-85cf4f777b-r9nkv 1/1 Running 0 79m
open-webui-0 1/1 Running 0 2m45s
open-webui-redis-7c65fd96bb-67n6s 1/1 Running 0 2d6h
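
The demo deployed later assumes the llama3.2:3b model is available in Ollama, so it is worth confirming the model has been pulled (deployment name assumed from the pod name above):

kubectl -n suse-private-ai exec deploy/ollama -- ollama list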

Test an Instrumented Application

Open WebUI Pipelines is a model inference/processing backend used to plug in model sources beyond the OpenAI API, such as a locally deployed Ollama.

The Python OpenLIT SDK is an observability instrumentation library built on top of OpenTelemetry, designed specifically for monitoring and tracing large language model (LLM) applications.

Here, Open WebUI Pipelines invokes Python code, the Python code calls Ollama, and the OpenLIT SDK emits telemetry that is sent to the OpenTelemetry Collector.

Python code reference: https://github.com/SUSE/suse-ai-observability-extension/blob/main/integrations/oi-pipelines/openlit.py

Open WebUI needs Pipelines enabled; adjust owui_custom_overrides.yaml as follows, then upgrade the release (a sketch of the upgrade follows the snippet):

pipelines:
  enabled: true
  # The URL can also be left unset and the pipeline uploaded later through the UI
  extraEnvVars:
    - name: PIPELINES_URLS
      value: "https://github.com/SUSE/suse-ai-observability-extension/blob/main/integrations/oi-pipelines/openlit.py"
    # If unset, the default is 0p3n-w3bu!
    - name: PIPELINES_API_KEY
      value: "xxx"
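
A minimal sketch of rolling the change out, assuming Open WebUI was installed from the SUSE Application Collection chart and owui_custom_overrides.yaml is the only override file in use:

helm -n suse-private-ai upgrade --install open-webui \
  oci://dp.apps.rancher.io/charts/open-webui \
  -f owui_custom_overrides.yaml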

After the upgrade, a Pipelines workload is created:

root@suse-o11y-01:~# kubectl -n suse-private-ai get pod | grep pipeline
open-webui-pipelines-5b87fb8b8b-fnn62 1/1 Running 0 44m
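
Connectivity can also be checked from the command line by port-forwarding the Pipelines service and listing the registered pipelines through its OpenAI-compatible API (service name and port assumed from the chart defaults):

kubectl -n suse-private-ai port-forward svc/open-webui-pipelines 9099:9099 &
curl -s -H 'Authorization: Bearer 0p3n-w3bu!' http://localhost:9099/v1/models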

Configure the credentials in Open WebUI; by default the password is 0p3n-w3bu!

On the Pipelines page, verify that Open WebUI can connect to the pipeline.

In a new chat, select the Instrumented model and try a conversation.

The data generated by the Python OpenLIT SDK then shows up in the SUSE O11y UI.

The GPU node monitoring view also shows GPU utilization information.

Deploy and Test the GenAI Observability Demo

Helm chart reference: https://github.com/SUSE/suse-ai-observability-extension/tree/main/setup/genai-observability-demo
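
The chart is referenced below as a local directory, so the repository presumably needs to be cloned first so that the genai-observability-demo chart path resolves:

git clone https://github.com/SUSE/suse-ai-observability-extension.git
cd suse-ai-observability-extension/setup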

Prepare the values file:

cat <<EOF > genai_observability_demo_values.yaml
ollamaEndpoint: 'http://ollama.suse-private-ai.svc.cluster.local:11434'
otlpEndpoint: 'http://opentelemetry-collector.observability.svc.cluster.local:4318'
model: "llama3.2:3b"
resources:
  requests:
    memory: '200Mi'
    cpu: '300m'
  limits:
    memory: '2000Mi'
    cpu: '4000m'
EOF
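
The two endpoints above point at the Ollama and OpenTelemetry Collector services deployed earlier; if in doubt, their existence can be confirmed first:

kubectl -n suse-private-ai get svc ollama
kubectl -n observability get svc opentelemetry-collector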

Run the installation:

helm -n suse-private-ai upgrade --install genai-observability-demo genai-observability-demo -f genai_observability_demo_values.yaml
root@suse-o11y-01:~# helm -n suse-private-ai ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
genai-observability-demo suse-private-ai 1 2025-08-15 14:02:52.04422985 +0800 CST deployed genai-observability-demo-0.1.2 0.0.6
milvus suse-private-ai 1 2025-08-12 10:15:32.150413955 +0800 CST deployed milvus-4.2.2 2.4.6
ollama suse-private-ai 1 2025-08-12 11:34:18.24052066 +0800 CST deployed ollama-1.16.0 0.6.8
open-webui suse-private-ai 11 2025-08-14 19:47:10.425548195 +0800 CST deployed open-webui-6.13.0 0.6.9

root@suse-o11y-01:~# kubectl get pod -n suse-private-ai -l app.kubernetes.io/name=genai-observability-demo
NAME READY STATUS RESTARTS AGE
genai-observability-demo-rag101-55f7b8db5f-8djqr 1/1 Running 0 6m53s
genai-observability-demo-rag102-5845f6bfc9-8d876 1/1 Running 0 6m53s
genai-observability-demo-simple-bb97d9bfb-5xp98 1/1 Running 0 6m53s

View the data in SUSE O11y.

Author: Warner Chen
Posted on: 2025-08-14
Updated on: 2025-08-15
