Collecting Service/Pod Monitor Metrics with the Target Allocator

Scenario: by default, the SUSE O11y Agent does not collect the metrics scraped via Service/Pod Monitors. The simplest workaround is to ship those metrics to the VictoriaMetrics instance in the SUSE O11y Server via Prometheus remote write, but running both Prometheus and VictoriaMetrics in the cluster wastes resources. Instead, the OpenTelemetry Target Allocator can take over from Prometheus and collect these metrics.

References:

  1. https://github.com/open-telemetry/opentelemetry-operator/blob/main/cmd/otel-allocator/README.md#target-allocator
  2. https://opentelemetry.io/docs/platforms/kubernetes/operator/troubleshooting/target-allocator/

First, install the OpenTelemetry Operator in the cluster; see: https://docs.stackstate.com/open-telemetry/getting-started/getting-started-k8s-operator

Create the RBAC resources used by the Target Allocator (TA):

```bash
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opentelemetry-targetallocator-cluster-role
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  verbs: ["get"]
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
- apiGroups:
  - monitoring.coreos.com
  resources:
  - servicemonitors
  - podmonitors
  - probes
  - scrapeconfigs
  verbs:
  - '*'
- apiGroups: [""]
  resources:
  - namespaces
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: opentelemetry-targetallocator-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: opentelemetry-targetallocator-cluster-role
subjects:
- kind: ServiceAccount
  name: opentelemetry-targetallocator-sa
  namespace: open-telemetry
---
apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  name: opentelemetry-targetallocator-sa
  namespace: open-telemetry
EOF
```
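The rule that matters most for the TA is the `monitoring.coreos.com` one: discovery of Service/Pod Monitors fails silently if the ServiceAccount cannot get/list/watch those CRs. The RBAC matching logic can be sketched in Python to check a rule set before applying it; `allows` is a hypothetical helper written for this post, not part of any Kubernetes client library:

```python
# Mirror of the monitoring.coreos.com rule from the ClusterRole above.
rules = [
    {
        "apiGroups": ["monitoring.coreos.com"],
        "resources": ["servicemonitors", "podmonitors", "probes", "scrapeconfigs"],
        "verbs": ["*"],
    },
]

def allows(rules, group, resource, verb):
    """Return True if any rule grants `verb` on `group`/`resource`."""
    for rule in rules:
        if group not in rule.get("apiGroups", []):
            continue
        if resource not in rule.get("resources", []):
            continue
        # A literal "*" verb grants everything, as in Kubernetes RBAC.
        if "*" in rule["verbs"] or verb in rule["verbs"]:
            return True
    return False

# The TA needs get/list/watch on both monitor kinds.
for resource in ("servicemonitors", "podmonitors"):
    for verb in ("get", "list", "watch"):
        assert allows(rules, "monitoring.coreos.com", resource, verb)
```

The same check against a rule set missing the `monitoring.coreos.com` group would return `False`, which is the usual symptom behind an empty `/jobs` response later on.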

Deploy the OpenTelemetry Collector; note that `mode` must be `statefulset` or `daemonset`. See: https://docs.stackstate.com/open-telemetry/getting-started/getting-started-k8s-operator#the-open-telemetry-collector

```bash
kubectl create namespace open-telemetry
kubectl create secret generic open-telemetry-collector \
  --namespace open-telemetry \
  --from-literal=API_KEY='<suse-observability-api-key>'
```
```bash
cat <<EOF | kubectl apply -f -
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: open-telemetry
spec:
  config:
    connectors:
      spanmetrics:
        metrics_expiration: 5m
        namespace: otel_span
    exporters:
      debug: {}
      nop: {}
      otlp/suse-observability:
        auth:
          authenticator: bearertokenauth
        # Replace with the actual OTLP ingress host of your SUSE O11y Server cluster
        endpoint: suse-observability-otlp.warnerchen.com:443
        tls:
          insecure_skip_verify: true
        compression: snappy
      otlphttp/suse-observability:
        auth:
          authenticator: bearertokenauth
        # Replace with the actual OTLP HTTP ingress host of your SUSE O11y Server cluster
        endpoint: https://suse-observability-otlp-http.warnerchen.com
        tls:
          insecure_skip_verify: true
        compression: snappy
    extensions:
      bearertokenauth:
        scheme: SUSEObservability
        token: "${env:API_KEY}"
      health_check:
        endpoint: 0.0.0.0:13133
        path: /
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
      resource:
        attributes:
        - action: upsert
          key: k8s.cluster.name
          # Replace with your actual cluster name
          value: <your-cluster-name>
        - action: insert
          from_attribute: k8s.pod.uid
          key: service.instance.id
        - action: insert
          from_attribute: k8s.namespace.name
          key: service.namespace
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
            - targets:
              - 0.0.0.0:8888
    service:
      extensions:
      - health_check
      - bearertokenauth
      pipelines:
        logs:
          exporters:
          - nop
          receivers:
          - otlp
        metrics:
          exporters:
          - debug
          - otlp/suse-observability
          processors:
          - memory_limiter
          - resource
          - batch
          receivers:
          - otlp
          - spanmetrics
          - prometheus
        traces:
          exporters:
          - debug
          - spanmetrics
          - otlp/suse-observability
          processors:
          - memory_limiter
          - resource
          - batch
          receivers:
          - otlp
      telemetry:
        metrics:
          address: 0.0.0.0:8888
  configVersions: 3
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  envFrom:
  - secretRef:
      name: open-telemetry-collector
  image: otel/opentelemetry-collector-k8s:0.123.0
  ingress:
    route: {}
  ipFamilyPolicy: SingleStack
  managementState: managed
  # Run as a StatefulSet
  mode: statefulset
  observability:
    metrics: {}
  podDnsConfig: {}
  replicas: 1
  resources: {}
  targetAllocator:
    allocationStrategy: consistent-hashing
    # Enable the Target Allocator
    enabled: true
    filterStrategy: relabel-config
    observability:
      metrics: {}
    prometheusCR:
      # Enable discovery of Prometheus CRs
      enabled: true
      # Collect all Pod Monitors
      podMonitorSelector: {}
      scrapeInterval: 30s
      # Collect all Service Monitors
      serviceMonitorSelector: {}
    replicas: 1
    resources: {}
    # Use the ServiceAccount created earlier
    serviceAccount: opentelemetry-targetallocator-sa
  upgradeStrategy: automatic
EOF
```
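A common failure mode with collector configs is referencing a component under `service.pipelines` that was never declared in the top-level `receivers`/`processors`/`exporters` sections (connectors such as `spanmetrics` count as both a receiver and an exporter, which is why it appears on both sides above). That invariant can be sketched in Python; the `config` dict is a trimmed, hand-written stand-in for the spec above, not parsed from it:

```python
# Trimmed stand-in for the collector config above.
config = {
    "connectors": {"spanmetrics": {}},
    "exporters": {"debug": {}, "nop": {}, "otlp/suse-observability": {}},
    "processors": {"batch": {}, "memory_limiter": {}, "resource": {}},
    "receivers": {"otlp": {}, "prometheus": {}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp", "spanmetrics", "prometheus"],
                "processors": ["memory_limiter", "resource", "batch"],
                "exporters": ["debug", "otlp/suse-observability"],
            },
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "resource", "batch"],
                "exporters": ["debug", "spanmetrics", "otlp/suse-observability"],
            },
        }
    },
}

def undeclared_components(config):
    """Return pipeline references with no matching top-level declaration."""
    connectors = set(config.get("connectors", {}))
    declared = {
        # Connectors may be used as receivers and as exporters.
        "receivers": set(config["receivers"]) | connectors,
        "processors": set(config["processors"]),
        "exporters": set(config["exporters"]) | connectors,
    }
    missing = []
    for name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for ref in pipeline.get(kind, []):
                if ref not in declared[kind]:
                    missing.append((name, kind, ref))
    return missing

assert undeclared_components(config) == []
```

The real collector performs this validation at startup and refuses to start on a dangling reference, so it is worth checking before rolling the StatefulSet.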

Once deployed, check that the Pods are running:

```
root@test-1:~# kubectl -n open-telemetry get pod
NAME                                              READY   STATUS    RESTARTS   AGE
opentelemetry-operator-5576fd4499-wvxxr           2/2     Running   0          95m
otel-collector-collector-0                        1/1     Running   0          39m
otel-collector-targetallocator-7d977956fb-6f6nc   1/1     Running   0          14m
```

If the cluster contains any Pod/Service Monitors, the TA will pick them up automatically; its logs look like this:

```
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Starting the Target Allocator"}
{"level":"info","ts":"2025-05-15T07:46:43Z","logger":"allocator","msg":"Starting server..."}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Waiting for caches to sync for namespace"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Caches are synced for namespace"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Waiting for caches to sync for servicemonitors"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Caches are synced for servicemonitors"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Waiting for caches to sync for podmonitors"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Caches are synced for podmonitors"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Waiting for caches to sync for probes"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Caches are synced for probes"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Waiting for caches to sync for scrapeconfigs"}
{"level":"info","ts":"2025-05-15T07:46:43Z","msg":"Caches are synced for scrapeconfigs"}
{"level":"info","ts":"2025-05-15T07:46:48Z","logger":"allocator","msg":"Service Discovery watch event received","targets groups":1}
{"level":"info","ts":"2025-05-15T07:46:53Z","logger":"allocator","msg":"Service Discovery watch event received","targets groups":21}
{"level":"info","ts":"2025-05-15T07:56:48Z","logger":"allocator","msg":"Service Discovery watch event received","targets groups":21}
{"level":"info","ts":"2025-05-15T08:01:48Z","logger":"allocator","msg":"Service Discovery watch event received","targets groups":21}
```
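Since each TA log line is a standalone JSON object, the discovery progress is easy to inspect programmatically, e.g. by pulling out the latest "targets groups" count from the watch events. A minimal sketch, using two sample lines from the log above (`latest_target_groups` is a hypothetical helper for this post):

```python
import json

# Sample watch-event lines taken from the TA log output above.
log_lines = [
    '{"level":"info","ts":"2025-05-15T07:46:48Z","logger":"allocator","msg":"Service Discovery watch event received","targets groups":1}',
    '{"level":"info","ts":"2025-05-15T07:46:53Z","logger":"allocator","msg":"Service Discovery watch event received","targets groups":21}',
]

def latest_target_groups(lines):
    """Return the most recent "targets groups" count, or None if absent."""
    latest = None
    for line in lines:
        event = json.loads(line)
        if "targets groups" in event:
            latest = event["targets groups"]
    return latest

print(latest_target_groups(log_lines))  # → 21
```

A count that stays at 0 (or the absence of any watch events) usually points back at the RBAC or the `prometheusCR` selectors.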

The OpenTelemetry Collector logs look like this:

```
...
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kubelet/1"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-prometheus-node-exporter/0"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-fleet-system/monitoring-fleet-controller/0"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-coredns/0"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kubelet/2"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-prometheus/0"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-prometheus/1"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-windows-exporter/0"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-fleet-system/monitoring-gitops-controller/0"}
2025-05-15T07:38:13.850Z info targetallocator/manager.go:184 Scrape job added {"jobName": "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kube-controller-manager/0"}
...
```

You can also call the TA's API to check whether any Pod/Service Monitors have been discovered:

```
root@test-1:~# curl -s <otel-collector-targetallocator_cluster_ip>/jobs | jq
{
  "opentelemetry-collector": {
    "_link": "/jobs/opentelemetry-collector/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-alertmanager/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-alertmanager%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-apiserver/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-apiserver%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-coredns/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-coredns%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-grafana/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-grafana%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kube-controller-manager/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kube-controller-manager%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kube-etcd/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kube-etcd%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kube-proxy/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kube-proxy%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kube-scheduler/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kube-scheduler%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kube-state-metrics/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kube-state-metrics%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kubelet/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kubelet%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kubelet/1": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kubelet%2F1/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-kubelet/2": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-kubelet%2F2/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-operator/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-operator%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-prometheus-node-exporter/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-prometheus-node-exporter%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-prometheus/0": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-prometheus%2F0/targets"
  },
  "serviceMonitor/cattle-monitoring-system/rancher-monitoring-prometheus/1": {
    "_link": "/jobs/serviceMonitor%2Fcattle-monitoring-system%2Francher-monitoring-prometheus%2F1/targets"
  },
  "serviceMonitor/kube-system/rancher-monitoring-ingress-nginx/0": {
    "_link": "/jobs/serviceMonitor%2Fkube-system%2Francher-monitoring-ingress-nginx%2F0/targets"
  }
}
```
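Each `_link` above is just the job name percent-encoded (the `/` separators become `%2F`) and wrapped in `/jobs/<job>/targets`, so per-job target URLs can be built directly from a job name. A small sketch with the standard library (`targets_path` is a hypothetical helper, not a TA API):

```python
from urllib.parse import quote, unquote

def targets_path(job_name):
    """Build the /jobs/<job>/targets path for a TA job name.

    safe="" forces "/" to be encoded as %2F, matching the "_link"
    values in the /jobs response above.
    """
    return "/jobs/%s/targets" % quote(job_name, safe="")

job = "serviceMonitor/cattle-monitoring-system/rancher-monitoring-coredns/0"
assert targets_path(job) == (
    "/jobs/serviceMonitor%2Fcattle-monitoring-system"
    "%2Francher-monitoring-coredns%2F0/targets"
)
# The encoding round-trips back to the original job name.
assert unquote(quote(job, safe="")) == job
```

Appending such a path to the allocator's ClusterIP (as in the `curl` above) returns the concrete scrape targets assigned to each collector instance.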

These metrics can then be searched in SUSE O11y.

Author: Warner Chen
Posted on: 2025-05-15
Updated on: 2025-05-29
