Version 39 (Mehdi Abaakouk, 07/09/2018 20:30) → Version 40/56 (Mehdi Abaakouk, 10/12/2018 12:23)
{{>toc}}
h1. Management Cluster Ceph
h2. Links
* [[Openstack Management TTNN]]
* [[Openstack Setup VM pas dans openstack]]
* [[Openstack Installation nouvelle node du cluster]]
* [[Openstack Installation TTNN]]
* "Openstack tools for ttnn":/projects/git-tetaneutral-net/repository/openstack-tools
h2. Adding a standard OSD
<pre>
$ ceph-disk prepare --zap-disk --cluster-uuid 1fe74663-8dfa-486c-bb80-3bd94c90c967 --fs-type=ext4 /dev/sdX
$ tune2fs -c 0 -i 0 -m 0 /dev/sdX1
$ smartctl --smart=on /dev/sdX # For monitoring.
</pre>
Retrieve the OSD id with the following command (it is the entry at the very bottom, not attached to the tree):
<pre>
ceph osd tree
</pre>
*BEGIN WORKAROUND FOR THE PREPARE BUG*
If the OSD is DOWN after the prepare step, this bug is most likely the cause.
ID is the first free OSD number counting from zero (shown at the bottom of ceph osd tree).
<pre>
mkdir /var/lib/ceph/osd/ceph-<ID>
chown ceph:ceph /var/lib/ceph/osd/ceph-<ID>
ceph-disk activate /dev/sd<X>1
systemctl status ceph-osd@<ID>
</pre>
*END WORKAROUND FOR THE PREPARE BUG*
For an HDD:
<pre>
$ ceph osd crush add osd.<ID> 0 root=default host=<host>
</pre>
For an SSD:
<pre>
$ ceph osd crush add osd.<ID> 0 root=ssd host=<host>-ssd
</pre>
Then allow Ceph to place data on it:
<pre>
$ /root/tools/ceph-reweight-osds.sh osd.<ID>
</pre>
h3. helper
<pre>
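add_osd(){
# Usage: add_osd <device> <crush root>; the second argument is passed as root= (default for an HDD, ssd for an SSD)
dev="$1"
type="$2"
host=$(hostname -s)
# SSD OSDs are attached to a separate "<host>-ssd" CRUSH host bucket
[ "$type" == "ssd" ] && host="${host}-ssd"
# Pick the first free OSD id, counting from zero
found=0 ; next=-1 ; for i in $(ceph osd ls); do next=$((next+1)) ; [ $(($i - $next)) -gt 0 ] && found=1 && break; done ; [ $found -eq 0 ] && next=$((next+1))
# Create the OSD directory up front (mirrors the prepare-bug workaround)
mkdir /var/lib/ceph/osd/ceph-$next
chown ceph:ceph /var/lib/ceph/osd/ceph-$next
ceph-disk prepare --zap-disk --filestore --cluster-uuid 1fe74663-8dfa-486c-bb80-3bd94c90c967 --fs-type=ext4 $dev
# Disable periodic fsck and reserved blocks on the data partition
tune2fs -c 0 -i 0 -m 0 ${dev}1
# Enable SMART for monitoring
smartctl --smart=on $dev
systemctl start ceph-osd@$next
systemctl status ceph-osd@$next
sleep 1
# Add the OSD to the CRUSH map with weight 0; reweight it afterwards to let data in
ceph osd crush add osd.$next 0 root=${type} host=${host}
}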
</pre>
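The one-liner in the helper that picks the first free OSD id is dense, so here is a minimal sketch of the same logic as a standalone function. The name @first_free_id@ is hypothetical, and the sketch assumes (as the helper does) that @ceph osd ls@ prints the existing ids in ascending order, one per line.

```shell
# Hypothetical helper: print the first unused OSD id, counting from zero.
# Reads the existing ids (ascending, one per line) on stdin.
first_free_id() {
    next=0
    while read -r id; do
        # A gap before this id means "next" is free
        [ "$id" -gt "$next" ] && break
        next=$(( id + 1 ))
    done
    echo "$next"
}

# Intended use on a cluster node (not run here):
#   ceph osd ls | first_free_id
```

Reading the ids from stdin keeps the function independent of the cluster, so it can be checked with hand-written input before being used for real.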
h2. Draining an OSD:
<pre>
vider_osd(){
name="$1"
ceph osd out ${name}
ceph osd crush reweight ${name} 0
ceph osd reweight ${name} 0
}
</pre>
h2. Removing an OSD:
<pre>
remove_osd(){
name="$1"
ceph osd out ${name}
systemctl stop ceph-osd@${name#osd.}
ceph osd crush remove ${name}
ceph auth del ${name}
ceph osd rm ${name}
ceph osd tree
}
</pre>
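Both functions expect the full OSD name (@osd.<ID>@), while the systemd unit only takes the numeric id. The short sketch below isolates the @${name#osd.}@ expansion used in @remove_osd@; the function name @osd_unit@ is hypothetical and only serves the illustration.

```shell
# Hypothetical helper: map an OSD name such as "osd.12" to its systemd unit.
# ${1#osd.} strips the leading "osd." prefix, leaving only the numeric id.
osd_unit() {
    printf 'ceph-osd@%s\n' "${1#osd.}"
}

osd_unit osd.12   # prints ceph-osd@12
```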
h2. Stopping recovery I/O:
<pre>
ceph osd set nobackfill
ceph osd set norebalance
ceph osd set norecover
</pre>
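These flags stay set until they are cleared with the matching @ceph osd unset@ commands. A minimal sketch of a toggle helper follows; the function name @recovery_flags@ and the @CEPH@ override variable are hypothetical, the latter only being there so the function can be dry-run with @CEPH=echo@.

```shell
# Hypothetical override: set CEPH=echo to print the commands instead of running them
CEPH="${CEPH:-ceph}"

# Hypothetical helper: apply "set" or "unset" to the three recovery-related flags
recovery_flags() {
    action="$1"   # "set" stops recovery I/O, "unset" lets it resume
    for flag in nobackfill norebalance norecover; do
        $CEPH osd "$action" "$flag"
    done
}

# Intended use on a cluster node (not run here):
#   recovery_flags set      # pause recovery I/O
#   recovery_flags unset    # resume it
```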
h2. Upgrade procedure
+
_*/!\ Read the release notes (they very often contain extra steps to perform) /!\*_+
h4. Upgrading the MONs:
Set the noout flag:
<pre>ceph osd set noout</pre>
On each MON (g1/g2/g3):
<pre>
apt-get upgrade -y
systemctl restart ceph-mon@g*
ceph -s
</pre>
Note that only restarting the 'leader/master' node causes a very brief interruption, and it often goes unnoticed.
h4. Upgrading the OSDs:
On each machine:
<pre>
apt-get upgrade -y
systemctl restart ceph-osd@*
</pre>
Then wait for recovery to finish before moving on to the next machine.
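The wait between machines can be scripted. The sketch below polls @ceph pg stat@ until its summary no longer mentions a state that suggests ongoing recovery. The function names and the list of state keywords are assumptions, not something this page defines, and @ceph health@ is deliberately not used because @noout@ keeps the cluster in a warning state during the upgrade.

```shell
# Hypothetical helper: succeed when the PG summary read from stdin mentions
# none of the states below. The keyword list is an assumption; adjust as needed.
pgs_settled() {
    ! grep -Eq 'degraded|undersized|peering|recover|backfill|remapped'
}

# Hypothetical helper: block until the cluster looks settled.
# Caveat: an empty output (e.g. a failing ceph command) also counts as settled.
wait_for_recovery() {
    until ceph pg stat | pgs_settled; do
        sleep 10
    done
}

# Intended use on a cluster node, after restarting the OSDs (not run here):
#   wait_for_recovery
```

Checking @ceph -s@ by eye before moving on remains a sensible safeguard, since the keyword match is only a heuristic.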
Once all OSDs have been upgraded and restarted, run:
<pre>ceph osd unset noout</pre>
h2. Cold replacement of a cache tier:
upstream doc: http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
<pre>
ceph osd tier cache-mode ec8p2c forward
rados -p ec8p2c cache-flush-evict-all
ceph osd tier remove-overlay ec8p2
ceph osd tier remove ec8p2 ec8p2c
rados rmpool ec8p2c ec8p2c --yes-i-really-really-mean-it
ceph osd pool create ec8p2c 128 128 replicated
ceph osd tier add ec8p2 ec8p2c
ceph osd tier cache-mode ec8p2c writeback
ceph osd tier set-overlay ec8p2 ec8p2c
ceph osd pool set ec8p2c size 3
ceph osd pool set ec8p2c min_size 2
ceph osd pool set ec8p2c hit_set_type bloom
ceph osd pool set ec8p2c hit_set_count 1
ceph osd pool set ec8p2c hit_set_period 3600
ceph osd pool set ec8p2c target_max_bytes 200000000000
ceph osd pool set ec8p2c target_max_objects 10000000
ceph osd pool set ec8p2c cache_target_dirty_ratio 0.4
ceph osd pool set ec8p2c cache_target_full_ratio 0.8
</pre>
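As a worked example of what the last four settings mean: according to the upstream cache-tiering documentation linked above, the dirty and full ratios are relative to @target_max_bytes@ and @target_max_objects@, and the agent acts as soon as either limit is reached. The sketch below computes the resulting thresholds; the function name @pct_of@ is hypothetical.

```shell
# Hypothetical helper: integer percentage of a value
pct_of() {
    echo $(( $1 * $2 / 100 ))
}

target_max_bytes=200000000000
target_max_objects=10000000

# cache_target_dirty_ratio 0.4: flushing of dirty objects starts here
echo "flush at $(pct_of "$target_max_bytes" 40) bytes or $(pct_of "$target_max_objects" 40) objects"
# cache_target_full_ratio 0.8: eviction of clean objects starts here
echo "evict at $(pct_of "$target_max_bytes" 80) bytes or $(pct_of "$target_max_objects" 80) objects"
```

With the values above, flushing starts around 80 GB (or 4 million objects) and eviction around 160 GB (or 8 million objects).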
h2. Adding an OSD that shares its SSD with the OS (OBSOLETE, NO LONGER COMPATIBLE WITH FUTURE CEPH VERSIONS)
Normally you give Ceph a whole disk and it creates two partitions: one for the OSD journal and one for the data.
For the tetaneutral SSD, which also hosts the OS, use the following method instead.
Manually create the Ceph data partition (/dev/sda2 here).
Debian (MBR format):
<pre>
apt-get install parted # partprobe is shipped in the parted package
fdisk /dev/sda
n
p
<enter>
<enter>
<enter>
<enter>
w
$ partprobe
</pre>
Ubuntu (GPT format):
<pre>
# parted /dev/sdb
GNU Parted 2.3
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ATA SAMSUNG MZ7KM480 (scsi)
Disk /dev/sdb: 480GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 20.0GB 20.0GB primary ext4 boot
2 20.0GB 36.0GB 16.0GB primary linux-swap(v1)
(parted) mkpart
Partition type? primary/extended?
Partition type? primary/extended? primary
File system type? [ext2]? xfs
Start?
Start? 36.0GB
End? 100%
(parted) print
Model: ATA SAMSUNG MZ7KM480 (scsi)
Disk /dev/sdb: 480GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
1 1049kB 20.0GB 20.0GB primary ext4 boot
2 20.0GB 36.0GB 16.0GB primary linux-swap(v1)
3 36.0GB 480GB 444GB primary
(parted) quit
Information: You may need to update /etc/fstab.
</pre>
Prepare the disk as usual:
<pre>
ceph-disk prepare --fs-type=ext4 --cluster-uuid 1fe74663-8dfa-486c-bb80-3bd94c90c967 /dev/sda2
ceph-disk activate /dev/sda2
ceph osd crush add osd.<ID> 0 root=ssd host=g3-ssd
</pre>
Then allow Ceph to place data on it:
<pre>
$ /root/tools/ceph-reweight-osds.sh osd.<ID>
</pre>
h2. inconsistent pg
* Analysis of a consistency error detected by Ceph
** https://lists.tetaneutral.net/pipermail/technique/2017-August/002859.html
<pre>
root@g1:~# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 58.22d is active+clean+inconsistent, acting [9,47,37]
2 scrub errors
root@g1:~# rados list-inconsistent-obj 58.22d --format=json-pretty
{
"epoch": 269000,
"inconsistents": [
{
"object": {
"name": "rbd_data.11f20f75aac8266.00000000000f79f9",
"nspace": "",
"locator": "",
"snap": "head",
"version": 9894452
},
"errors": [
"data_digest_mismatch"
],
"union_shard_errors": [
"data_digest_mismatch_oi"
],
"selected_object_info":
"58:b453643a:::rbd_data.11f20f75aac8266.00000000000f79f9:head(261163'9281748 osd.9.0:6221608 dirty|data_digest|omap_digest s 4194304 uv 9894452 dd 2193d055 od ffffffff alloc_hint [0 0])",
"shards": [
{
"osd": 9,
"errors": [],
"size": 4194304,
"omap_digest": "0xffffffff",
"data_digest": "0x2193d055"
},
{
"osd": 37,
"errors": [
"data_digest_mismatch_oi"
],
"size": 4194304,
"omap_digest": "0xffffffff",
"data_digest": "0x05891fb4"
},
{
"osd": 47,
"errors": [],
"size": 4194304,
"omap_digest": "0xffffffff",
"data_digest": "0x2193d055"
}
]
}
]
}
root@g1:~# ceph osd map disks rbd_data.11f20f75aac8266.00000000000f79f9
osdmap e269110 pool 'disks' (58) object 'rbd_data.11f20f75aac8266.00000000000f79f9' -> pg 58.5c26ca2d (58.22d) -> up ([9,47,37], p9) acting ([9,47,37], p9)
root@g8:/var/lib/ceph/osd/ceph-9/current/58.22d_head# find . -name '*11f20f75aac8266.00000000000f79f9*'
./DIR_D/DIR_2/DIR_A/DIR_C/rbd\udata.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a
root@g10:/var/lib/ceph/osd/ceph-37/current/58.22d_head# find . -name '*11f20f75aac8266.00000000000f79f9*'
./DIR_D/DIR_2/DIR_A/DIR_C/rbd\udata.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a
$ scp g8:/var/lib/ceph/osd/ceph-9/current/58.22d_head/DIR_D/DIR_2/DIR_A/DIR_C/rbd*data.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a g8data
$ scp g10:/var/lib/ceph/osd/ceph-37/current/58.22d_head/DIR_D/DIR_2/DIR_A/DIR_C/rbd*data.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a g10data
$ md5sum *
bd85c0ef1f30829ce07e5f9152ac2d2f g10data
4297d0bc373e6603e0ad842702e0ecaa g8data
$ diff -u <(od -x g10data) <(od -x g8data)
--- /dev/fd/63 2017-08-13 10:43:52.837097740 +0200
+++ /dev/fd/62 2017-08-13 10:43:52.833097808 +0200
@@ -2617,7 +2617,7 @@
0121600 439b 14f4 bb4c 5f14 6ff7 4393 9ff8 a9a9
0121620 29a8 56a4 1133 b6a8 2206 4821 2f42 4b2c
0121640 3d86 41a2 785f 9785 8b48 4243 e7b9 f0aa
-0121660 29b6 be0c 0455 bf97 1c0d 49e5 75dd e1ed
+0121660 29a6 be0c 0455 bf97 1c0d 49e5 75dd e1ed
0121700 2519 d6ac 1047 1111 0344 38be 27a1 db07
0121720 dff6 c002 75d8 4396 6154 eba9 3abd 5d20
0121740 8ae4 e63a 298b d754 0208 9705 1bb8 3685
</pre>
So this is a single bit flip: 29b6 vs 29a6.
<pre>
>>> bin(0xa)
'0b1010'
>>> bin(0xb)
'0b1011'
</pre>
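The same check can be done directly in the shell by XOR-ing the two words and counting the set bits. This is a minimal sketch; the function name @bit_diff@ is hypothetical. The last line converts the @od@ offset of the differing line, which @od@ prints in octal by default, to a decimal byte offset.

```shell
# Hypothetical helper: count the bits that differ between two integers
bit_diff() {
    x=$(( $1 ^ $2 ))
    n=0
    while [ "$x" -ne 0 ]; do
        n=$(( n + (x & 1) ))
        x=$(( x >> 1 ))
    done
    echo "$n"
}

printf '%x\n' $(( 0x29b6 ^ 0x29a6 ))   # prints 10 (hex): a single bit is set
bit_diff 0x29b6 0x29a6                 # prints 1
echo $(( 0121660 ))                    # prints 41904, the offset of that od line in bytes
```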
* http://cephnotes.ksperis.com/blog/2013/08/20/ceph-osd-where-is-my-data
* https://superuser.com/questions/969889/what-is-the-granularity-of-a-hard-disk-ure-unrecoverable-read-error