root@utumno:~# fio ~/rw4k
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/49885K /s] [0 /12.2K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1770
  write: io=8192.3MB, bw=47666KB/s, iops=11916, runt=175991msec
  cpu          : usr=4.33%, sys=14.28%, ctx=2071968, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0

Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=47666KB/s, minb=48810KB/s, maxb=48810KB/s, mint=175991msec, maxt=175991msec

Disk stats (read/write):
  sdb: ios=69/2097888, merge=0/3569, ticks=0/11243992, in_queue=11245600, util=99.99%

With bcache added:
root@utumno:~# fio ~/rw4k
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/75776K /s] [0 /18.5K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1914
  write: io=8192.3MB, bw=83069KB/s, iops=20767, runt=100987msec
  cpu          : usr=3.17%, sys=13.27%, ctx=456026, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/2097215/0, short=0/0/0

Run status group 0 (all jobs):
  WRITE: io=8192.3MB, aggrb=83068KB/s, minb=85062KB/s, maxb=85062KB/s, mint=100987msec, maxt=100987msec

Disk stats (read/write):
  bcache0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

With bcache the test sustains 18.5K IOPS, versus 12.2K IOPS on the raw SSD. bcache comes out ahead because it sends the writes to the SSD sequentially, at the cost of the extra index updates; random writes are exactly what bcache is optimized for. bcache also benefits from the high IO depth (64): at high depth it can merge many index updates into a single write. A high IO depth corresponds to a heavily loaded system; when the IO depth is lowered, the IOPS change (a reconstruction of the fio job file follows the list):
IO depth of 32: bcache 20.3k iops, raw ssd 19.8k iops
IO depth of 16: bcache 16.7k iops, raw ssd 23.5k iops
IO depth of 8: bcache 8.7k iops, raw ssd 14.9k iops
IO depth of 4: bcache 8.9k iops, raw ssd 19.7k iops
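The job file itself is not shown in the output above. A minimal reconstruction of what ~/rw4k might contain, based only on the parameters visible in the fio banner (the direct, size and filename values are assumptions, not taken from the post), would be:

; hypothetical reconstruction of ~/rw4k, based only on the output above
; (direct=1, size and filename are assumptions)
[randwrite]
rw=randwrite
bs=4k
ioengine=libaio
; iodepth=64 for the headline numbers; varied down to 4 for the sweep above
iodepth=64
direct=1
size=8g
; target /dev/sdb for the raw-SSD runs, /dev/bcache0 for the bcache runs
filename=/dev/sdb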
The SSD's performance fluctuates quite a bit across IO depths. Different write patterns give different results, so only the relative numbers matter here. For random 4K reads:
IO depth of 64: bcache 29.5k iops, raw ssd 25.4k iops
IO depth of 16: bcache 28.2k iops, raw ssd 27.6k iops
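The read numbers presumably come from the same job with the direction flipped; a minimal command-line equivalent (again with assumed device and size) would be:

# hypothetical 4K random-read counterpart of the write job above
fio --name=randread --rw=randread --bs=4k --ioengine=libaio \
    --iodepth=64 --direct=1 --size=8g --filename=/dev/bcache0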
In these read tests bcache comes out slightly ahead, which may depend on the data being read. The takeaway is that for random reads bcache performs about the same as the raw SSD.

Note that reading back data that was written as 4K random writes is an unfavourable pattern for bcache: it means every extent in the btree is 4K, so the btree is much larger than it would normally be. In real-world use the average extent size is around 100K. A larger btree means the index takes up more memory, with part of it sitting in the second-level index. In my experience this overhead only starts to matter on large machines doing more than 500K IOPS.

If there are other benchmarks you would like to see, or if you spot a problem with my methodology, please let me know by email.

At boot, bcache devices are registered by a udev rule, which only registers a backing device if blkid identifies it as bcache:

# cat /usr/lib/udev/rules.d/61-bcache.rules
....
# Backing devices: scan, symlink, register
IMPORT{program}="/sbin/blkid -o udev $tempnode"
# blkid and probe-bcache can disagree, in which case don't register
ENV{ID_FS_TYPE}=="?*", ENV{ID_FS_TYPE}!="bcache", GOTO="bcache_backing_end"
...

# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID
NAME        MAJ:MIN RM   SIZE TYPE FSTYPE  MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat    /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part ext2               93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs               FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache             4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs              2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache             ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs              2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4    /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318dd

In the example above, the sda2 partition previously held an ext2 filesystem. The bcache devices were created with:
# make-bcache -B /dev/sdc /dev/sdd -C /dev/sda2

Because /dev/sdc and /dev/sdd are identified as bcache members, they are registered automatically at boot, whereas /dev/sda2 has to be registered by hand: a stale superblock from the previous filesystem is still present at offset 1024 of /dev/sda2, while the bcache metadata starts at offset 4096.
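Until the stale superblock is removed, the partition can still be registered manually after each boot through the kernel's bcache sysfs interface; a one-off workaround using this example's device path:

# manually register a device the udev rule skipped (not persistent,
# has to be repeated after every boot)
echo /dev/sda2 > /sys/fs/bcache/register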
The permanent fix is to overwrite the stale superblock:

# dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sda2

This zeroes the 1024-byte block at offset 1024, removing the old filesystem signature while leaving the bcache superblock at offset 4096 untouched.
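To check which signatures are visible before and after the dd, wipefs can list them without erasing anything (a sketch; the exact output depends on the system):

# list filesystem signatures on the partition; after the dd above
# only the bcache signature should remain
wipefs /dev/sda2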
After a reboot, all disks are recognized correctly:

# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID
NAME        MAJ:MIN RM   SIZE TYPE FSTYPE  MOUNTPOINT UUID                                 PARTUUID
sda           8:0    0 111.8G disk
├─sda1        8:1    0     3G part vfat    /esp       7E67-C0BB                            d39828e8-4880-4c85-9ec0-4255777aa35b
└─sda2        8:2    0 108.8G part bcache             93d22899-cd86-4815-b6d0-d72006201e75 baf812f4-9b80-42c4-b7ac-5ed0ed19be65
  ├─bcache0 254:0    0   2.7T disk btrfs              2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
  └─bcache1 254:1    0   2.7T disk btrfs              2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdb           8:16   0 931.5G disk
└─sdb1        8:17   0 931.5G part ntfs               FAD2B75FD2B71EB7                     90c80e9d-f31a-41b4-9d4d-9b02029402b2
sdc           8:32   0   2.7T disk bcache             4bd63488-e1d7-4858-8c70-a35a5ba2c452
└─bcache1   254:1    0   2.7T disk btrfs              2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sdd           8:48   0   2.7T disk bcache             ce6de517-7538-45d6-b8c4-8546f13f76c1
└─bcache0   254:0    0   2.7T disk btrfs              2ff19aaf-852e-4c58-9eee-3daecbc6a5a1
sde           8:64   1  14.9G disk
└─sde1        8:65   1  14.9G part ext4    /          d07321b2-b67d-4daf-8022-f3307b605430 5d0a4d76-115f-4081-91ed-fb09aa2318dd

In the same way, a stale superblock can cause other, similar errors.
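When a member device still behaves oddly, its bcache superblock can be inspected directly with bcache-super-show from bcache-tools; a sketch using a device from this example:

# dump the bcache superblock of a member device; on a backing device
# cset.uuid should match the cache set it is meant to attach to
bcache-super-show /dev/sdc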