
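Chapter 12: Regular Expressions and File Formatting

Tags (space-separated): 鳥哥的linux私房菜
1 What is a regular expression
What is a regular expression
Uses of regular expressions
2 Basic regular expressions
grep
Basic regular expression exercises
The sed tool
3 Extended regular expressions
4 File formatting and related processing
Formatted printing: printf
awk: a handy data-processing tool
File comparison tools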

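12.1 What Is a Regular Expression

Regular Expression (RE)

What is a regular expression

Simply put, a regular expression is a way of processing strings. It works one line at a time and, with the help of a few special symbols, lets users easily search for, delete, or replace specific strings.
A regular expression is essentially a notation: any tool that supports the notation can use it for string processing. Examples are grep, sed, vi, and awk. Commands such as ls and cp, by contrast, only support bash's own wildcards.

Uses of regular expressions

A system produces a large amount of information every day, far too much to read in full, so system administrators use regular expressions to pull out the information they need.
Regular expressions and wildcards are completely different things.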
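
To make the difference concrete, here is a minimal sketch. The file names (ab, aab, b.txt) and the scratch directory are made up for illustration. The same text `a*` is expanded by the shell as a wildcard in one case and interpreted by grep as a regular expression in the other.

```shell
# Scratch directory with three hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/ab" "$workdir/aab" "$workdir/b.txt"

# Wildcard: the shell expands a* to the file NAMES that start with "a".
glob_matches=$(cd "$workdir" && echo a*)

# Regular expression: grep reads a* as "zero or more a", which every
# line satisfies, including the line that contains no "a" at all.
regex_matches=$(printf 'ab\naab\nb.txt\n' | grep -c 'a*')

echo "wildcard a* -> $glob_matches"
echo "regex a*    -> $regex_matches matching lines"

rm -rf "$workdir"
```

The wildcard selects two file names, while the regular expression matches all three lines.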

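12.2 Basic Regular Expressions

grep

Prints the lines that match a pattern
-A n (after): also print the n lines after each matching line
-B n (before): also print the n lines before each matching line
-n: show line numbers
-v: invert the match (print the lines that do not match)
-i: ignore case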
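
The options above can be tried on a small file. This is only a sketch: the sample lines are invented, and the temporary file stands in for a real log.

```shell
# Hypothetical sample data written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'alpha' 'Error: disk full' 'beta' 'gamma' 'error: retry' 'delta' > "$sample"

# -n prefixes each match with its line number; -i ignores case.
with_numbers=$(grep -n -i 'error' "$sample")

# -v inverts the match; -c counts lines instead of printing them.
non_matching=$(grep -v -i -c 'error' "$sample")

# -B1 and -A1 add one line of context before and after each match.
context=$(grep -n -B1 -A1 'Error' "$sample")

printf '%s\n' "$with_numbers" "---" "$non_matching" "---" "$context"
rm -f "$sample"
```

With -n, matching lines are marked with `:` after the line number and context lines with `-`.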

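Basic regular expression exercises

The exercises assume that:
the locale has been set with

export LANG=C

and that grep has been aliased to

grep --color=auto

[code]wget http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt[/code]
This command downloads VBird's practice text.
Use brackets [] to search for a set of characters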
[code]hanzhou@hanzhou-VirtualBox:~/main$ grep -n 't[ae]st' regular_express.txt 
8:I can't finish the test.
9:Oh! The soup taste good.

This finds 'tast' and 'test'.
Now we want 'oo' but not 'goo'.
[code]hanzhou@hanzhou-VirtualBox:~/main$ grep -n '[^g]oo' regular_express.txt 
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!

Line 19 matches because of 'ooo' (an 'o' precedes 'oo'), not because of 'goo'.
Now we do not want a lowercase letter before 'oo'.
[code]hanzhou@hanzhou-VirtualBox:~/main$ grep -n '[^a-z]oo' regular_express.txt 
3:Football game is not use feet only.

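When the characters in a set are consecutive, such as uppercase letters, lowercase letters, or digits, we can write them as [A-Z], [a-z], [0-9].
Character order can differ slightly between locales, so the same ranges can also be written with character classes:
'[^[:lower:]]oo'
 [[:digit:]]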
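
A short sketch of the two character classes in use. The sample lines are invented, and LC_ALL=C is set per command only to keep the result independent of the local environment.

```shell
# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'Football game' 'good food' 'room 42' '9oo' > "$sample"

# "oo" preceded by any character that is NOT a lowercase letter.
not_lower=$(LC_ALL=C grep -n '[^[:lower:]]oo' "$sample")

# Lines that contain at least one digit.
has_digit=$(LC_ALL=C grep -n '[[:digit:]]' "$sample")

printf '%s\n' "$not_lower" "---" "$has_digit"
rm -f "$sample"
```

Note the doubled brackets: `[:lower:]` is only valid inside a bracket expression.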

Line-start and line-end anchors: ^ and $
Find lines that begin with 'the'
[code]hanzhou@hanzhou-VirtualBox:~/main$ grep -n '^the' regular_express.txt 
12:the symbol '*' is represented as start.

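Find lines that begin with a lowercase letter
[code]grep -n '^[[:lower:]]' regular_express.txt  ## or '^[a-z]'

Find lines that do not begin with a lowercase letter
[code]grep -n '^[^[:lower:]]' regular_express.txt  ## or '^[^a-z]'

Notes
 1. In [[:lower:]], the inner brackets are part of the character-class name, and the outer brackets form the set that matches any one of the characters listed inside it. Both pairs are required.
 2. Outside the outer brackets, ^ anchors the match to the start of the line; as the first character inside them, it negates the set.
Find lines that end with a period (.)
[code]grep -n  '\.$' regular_express.txt   ## the dot has a special meaning, so it is escaped with a backslash

Find blank lines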
[code]grep -n  '^$' regular_express.txt

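Any single character (.) and the repetition character (*)
Unlike the wildcard, * does not stand for arbitrary characters. It means "repeat the preceding RE character zero or more times", so it always works in combination with the character before it.
. (dot) means exactly one arbitrary character.
Find strings of the form g??d (using the dot)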
[code]hanzhou@hanzhou-VirtualBox:~/main$ grep -n 'g..d' regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".

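Using *
 grep -n 'o*' regular_express.txt
 lists every line, because 'o*' matches the empty string as well as one or more o's
 grep -n 'oo*' regular_express.txt
 lists the lines that contain at least one o
 grep -n 'ooo*' regular_express.txt
 lists the lines that contain at least two consecutive o's
Summary: to find the lines that contain one or more of some character (X)
 grep -n 'XX*' filename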
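
The three patterns can be compared by counting matches on a tiny, made-up word list (a sketch, not the practice file):

```shell
# Four hypothetical lines with 0, 1, 2 and 3 consecutive o's.
sample=$(mktemp)
printf '%s\n' 'sky' 'dog' 'food' 'goooal' > "$sample"

any=$(grep -c 'o*' "$sample")          # zero or more o: every line matches
at_least_one=$(grep -c 'oo*' "$sample")  # one o, then zero or more
at_least_two=$(grep -c 'ooo*' "$sample") # two o's, then zero or more

echo "o*=$any oo*=$at_least_one ooo*=$at_least_two"
rm -f "$sample"
```

Each extra literal `o` in front of `o*` raises the minimum number of o's by one.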

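Limiting the number of repetitions with {}
Find strings with two consecutive o's. In basic regular expressions the braces must be escaped with backslashes.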
[code]hanzhou@hanzhou-VirtualBox:~/main$ grep -n 'o\{2\}' regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

Find strings in which g is followed by 2 to 5 o's and then another g
[code]hanzhou@hanzhou-VirtualBox:~/main$ grep -n 'go\{2,5\}g' regular_express.txt # o\{2,\} means two or more o's
18:google is the best tools for search keyword.
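
The bounds can be checked on a few invented strings. This sketch also shows the same pattern with grep -E, where the braces are written without backslashes.

```shell
# Hypothetical lines with 1, 2, 5 and 6 o's between the two g's.
sample=$(mktemp)
printf '%s\n' 'gog' 'goog' 'gooooog' 'goooooog' > "$sample"

# Basic RE: 2 to 5 o's, braces escaped.
bre=$(grep -n 'go\{2,5\}g' "$sample")

# Extended RE (grep -E): same meaning, braces unescaped.
ere=$(grep -n -E 'go{2,5}g' "$sample")

# Open upper bound: two or more o's.
two_or_more=$(grep -c 'go\{2,\}g' "$sample")

printf '%s\n' "$bre" "---" "$ere" "---" "$two_or_more"
rm -f "$sample"
```

The line with six o's is rejected by the 2 to 5 range but accepted once the upper bound is left open.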

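The sed tool
sed is itself a pipeline command that can process standard input. It can replace, delete, add, and select specific lines of data.
[code]sed [-nefr] [action]
Options
-n: quiet (silent) mode. By default sed prints every input line to the screen; with -n, only the lines that sed has explicitly processed (for example with the p action) are printed
-e: give the sed action directly on the command line
-f: read the sed actions from a file; -f filename runs the actions stored in filename
-r: use extended regular expression syntax in the sed action (basic regular expression syntax is the default)
-i: modify the contents of the file being read in place instead of printing to the screen

Action format: [n1[,n2]]function
n1,n2: optional; usually the line numbers the action applies to, e.g. "10,20" means lines 10 through 20
a: append, adds a new line after the current line
c: change, replaces the selected lines
d: delete
i: insert, adds a new line before the current line
p: print
s: substitute, e.g. 1,20s/old/new/g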
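
Because the effect of -n is easy to miss, here is a small sketch on five invented lines. It compares the p action with and without -n and shows a ranged d for contrast.

```shell
# Hypothetical five-line input.
sample=$(mktemp)
printf '%s\n' one two three four five > "$sample"

# Without -n, sed prints every line AND the p action prints lines 2-3 again,
# so 7 lines come out in total.
lines_without_n=$(sed '2,3p' "$sample" | grep -c '')

# With -n, only the lines selected by p are printed.
only_selected=$(sed -n '2,3p' "$sample")

# d removes the addressed range from the output.
after_delete=$(sed '2,4d' "$sample")

printf '%s\n' "$lines_without_n" "---" "$only_selected" "---" "$after_delete"
rm -f "$sample"
```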

Delete lines 2-5
hanzhou@hanzhou-VirtualBox:/etc$ nl passwd | sed '2,5d'

Delete line 2
hanzhou@hanzhou-VirtualBox:/etc$ nl passwd | sed '2d'

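Delete from line 3 through the last line
hanzhou@hanzhou-VirtualBox:/etc$ nl passwd | sed '3,$d'
 $ stands for the last line
Add the text "drink tea?" after line 2 (so that it becomes line 3)
[code]hanzhou@hanzhou-VirtualBox:/etc$ nl passwd | sed '2a drink tea?' ## 2i would insert before line 2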
     1  root:x:0:0:root:/root:/bin/bash
     2  daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
drink tea?
     3  bin:x:2:2:bin:/bin:/usr/sbin/nologin
     4  sys:x:3:3:sys:/dev:/usr/sbin/nologin
…… ## remaining output omitted

Add multiple lines
[code]hanzhou@hanzhou-VirtualBox:~$ nl /etc/passwd | sed '2i drink tea or \
drink beer ?'
     1  root:x:0:0:root:/root:/bin/bash
drink tea or 
drink beer ?
     2  daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin

Replace whole lines
[code]hanzhou@hanzhou-VirtualBox:~$ nl /etc/passwd | sed '2,5c No 2-5 number'
     1  root:x:0:0:root:/root:/bin/bash
No 2-5 number
     6  games:x:5:60:games:/usr/games:/usr/sbin/nologin
     7  man:x:6:12:man:/var/cache/man:/usr/sbin/nologin

Print selected lines (requires -n, quiet mode)
[code]hanzhou@hanzhou-VirtualBox:~$ nl /etc/passwd | sed -n  '2,5p'
     2  daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
     3  bin:x:2:2:bin:/bin:/usr/sbin/nologin
     4  sys:x:3:3:sys:/dev:/usr/sbin/nologin
     5  sync:x:4:65534:sync:/bin:/bin/sync

Search and replace on part of the data
General form: sed 's/old_word/new_word/g'
[code]hanzhou@hanzhou-VirtualBox:~$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
hanzhou@hanzhou-VirtualBox:~$ echo $PATH |sed 's/usr/user/g'
/user/local/sbin:/user/local/bin:/user/sbin:/user/bin:/sbin:/bin:/user/games:/user/local/games

sed can modify a file in place with -i (a dangerous operation)
Change the period at the end of each line into an exclamation mark
[code]hanzhou@hanzhou-VirtualBox:~/main$ sed -i 's/\.$/!/g' regular_express.txt

Append '# This is a test' after the last line
[code]hanzhou@hanzhou-VirtualBox:~/main$ sed -i '$a # This is a test' regular_express.txt
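
Since -i overwrites the file, one cautious workflow is to preview the substitution first and then keep a backup. The sketch below assumes a sed that accepts a backup suffix attached to -i (GNU sed does); notes.txt is a made-up file.

```shell
# Hypothetical file in a scratch directory.
work=$(mktemp -d)
printf '%s\n' 'Hello.' 'No period here' 'Bye.' > "$work/notes.txt"

# Step 1: preview on standard output; the file is not touched.
preview=$(sed 's/\.$/!/' "$work/notes.txt")

# Step 2: edit in place, keeping the original as notes.txt.bak.
sed -i.bak 's/\.$/!/' "$work/notes.txt"

edited=$(cat "$work/notes.txt")
backup=$(cat "$work/notes.txt.bak")

printf '%s\n' "$edited" "---" "$backup"
rm -rf "$work"
```

If the result is wrong, the .bak copy can simply be moved back over the edited file.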

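12.3 Extended Regular Expressions
Extended regular expressions can simplify commands. For example, to filter out blank lines and lines that begin with #, we would use:
grep -v '^$' regular_express.txt | grep -v '^#'
 
which can be simplified to:
egrep -v '^$|^#' regular_express.txt
 
egrep and grep -E are equivalent, much like a command alias

+ : one or more of the preceding RE character, e.g. o+
? : zero or one of the preceding RE character, e.g. o?
| : match any one of several strings (or)
() : match a group of strings
To find good and glad, use egrep -n 'g(oo|la)d'

()+ : match one or more repetitions of a group
! (exclamation mark) is not a special character in regular expressions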
[code]hanzhou@hanzhou-VirtualBox:~/main$ echo 'AxyzC' | egrep 'A(xyz)?C'
AxyzC
hanzhou@hanzhou-VirtualBox:~/main$ echo 'AC' | egrep 'A(xyz)?C'
AC
hanzhou@hanzhou-VirtualBox:~/main$ echo 'AxyzxyzxyzxyzC' | egrep 'A(xyz)+C'
AxyzxyzxyzxyzC
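
The operators above can be exercised together on a few invented lines. This sketch uses grep -E, which the text treats as equivalent to egrep.

```shell
# Hypothetical input: a comment, a blank line, and some test words.
sample=$(mktemp)
printf '%s\n' '# comment' '' 'good' 'glad' 'gd' 'AC' 'AxyzxyzC' > "$sample"

# | : drop blank lines and comment lines in a single pass.
kept=$(grep -c -v -E '^$|^#' "$sample")

# () with | : match either "good" or "glad".
group=$(grep -n -E 'g(oo|la)d' "$sample")

# ()? : the group may appear zero times or once.
optional=$(grep -E '^A(xyz)?C$' "$sample")

# ()+ : the group must appear one or more times.
repeated=$(grep -E '^A(xyz)+C$' "$sample")

printf '%s\n' "$kept" "$group" "$optional" "$repeated"
rm -f "$sample"
```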

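12.4 File Formatting and Related Processing
 With a few simple operations we can control the output format of a file without opening it in vim, by combining data-stream redirection with the printf and awk commands.
Formatted printing: printf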
[code]hanzhou@hanzhou-VirtualBox:~/main$ printf '%10s\t %5i\t %5i\t %5i\t %8.2f\t \n' $(cat printf_test.txt|grep -v Name)
    DmTsai      80      60      92      77.33    
     VBird      75      55      80      70.00    
       Ken      60      90      70      73.33    
hanzhou@hanzhou-VirtualBox:~/main$ cat printf_test.txt 
Name     Chinese   English   Math    Average
DmTsai        80        60     92      77.33
VBird         75        55     80      70.00
Ken           60        90     70      73.33
hanzhou@hanzhou-VirtualBox:~/main$ printf '%s\t %s\t %s\t %s\t %s\t \n' $(cat printf_test.txt)
Name     Chinese     English     Math    Average     
DmTsai   80  60  92  77.33   
VBird    75  55  80  70.00   
Ken  60  90  70  73.33   

Print the character whose hexadecimal value is 45
[code]hanzhou@hanzhou-VirtualBox:~/main$ printf '\x45\n'
E

\x introduces a hexadecimal value
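
One detail the printf examples above rely on is that the format string is reused until all arguments are consumed, which is why a single format can print several rows. A minimal sketch with made-up names and scores:

```shell
# Use the C locale so 92.5 is parsed with a dot as the decimal separator.
export LC_ALL=C

# Six arguments, three conversions: the format is applied twice.
# %-8s  left-aligned string, width 8
# %5d   right-aligned integer, width 5
# %8.2f floating point, width 8, two decimals
table=$(printf '%-8s|%5d|%8.2f\n' Alice 80 92.5 Bob 7 60)

printf '%s\n' "$table"
```

The `|` characters are only there to make the column widths visible.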
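awk: a handy data-processing tool
 Whereas sed usually works on a whole line at a time, awk tends to split a line into several "fields" and process those. awk is therefore well suited to handling small data sets.
 awk usage
[code]awk 'condition1{action1} condition2{action2} …' filename

Example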
[code]hanzhou@hanzhou-VirtualBox:~/main$ last -n 5 
hanzhou  pts/1        :0               Fri Apr 22 10:29   still logged in   
hanzhou  :0           :0               Fri Apr 22 10:29   still logged in   
reboot   system boot  4.2.0-34-generic Fri Apr 22 10:29 - 16:54  (06:24)    
hanzhou  pts/1        :0               Thu Apr 21 16:30 - crash  (17:58)    
hanzhou  pts/5        :0               Wed Apr 20 17:30 - 17:08  (23:38)    

wtmp begins Fri Apr  1 15:13:50 2016
hanzhou@hanzhou-VirtualBox:~/main$ last -n 5 | awk '{print $1 "\t" $3}'
hanzhou :0
hanzhou :0
reboot  boot
hanzhou :0
hanzhou :0

wtmp    Fri
hanzhou@hanzhou-VirtualBox:~/main$

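By default awk uses spaces or [tab] as the field separator
$1 and $3 stand for the first and third fields
$0 stands for the whole line
awk built-in variables
NF: the number of fields in the current line ($0)
NR: the number of the line awk is currently processing
FS: the current field separator, whitespace by default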
[code]hanzhou@hanzhou-VirtualBox:~/main$ last -n 5 | awk '{print $1 "\t lines:" NR "\t colume: " NF}'
hanzhou  lines:1     colume: 10
hanzhou  lines:2     colume: 10
hanzhou  lines:3     colume: 10
reboot   lines:4     colume: 11
hanzhou  lines:5     colume: 10
     lines:6     colume: 0
wtmp     lines:7     colume: 7

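NF, NR, and FS are written without a leading $. Inside the single-quoted awk program, single quotes cannot be nested, so use double quotes for strings
awk logical operators
[code]hanzhou@hanzhou-VirtualBox:~/main$ cat /etc/passwd | awk 'BEGIN {FS=":"} $3 < 10 {print $1} '

Notes: 1. BEGIN sets the field separator before any input is read; without BEGIN, the first line is still split with the default separator;
       2. $3 < 10 works like a WHERE condition in SQL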
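
The role of BEGIN can be seen on a three-line stand-in for /etc/passwd (the temporary file below is an assumption for the sake of a self-contained example). The -F option is shown as an equivalent way to set the separator before any input is read.

```shell
# Hypothetical passwd-style data.
sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'hanzhou:x:1000:1000::/home/hanzhou:/bin/bash' > "$sample"

# BEGIN runs before the first line is read, so every line is split on ":".
with_begin=$(awk 'BEGIN {FS=":"} $3 < 10 {print $1}' "$sample")

# -F sets FS from the command line, also before any input is read.
with_option=$(awk -F: '$3 < 10 {print $1}' "$sample")

# Without BEGIN, FS changes only after line 1 has already been split
# on whitespace, so line 1 is handled differently from the rest.
without_begin=$(awk '{FS=":"} $3 < 10 {print $1}' "$sample")

printf '%s\n' "$with_begin" "---" "$without_begin"
rm -f "$sample"
```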
　　  
Add a total column to the text
[code]$ cat pay.txt |  awk 'NR==1{printf "%10s %10s %10s %10s %10s\n",$1,$2,$3,$4,"Total" } 
NR>=2{total = $2 + $3 + $4 
printf "%10s %10d %10d %10d %10.2f\n", $1, $2, $3, $4, total}'
      Name        1st        2nd        3th      Total
     VBird      23000      24000      25000   72000.00
    DMTsai      21000      20000      23000   64000.00
     Bird2      43000      42000      41000  126000.00

$ cat pay.txt | awk 'NR==1{printf "%10s %10s %10s %10s %10s\n",$1,$2,$3,$4,"Total" } ;NR>=2{total = $2 + $3 + $4 ;printf "%10s %10d %10d %10d %10.2f\n", $1, $2, $3, $4, total}'
      Name        1st        2nd        3th      Total
     VBird      23000      24000      25000   72000.00
    DMTsai      21000      20000      23000   64000.00
     Bird2      43000      42000      41000  126000.00

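When an awk action (the part inside {}) needs several commands, separate them with a semicolon ";" or simply put each command on its own line with [Enter]
In formatted output, the printf format string must include \n, otherwise no line break is produced!
Unlike bash shell variables, awk variables are used directly, without a $ prefix.
The awk action inside {} also supports if (condition). For example, the command above can be rewritten like this:
[code]$ cat pay.txt  | awk '{if (NR==1) printf "%10s %10s %10s %10s %10s\n",$1,$2,$3,$4,"Total" };NR > 1 {total = $2+$3+$4;printf "%10s %10d %10d %10d %10.2f\n", $1, $2, $3, $4, total}'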
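
As a variation, the header and data rows can both be handled in one action with if/else. This is only a sketch: the temporary file reproduces two rows of the pay.txt data shown earlier, and the shortened two-column layout is an illustrative choice.

```shell
# Stand-in for pay.txt with the header and two data rows.
pay=$(mktemp)
printf '%s\n' 'Name 1st 2nd 3th' 'VBird 23000 24000 25000' 'DMTsai 21000 20000 23000' > "$pay"

report=$(awk '{
    if (NR == 1) {
        # Header row: print the first column name and a "Total" label.
        printf "%-8s %8s\n", $1, "Total"
    } else {
        # Data rows: awk variables such as total need no $ prefix.
        total = $2 + $3 + $4
        printf "%-8s %8d\n", $1, total
    }
}' "$pay")

printf '%s\n' "$report"
rm -f "$pay"
```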

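File comparison tools
diff 
Usually used to compare the old and new versions of the same file (or software).
git diff
cmp: cmp compares byte by byte, whereas diff compares line by line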
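
A minimal sketch of the difference, using two invented three-line files. Both tools exit with status 1 when the files differ, so the status is captured explicitly.

```shell
# Two hypothetical versions of a file that differ on line 2.
work=$(mktemp -d)
printf '%s\n' 'line one' 'line two' 'line three' > "$work/old.txt"
printf '%s\n' 'line one' 'line 2' 'line three' > "$work/new.txt"

# diff reports which LINES changed.
diff_status=0
diff_out=$(diff "$work/old.txt" "$work/new.txt") || diff_status=$?

# cmp reports the position of the first differing BYTE.
cmp_status=0
cmp_out=$(cmp "$work/old.txt" "$work/new.txt") || cmp_status=$?

printf '%s\n' "$diff_out" "---" "$cmp_out"
rm -rf "$work"
```

diff is the better fit for text files, while cmp also works well for binary files where lines have no meaning.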