
How the Linux 3.5 routing subsystem refactoring affected Redirect routes and the neighbour subsystem

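Some years ago I wrote several articles about Linux dropping support for the route cache. The route cache was retired as part of a refactoring of the routing subsystem; I will not repeat the reasons here. This article describes how that refactoring affected Redirect routes and the neighbour subsystem.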

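In fact, only in the last three months did I realise how large those effects are. I cannot go into the details of my work, so this is simply a summary of some implementation knowledge about the open-source Linux kernel protocol stack, kept for future reference. If anyone else benefits from it, I would be honoured.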

Route entries: rtable, dst_entry and neighbour

In the IP protocol stack, sending an IP packet consists of two parts:

IP route lookup

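To send a packet successfully there must be a matching route. Finding it is the job of the route lookup logic defined by the IP protocol; the details of that lookup are not the point of this article. On Linux, the final result of the lookup is an rtable structure representing a route entry. Its first embedded field is a dst_entry structure, so the two can be cast to each other. The important field is rt_gateway.
rt_gateway is simply the IP address of the next hop on the way to the destination; it is the core of hop-by-hop IP forwarding. At this point the IP route lookup is complete.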
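The "first embedded field, so the two can be cast to each other" point can be sketched in a few lines of C. This is only an illustration under my own assumptions: the structures are cut down to the minimum, and refcnt, rt_to_dst, dst_to_rt and next_hop are names invented for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the real kernel structures have many more fields. */
struct dst_entry {
    int refcnt;              /* hypothetical field, for illustration only */
};

struct rtable {
    struct dst_entry dst;    /* must remain the first member */
    uint32_t rt_gateway;     /* next-hop IPv4 address, 0 if none */
};

/* The casts below are only valid while dst sits at offset 0. */
static_assert(offsetof(struct rtable, dst) == 0, "dst must be first");

/* A pointer to a struct also points to its first member. */
static struct dst_entry *rt_to_dst(struct rtable *rt)
{
    return (struct dst_entry *)rt;
}

static struct rtable *dst_to_rt(struct dst_entry *dst)
{
    return (struct rtable *)dst;
}

/* Address to resolve at layer 2: the gateway if set, else the destination. */
static uint32_t next_hop(const struct rtable *rt, uint32_t daddr)
{
    return rt->rt_gateway ? rt->rt_gateway : daddr;
}
```

With rt_gateway set to 10.0.0.1, next_hop returns the gateway; with rt_gateway equal to 0 it returns the packet's own destination address, which is the directly connected case.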

IP neighbour resolution

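The route lookup stage has produced rt_gateway, so the next step is to map it down to layer 2. That is the job of IP neighbour resolution: rt_gateway is the neighbour, and it now has to be resolved to a hardware address. A neighbour is any network interface that is logically directly connected to this host. "Logically directly connected" means that on Ethernet every device on the segment can be a neighbour of this host; what matters is which one is chosen as the next hop for the current packet. A POINTOPOINT device, by contrast, has exactly one neighbour, the peer device, and because it is unique no hardware address needs to be resolved. Note that ignoring this difference costs a great deal of performance, as the last section of this article explains.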

Note:

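For convenience, rtable will not be mentioned again below; the result of a route lookup is always written as dst_entry. The code that follows is not real Linux protocol stack code but pseudocode abstracted for ease of presentation, so dst_entry here is not the kernel's dst_entry structure; it merely stands for a route entry. The reason is that dst_entry represents the protocol-independent part, and the content of this article is also independent of any specific protocol, so the pseudocode does not use the protocol-specific rtable structure to represent a route entry.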

The kernel's refactoring of the routing subsystem

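Before kernel 3.5, the routing subsystem had a route cache hash table that cached the most recently and most frequently used dst_entry route entries (rtable for IPv4). For each packet the route cache was searched first, keyed by the packet's IP address tuple; on a hit the dst_entry was taken straight from the cache, otherwise the system routing table was searched.
In the 3.5 kernel the route cache is gone. The reasons are not the focus of this article and are covered elsewhere. Removing the route cache had a side effect on the neighbour subsystem, and that side effect turned out to be beneficial; most of what follows is about it. Before describing the impact on the neighbour subsystem in detail, a few words about another change: how Redirect routes are implemented.
A Redirect route is by definition a redirect of a route entry that already exists on the host. Earlier kernels, however, stored redirect routes in various other places, such as inet_peer, which coupled the routing subsystem to other parts of the protocol stack. In those kernels, wherever a Redirect route was stored, it ultimately had to enter the route cache to take effect. Once the route cache was removed entirely, the question of where to store Redirect routes was exposed. To "solve the Redirect route problem inside the routing subsystem", the refactored kernel keeps an exception hash table for every route entry in the routing table. A route entry Fib_info looks roughly like this:

Fib_info {
    Address nexthop;
    Hash_list exception;
};

An entry in the exception table looks roughly like this:

Exception_entry {
    Match_info info;
    Address new_nexthop;
};

With this in place, when a Redirect is received, an Exception_entry record is initialised and inserted into the corresponding exception hash table. During a route lookup, once a Fib_info has been found and before the final dst_entry is built, the exception hash table is searched using Match_info such as the source IP information. If a matching Exception_entry is found, the dst_entry is built from that entry's new_nexthop instead of the nexthop in Fib_info.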
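The exception lookup just described can be sketched in C roughly as follows. This is a minimal illustration under assumptions of my own, not the kernel's code: Match_info is reduced to a single 32-bit match_key, the table has a fixed number of buckets, and the names fib_add_exception, fib_select_nexthop and exc_hash are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXC_BUCKETS 8   /* arbitrary table size for this sketch */

struct exception_entry {
    uint32_t match_key;             /* stand-in for Match_info (hypothetical) */
    uint32_t new_nexthop;           /* next hop learned from the Redirect */
    struct exception_entry *next;   /* hash chain link */
};

struct fib_info {
    uint32_t nexthop;                                /* configured next hop */
    struct exception_entry *exception[EXC_BUCKETS];  /* per-route exceptions */
};

static unsigned int exc_hash(uint32_t key)
{
    return (key ^ (key >> 16)) % EXC_BUCKETS;
}

/* Record a Redirect against this route; the caller owns the storage of e. */
static void fib_add_exception(struct fib_info *fi, struct exception_entry *e)
{
    unsigned int b;

    assert(fi != NULL && e != NULL);
    b = exc_hash(e->match_key);
    e->next = fi->exception[b];
    fi->exception[b] = e;
}

/* Next hop used to build the dst_entry: a matching exception wins,
 * otherwise the route's own nexthop is used. */
static uint32_t fib_select_nexthop(const struct fib_info *fi, uint32_t key)
{
    const struct exception_entry *e;

    for (e = fi->exception[exc_hash(key)]; e != NULL; e = e->next)
        if (e->match_key == key)
            return e->new_nexthop;
    return fi->nexthop;
}
```

The point of the design is that the redirect state lives inside the route entry itself, so nothing outside the routing subsystem has to be consulted when the dst_entry is built.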
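With that brief introduction to Redirect routes out of the way, the rest of this article is entirely about the relationship between routes and neighbours.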

Side effects of the refactoring on the neighbour subsystem

The following is quoted from material found online about how removing the route cache affected neighbours:
Neighbours
>Hold link-level nexthop information (for ARP, etc.)
>Routing cache pre-computed neighbours
>Remember: One “route” can refer to several nexthops
>Need to disconnect neighbours from route entries.
>Solution:
　　Make neighbour lookups cheaper (faster hash, etc.)
　　Compute neighbours at packet send time ...
　　.. instead of using precomputed reference via route
>Most of work involved removing dependencies on old setup
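In fact the two should not be related at all. The routing subsystem and the neighbour subsystem sit at different layers, one above the other. The sensible design is to connect them through the nexthop value of the route entry, via a single neighbour lookup interface: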

dst_entry = lookup_fib(skb.destination)      # or a route cache lookup, keyed by the skb's destination
nexthop = dst_entry.nexthop
neigh = lookup(neighbour_table, nexthop)     # keyed by nexthop

The actual Linux implementation, however, is far more complicated than this. The story starts before the 3.5 refactoring.

Before the refactoring

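Before the refactoring there was a route cache, and any skb whose dst_entry could be found in the cache did not need a routing table lookup. The route cache rested on the assumption that the vast majority of skbs never need the routing table; ideally they all hit the cache. For neighbours, the obvious approach was to bind the neighbour to the dst_entry, so that finding the dst_entry in the cache also found the neighbour. In other words, the route cache cached not only the dst_entry but also the neighbour.
In fact, before 3.5 the dst_entry structure had a neighbour field holding the neighbour bound to that route entry. Once the route entry was found in the route cache, the neighbour could be taken from it and its output callback called directly.
From this we can work out when a dst_entry is bound to a neighbour: after the routing table lookup. That is, on a route cache miss, once the routing table has been searched and before the result is inserted into the route cache, the neighbour binding logic runs.
Like the route cache, the neighbour subsystem maintains its own neighbour table and performs replacement, update, expiry and other state operations on it. This neighbour table and the route cache are heavily coupled. Before describing that coupling, here is the overall logic: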

func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway?:skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        neigh.output(neigh, skb);
endfunc
---->TO Layer2

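Consider the following questions:
When the neighbour timer runs and a neighbour has expired, can it be deleted?
When the route cache timer runs and a route cache entry has expired, can it be deleted?
Anyone who can answer these two questions precisely understands the relationship between the routing subsystem and the neighbour subsystem well enough. Take the first question first.
If the neighbour were deleted, a route cache entry bound to it might still exist, and a later skb matching that cache entry could no longer fetch and use the neighbour. Because a dst_entry is bound to a neighbour only on a route cache miss, no rebinding can happen at that point. Moreover, since route entries and neighbours are in a many-to-one relationship, the neighbour cannot hold back-references to the route cache entries. A deleted neighbour referenced through dst_entry.neighbour would be a dangling pointer, causing an oops and eventually a kernel panic. So the obvious answer is that an expired neighbour must not be deleted; it can only be marked invalid, which reference counting makes possible. Now the second question.
An expired route cache entry can be deleted, but the reference count of the neighbour bound to it must be decremented, and if it reaches 0 the neighbour is deleted. That neighbour is exactly the kind from the first question that could not be deleted when it expired. So the coupling between the route cache and neighbours means that the expiry-driven deletion of a neighbour bound to a dst_entry can only be initiated from the route cache entry, unless the neighbour is not bound to any dst_entry. The overall send logic is revised as follows: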

func ip_output(skb):
        dst_entry = lookup_from_cache(skb.destination);
        if dst_entry == NULL
        then
                dst_entry = lookup_fib(skb.destination);
                nexthop = dst_entry.gateway?:skb.destination;
                neigh = lookup(neighbour_table, nexthop);
                if neigh == NULL
                then
                        neigh = create(neighbour_table, nexthop);
                        neighbour_add_timer(neigh);
                end
                inc(neigh.refcnt);
                dst_entry.neighbour = neigh;
                insert_into_route_cache(dst_entry);
        end
        neigh = dst_entry.neighbour;
        # A neigh in INVALID state has to be handled inside the output callback
        neigh.output(neigh, skb);
endfunc
   
func neighbour_add_timer(neigh):
        inc(neigh.refcnt);
        neigh.timer.func = neighbour_timeout;
        timer_start(neigh.timer);
endfunc

func neighbour_timeout(neigh):
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        else
                neigh.status = INVALID;
        end
endfunc

func dst_entry_timeout(dst_entry):
        neigh = dst_entry.neighbour;
        cnt = dec(neigh.refcnt);
        if cnt == 0
        then
                free_neigh(neigh);
        end
        free_dst(dst_entry);
endfunc
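To make the lifetime rule concrete, here is a small C sketch of the same reference counting. It is an illustration only, with assumptions of my own: the freed flag stands in for actually releasing memory, neigh_init, dst_bind and neigh_put are invented helper names, and the neighbour is marked invalid before the timer's reference is dropped, which is equivalent to the pseudocode above whenever other references remain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum neigh_status { NEIGH_VALID, NEIGH_INVALID };

struct neighbour {
    int refcnt;
    enum neigh_status status;
    bool freed;                 /* hypothetical: stands in for free_neigh() */
};

struct dst_entry {
    struct neighbour *neighbour;
};

/* Drop one reference; the last one out releases the neighbour. */
static void neigh_put(struct neighbour *n)
{
    assert(n->refcnt > 0);
    if (--n->refcnt == 0)
        n->freed = true;
}

/* A new neighbour starts with one reference, held by its timer. */
static void neigh_init(struct neighbour *n)
{
    n->refcnt = 1;
    n->status = NEIGH_VALID;
    n->freed = false;
}

/* Binding on a route cache miss takes a reference for the cache entry. */
static void dst_bind(struct dst_entry *dst, struct neighbour *n)
{
    n->refcnt++;
    dst->neighbour = n;
}

/* Neighbour expiry: mark invalid, drop the timer's reference. */
static void neighbour_timeout(struct neighbour *n)
{
    n->status = NEIGH_INVALID;
    neigh_put(n);
}

/* Route cache expiry: drop the cache entry's reference. */
static void dst_entry_timeout(struct dst_entry *dst)
{
    neigh_put(dst->neighbour);
    dst->neighbour = NULL;
}
```

With two cache entries bound to one neighbour, the neighbour's own timeout only marks it INVALID; it is released only when the second cache entry times out, which is the dependency the next paragraphs complain about.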

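Finally, let us see what problem this causes.
If the gc parameters of the neighbour table and of the route cache are out of sync, for example neighbours expire quickly while route cache entries expire slowly, many neighbours cannot be deleted and the neighbour table fills up. In that case the route cache has to be forcibly reclaimed. This is coupling fed back from the neighbour subsystem into the routing subsystem, and the whole thing is simply too messy: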

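func create(neighbour_table, nexthop):
retry:
        neigh = alloc_neigh(nexthop);
        if neigh == NULL or neighbour_table.num > MAX
        then
                # allocation failed or the table is full: reclaim route cache
                # entries so the neighbours they pin can go away, then retry
                free_neigh(neigh);      # no-op when neigh is NULL
                shrink_route_cache();
                goto retry;
        end
        insert(neighbour_table, neigh);
        return neigh;
endfunc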

On the relationship between the route cache gc timer and the neighbour subsystem, the well-written article "Tuning Linux IPv4 route cache" has this to say:
You may find documentation about those obsolete sysctl values:
net.ipv4.route.secret_interval has been removed in Linux 2.6.35; it was used to trigger an asynchronous flush at fixed interval to avoid to fill the cache.
net.ipv4.route.gc_interval has been removed in Linux 2.6.38. It is still present until Linux 3.2 but has no effect. It was used to trigger an asynchronous cleanup of the route cache. The garbage collector is now considered efficient enough for the job.
UPDATED: net.ipv4.route.gc_interval is back for Linux 3.2. It is still needed to avoid exhausting the neighbour cache because it allows to cleanup the cache periodically and not only above a given threshold. Keep it to its default value of 60.

All of this changed with the 3.5 kernel!

After the refactoring

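After the refactoring, kernel 3.5 and later no longer support the route cache, so the routing table is consulted for every packet (leaving aside the case where a dst_entry is cached in the socket). Without a route cache there is no cache expiry or replacement to handle, and the routing subsystem becomes completely stateless. A dst_entry therefore no longer needs to be bound to a neighbour: if looking up the routing table for every packet is cheap enough, looking up the much smaller neighbour table every time is cheaper still (although the lookup cost cannot be avoided entirely). So the neighbour field was removed from dst_entry, and the IP send logic becomes: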

func ip_output(skb):
        dst_entry = lookup_fib(skb.destination);
        nexthop = dst_entry.gateway?:skb.destination;
        neigh = lookup(neighbour_table, nexthop);
        if neigh == NULL
        then    
                neigh = create(neighbour_table, nexthop);
        end
        neigh.output(neigh, skb);
endfunc

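Route entries are no longer associated with neighbours, so the neighbour table can run its expiry independently, and it no longer fills up frequently because the route cache gc is too slow.
Not only that, the code also looks much cleaner.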

A detail: neighbours of POINTOPOINT and LOOPBACK devices

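There is plenty of material on the Linux neighbour subsystem, but almost all of it is about ARP: the various complicated ARP protocol operations, queue handling, the state machine and so on. Hardly anything describes neighbours outside ARP, so this final section adds an example. As before, start with a question:
For an skb sent by a NOARP device, such as a POINTOPOINT device, who is its neighbour?
On broadcast Ethernet, sending a packet to a remote host requires resolving the "next hop" address: every outgoing packet leaves through a gateway, which is abstracted as an IP address on the same subnet, so ARP is needed to map it to a concrete hardware address. A pointopoint device, however, has exactly one fixed peer and no broadcast or multicast layer 2, so the notion of a gateway does not apply. Put another way, its next hop is the destination IP address itself.
In the ip_output function above, the key used for the neighbour table lookup is nexthop. For a pointopoint device nexthop is the skb's destination address, and if no entry is found one is created with that key. Now imagine skbs sent through a pointopoint device to a very large destination address space: a huge number of neighbours are created at the same time and inserted into the neighbour table at the same time, which inevitably runs into locking. In fact, all of those insertions spin on the write side of the neighbour table's read-write lock!
The logic of neigh_create is as follows: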

struct neighbour *neigh_create(struct neigh_table *tbl, const void *pkey,
                   struct net_device *dev)
{
    struct neighbour *n1, *rc, *n = neigh_alloc(tbl);
    ......
    write_lock_bh(&tbl->lock);
    // insert into the hash table
    write_unlock_bh(&tbl->lock);
    ......
}

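When skbs with a huge number of destination IPs are sent through a pointopoint device, this is a bottleneck that cannot be avoided. But the kernel was not that naive. It sidestepped the problem like this: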

__be32 nexthop = ((struct rtable *)dst)->rt_gateway ?: ip_hdr(skb)->daddr;
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
    nexthop = 0;

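This means that as long as the sending pointopoint device is the same and the pseudo layer-2 information (as with IPGRE) is the same, all skbs use the same neighbour regardless of their destination addresses. With an IPIP tunnel, which has no layer-2 information at all, this goes further: all skbs passing through IPIP tunnel devices use one single neighbour, even when they are sent through different IPIP tunnel devices.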
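The effect of forcing the key to 0 can be shown with a toy model in C. Everything here is an assumption made for illustration: the table is a flat array searched linearly on a single device, the DEV_* flag names are made up (they only mirror the idea of IFF_LOOPBACK and IFF_POINTOPOINT), and each "create" stands for one pass through the write-locked insertion path.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_LOOPBACK     0x1u   /* hypothetical flag values for this sketch */
#define DEV_POINTOPOINT  0x2u
#define MAX_NEIGH        1024

struct neigh_table {
    uint32_t keys[MAX_NEIGH];
    unsigned int num;
};

/* Key used for the neighbour lookup, with the pre-3.5 special case. */
static uint32_t neigh_key(unsigned int dev_flags, uint32_t gateway,
                          uint32_t daddr)
{
    uint32_t nexthop = gateway ? gateway : daddr;

    if (dev_flags & (DEV_LOOPBACK | DEV_POINTOPOINT))
        nexthop = 0;
    return nexthop;
}

/* Returns 1 if a new entry had to be created (the write-locked path). */
static int neigh_lookup_or_create(struct neigh_table *tbl, uint32_t key)
{
    unsigned int i;

    for (i = 0; i < tbl->num; i++)
        if (tbl->keys[i] == key)
            return 0;
    assert(tbl->num < MAX_NEIGH);
    tbl->keys[tbl->num++] = key;
    return 1;
}

/* Send to n distinct destinations with no gateway; count the creations. */
static unsigned int count_creates(unsigned int dev_flags, unsigned int n)
{
    struct neigh_table tbl = { .num = 0 };
    unsigned int i, creates = 0;

    for (i = 1; i <= n; i++)
        creates += neigh_lookup_or_create(&tbl, neigh_key(dev_flags, 0, i));
    return creates;
}
```

In this model, count_creates(0, 500) creates 500 entries while count_creates(DEV_POINTOPOINT, 500) creates just one, so only the very first packet ever takes the insertion path.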
After the 3.5 refactoring, however, things went wrong!
Let us look directly at the 4.4 kernel:

static inline __be32 rt_nexthop(const struct rtable *rt, __be32 daddr)
{
    if (rt->rt_gateway)
        return rt->rt_gateway;
    return daddr;
}
static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
　　......
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
    neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
    if (!IS_ERR(neigh)) {
        int res = dst_neigh_output(dst, neigh, skb);
        return res;
    }
　　......
}

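As you can see, the dev->flags&(IFF_LOOPBACK|IFF_POINTOPOINT) check has disappeared, which means the kernel lost this optimisation. The behaviour analysed in the previous section can occur in kernels after 3.5, and in practice it certainly does.
When I ran into this problem, before I had read the pre-3.5 implementation in detail, my idea was to initialise a global dummy neighbour that simply uses dev_queue_xmit to send directly: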

static const struct neigh_ops dummy_direct_ops = {
    .family =        AF_INET,
    .output =        neigh_direct_output,
    .connected_output =    neigh_direct_output,
};
struct neighbour dummy_neigh;
void dummy_neigh_init(void)
{
    memset(&dummy_neigh, 0, sizeof(dummy_neigh));
    dummy_neigh.nud_state = NUD_NOARP;
    dummy_neigh.ops = &dummy_direct_ops;
    dummy_neigh.output = neigh_direct_output;
    dummy_neigh.hh.hh_len = 0;
}

static inline int ip_finish_output2(struct sk_buff *skb)
{
    ......
    nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
    if (dev->type == ARPHRD_TUNNEL) {
        neigh = &dummy_neigh;
    } else {
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    }
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
    ......
}

Later I read the pre-3.5 implementation and found this:

if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
    nexthop = 0;

So I decided to use it instead; it is less code and more elegant. That produced the following patch:

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s

        rcu_read_lock_bh();
        nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
+       if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
+               nexthop = 0;
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
