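/* e4gle: While modifying the Linux source code I was plagued by a large number of mutual-exclusion problems in the kernel, which have to be solved with kernel locks. Most of them were resolved in the end, and I felt I ought to write something down, but I never had the time. I happened to come across this article and found it was exactly what I had wanted to put together, so I am sharing it here. On bottom_half and interrupts: in the TCP/IP bottom half you absolutely must not perform file read/write operations, otherwise the kernel panics. The enhancement I was adding to Linux did exactly that, which frustrated me for a long time. Discussion is welcome. */

My View of Mutual Exclusion in the Kernel
by wheelz

After reading the earlier discussion I have some thoughts of my own, which I offer for debate.

What needs to be made clear is that the choice of mutual-exclusion mechanism depends not on the size of the critical section but on its nature, and on which pieces of code, that is, which kernel execution paths, contend for it.

Strictly speaking, semaphore and spinlock_XXX are mutual-exclusion mechanisms at different levels: the former is implemented on top of the latter. It is a bit like the relationship between HTTP and TCP: both are protocols, but at different layers.

Take semaphore first. It is process-level and is used for mutual exclusion over a resource among multiple processes. Although this also happens inside the kernel, the kernel execution path is acting as a process and contends for the resource on that process's behalf. If it loses the contention there is a context switch and the process can go to sleep, but the CPU does not stop; it goes on to run other execution paths. Conceptually this has no direct connection with whether there is one CPU or several. Only in the implementation of semaphore itself, to guarantee atomic access to the semaphore structure, is a spinlock needed for mutual exclusion on multi-CPU systems.

Inside the kernel, the more common need is mutual exclusion of data access among the kernel's own execution paths. This is the most basic mutual-exclusion problem: keeping data modification atomic. The implementation of semaphore depends on it too. On a single CPU the problem is mainly interrupts and bottom_half, so disabling and enabling interrupts is enough. On multiple CPUs there is also interference from the other CPUs, so a spinlock is needed as well. Combining the two parts gives spinlock_XXX. Its characteristic is that once a CPU enters spinlock_XXX it does nothing else; it just spins until the lock is acquired. This means that a critical section protected by spinlock_XXX must not stall, let alone context switch; it should access the data and get out quickly so that the other execution paths that are spinning can obtain the spinlock. This is the guiding principle of spinlock. If the current execution path really must context switch, it has to release the spinlock before calling schedule(), otherwise deadlock is likely. The reason is that interrupts and bh have no context and cannot context switch; they can only spin waiting for the spinlock, and if you have context switched away, who knows when you will be back.

Because the original intent and purpose of spinlock is to guarantee that data modification is atomic, there is no reason to linger in a critical section protected by a spinlock.

spinlock_XXX comes in many forms:

spin_lock()/spin_unlock()
spin_lock_irq()/spin_unlock_irq()
spin_lock_irqsave()/spin_unlock_irqrestore()
spin_lock_bh()/spin_unlock_bh()
local_irq_disable()/local_irq_enable()
local_bh_disable()/local_bh_enable()

So which one should be used in which situation? That depends on which kernel execution path you are in and which kernel execution paths you need to exclude. As we know, the main execution paths in the kernel are:

1 The kernel mode of a user process. There is a process context here; the kernel is mainly executing a system call or the like on behalf of the process.
2 Interrupts, exceptions, traps and so on. Conceptually there is no process context here, and no context switch is possible.
3 bottom_half. Conceptually there is no process context here either.
4 In addition, the same execution path may be running on another CPU at the same time.

Considering these four factors, and working out which of them may access the data we want to protect, we can decide which form of spinlock to use. If you only need to exclude other CPUs, use spin_lock/spin_unlock; if you need to exclude irq and other CPUs, use spin_lock_irq/spin_unlock_irq; if you need to exclude irq and other CPUs and also preserve the state of EFLAGS, use spin_lock_irqsave/spin_unlock_irqrestore; if you need to exclude bh and other CPUs, use spin_lock_bh/spin_unlock_bh; if you do not need to exclude other CPUs, only irq, use local_irq_disable/local_irq_enable; if you do not need to exclude other CPUs, only bh, use local_bh_disable/local_bh_enable;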
and so on. It is worth pointing out that, for mutual exclusion over the same piece of data, different kernel execution paths may use different forms (see the example below).

Here is an example. The interrupt code has an array variable irq_desc[] whose elements are of type irq_desc_t. Each element of the array is the descriptor for one irq and holds, among other things, that irq's handler. The irq_desc_t structure contains a spinlock that guarantees mutual exclusion when the descriptor is accessed (modified).

For a particular irq's element, irq_desc[irq], there are two kernel execution paths that access it. One is when the irq's handler is installed (setup_irq), which usually happens during module initialization or system initialization. The other is in the interrupt handler (do_IRQ). The code is as follows:

int setup_irq(unsigned int irq, struct irqaction * new)
{
	int shared = 0;
	unsigned long flags;
	struct irqaction *old, **p;
	irq_desc_t *desc = irq_desc + irq;

	/*
	 * Some drivers like serial.c use request_irq() heavily,
	 * so we have to be careful not to interfere with a
	 * running system.
	 */
	if (new->flags & SA_SAMPLE_RANDOM) {
		/*
		 * This function might sleep, we want to call it first,
		 * outside of the atomic block.
		 * Yes, this might clear the entropy pool if the wrong
		 * driver is attempted to be loaded, without actually
		 * installing a new handler, but is this really a problem,
		 * only the sysadmin is able to do this.
		 */
		rand_initialize_irq(irq);
	}

	/*
	 * The following block of code has to be executed atomically
	 */
[1]	spin_lock_irqsave(&desc->lock,flags);
	p = &desc->action;
	if ((old = *p) != NULL) {
		/* Can't share interrupts unless both agree to */
		if (!(old->flags & new->flags & SA_SHIRQ)) {
[2]			spin_unlock_irqrestore(&desc->lock,flags);
			return -EBUSY;
		}

		/* add new interrupt at end of irq queue */
		do {
			p = &old->next;
			old = *p;
		} while (old);
		shared = 1;
	}

	*p = new;

	if (!shared) {
		desc->depth = 0;
		desc->status &= ~(IRQ_DISABLED | IRQ_AUTODETECT | IRQ_WAITING);
		desc->handler->startup(irq);
	}
[3]	spin_unlock_irqrestore(&desc->lock,flags);

	register_irq_proc(irq);
	return 0;
}

asmlinkage unsigned int do_IRQ(struct pt_regs regs)
{
	/*
	 * We ack quickly, we don't want the irq controller
	 * thinking we're snobs just because some other CPU has
	 * disabled global interrupts (we have already done the
	 * INT_ACK cycles, it's too late to try to pretend to the
	 * controller that we aren't taking the interrupt).
	 *
	 * 0 return value means that this irq is already being
	 * handled by some other CPU.
	 * (or is disabled)
	 */
	int irq = regs.orig_eax & 0xff; /* high bits used in ret_from_ code */
	int cpu = smp_processor_id();
	irq_desc_t *desc = irq_desc + irq;
	struct irqaction * action;
	unsigned int status;

	kstat.irqs[cpu][irq]++;
[4]	spin_lock(&desc->lock);
	desc->handler->ack(irq);
	/*
	   REPLAY is when Linux resends an IRQ that was dropped earlier
	   WAITING is used by probe to mark irqs that are being tested
	 */
	status = desc->status & ~(IRQ_REPLAY | IRQ_WAITING);
	status |= IRQ_PENDING; /* we _want_ to handle it */

	/*
	 * If the IRQ is disabled for whatever reason, we cannot
	 * use the action we have.
	 */
	action = NULL;
	if (!(status & (IRQ_DISABLED | IRQ_INPROGRESS))) {
		action = desc->action;
		status &= ~IRQ_PENDING; /* we commit to handling */
		status |= IRQ_INPROGRESS; /* we are handling it */
	}
	desc->status = status;

	/*
	 * If there is no IRQ handler or it was disabled, exit early.
	   Since we set PENDING, if another processor is handling
	   a different instance of this same irq, the other processor
	   will take care of it.
	 */
	if (!action)
		goto out;

	/*
	 * Edge triggered interrupts need to remember
	 * pending events.
	 * This applies to any hw interrupts that allow a second
	 * instance of the same irq to arrive while we are in do_IRQ
	 * or in the handler. But the code here only handles the _second_
	 * instance of the irq, not the third or fourth. So it is mostly
	 * useful for irq hardware that does not mask cleanly in an
	 * SMP environment.
	 */
	for (;;) {
[5]		spin_unlock(&desc->lock);
		handle_IRQ_event(irq, &regs, action);
[6]		spin_lock(&desc->lock)