Peter Zijlstra [Fri, 27 Jun 2008 11:41:29 +0000 (13:41 +0200)]
sched: fix load scaling in group balancing
doing the load balance will change cfs_rq->load.weight (that's the whole point)
but since that's part of the scale factor, we'll scale back with a different
amount.
Weight getting smaller would result in an inflated moved_load which causes
it to stop balancing too soon.
Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:28 +0000 (13:41 +0200)]
sched: hierarchical load vs find_busiest_group
find_busiest_group() has some assumptions about task weight being in the
NICE_0_LOAD range. Hierarchical task groups break this assumption - fix this
by replacing it with the average task weight, which will adapt the situation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:26 +0000 (13:41 +0200)]
sched: persistent average load per task
Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of
cpu_avg_load_per_task() - this is useful because of the hierarchical group
model in which task weight can be much smaller.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:23 +0000 (13:41 +0200)]
sched: simplify the group load balancer
While thinking about the previous patch - I realized that using per domain
aggregate load values in load_balance_fair() is wrong. We should use the
load value for that CPU.
By not needing per domain hierarchical load values we don't need to store
per domain aggregate shares, which greatly simplifies all the math.
It basically falls apart in two separate computations:
- per domain update of the shares
- per CPU update of the hierarchical load
Also get rid of the move_group_shares() stuff - just re-compute the shares
again after a successful load balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:21 +0000 (13:41 +0200)]
sched: dont micro manage share losses
We used to try and contain the loss of 'shares' by playing arithmetic
games. Replace that by noticing that at the top sched_domain we'll
always have the full weight in shares to distribute.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
The idea was to balance groups until we've reached the global goal, however
Vatsa rightly pointed out that we might never reach that goal this way -
hence take out this logic.
[ the initial rationale for this 'feature' was to promote max concurrency
within a group - it does not however affect fairness ]
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:19 +0000 (13:41 +0200)]
sched: update aggregate when holding the RQs
It was observed that in __update_group_shares_cpu()
rq_weight > aggregate()->rq_weight
This is caused by forks/wakeups in between the initial aggregate pass and
locking of the RQs for load balance. To avoid this situation partially re-do
the aggregation once we have the RQs locked (which avoids new tasks from
appearing).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:18 +0000 (13:41 +0200)]
sched: fix sched_domain aggregation
Keeping the aggregate on the first cpu of the sched domain has two problems:
- it could collide between different sched domains on different cpus
- it could slow things down because of the remote accesses
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:12 +0000 (13:41 +0200)]
sched: fix calc_delta_asym()
calc_delta_asym() is supposed to do the same as calc_delta_fair() except
linearly shrink the result for negative nice processes - this causes them
to have a smaller preemption threshold so that they are more easily preempted.
The problem is that for task groups se->load.weight is the per cpu share of
the actual task group weight; take that into account.
Also provide a debug switch to disable the asymmetry (which I still don't
like - but it does greatly benefit some workloads)
This would explain the interactivity issues reported against group scheduling.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Fri, 27 Jun 2008 11:41:10 +0000 (13:41 +0200)]
sched: clean up some unused variables
In file included from /mnt/build/linux-2.6/kernel/sched.c:1496:
/mnt/build/linux-2.6/kernel/sched_rt.c: In function '__enable_runtime':
/mnt/build/linux-2.6/kernel/sched_rt.c:339: warning: unused variable 'rd'
/mnt/build/linux-2.6/kernel/sched_rt.c: In function 'requeue_rt_entity':
/mnt/build/linux-2.6/kernel/sched_rt.c:692: warning: unused variable 'queue'
Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Fri, 27 Jun 2008 09:31:28 +0000 (11:31 +0200)]
x86, AMD IOMMU: build fix #3
fix typo causing:
arch/x86/kernel/built-in.o: In function `__unmap_single':
amd_iommu.c:(.text+0x17771): undefined reference to `iommu_area_free'
arch/x86/kernel/built-in.o: In function `__map_single':
amd_iommu.c:(.text+0x1797a): undefined reference to `iommu_area_alloc'
amd_iommu.c:(.text+0x179a2): undefined reference to `iommu_area_alloc'
Ingo Molnar [Fri, 27 Jun 2008 08:48:16 +0000 (10:48 +0200)]
x86, AMD IOMMU, build fix #2
fix:
arch/x86/kernel/amd_iommu.c: In function ‘amd_iommu_init_dma_ops':
arch/x86/kernel/amd_iommu.c:940: error: lvalue required as left operand of assignment
arch/x86/kernel/amd_iommu.c:941: error: lvalue required as left operand of assignment
Ingo Molnar [Fri, 27 Jun 2008 08:37:03 +0000 (10:37 +0200)]
x86, AMD IOMMU: build fix
fix:
arch/x86/kernel/amd_iommu_init.c:247: warning: 'struct acpi_table_header' declared inside parameter list
arch/x86/kernel/amd_iommu_init.c:247: warning: its scope is only this definition or declaration, which is probably not what you want
arch/x86/kernel/amd_iommu_init.c: In function 'find_last_devid_acpi':
arch/x86/kernel/amd_iommu_init.c:257: error: dereferencing pointer to incomplete type
arch/x86/kernel/amd_iommu_init.c:265: error: dereferencing pointer to incomplete type
I discovered that we had a list onto which every lock_dlm
lock was being put. Its only function was to discover whether
we'd got any locks left after umount. Since there was already
a counter for that purpose as well, I removed the list. The
saving is sizeof(struct list_head) per glock - well worth
having.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are several reasons why this is undesirable:
1. It never happens during normal operation anyway
2. If it does happen it causes performance to be very, very poor
3. It isn't likely to solve the original problem (memory shortage
on remote DLM node) it was supposed to solve
4. It uses a bunch of arbitrary constants which are unlikely to be
correct for any particular situation and for which the tuning seems
to be a black art.
5. In an N node cluster, only 1/N of the dropped locked will actually
contribute to solving the problem on average.
So all in all we are better off without it. This also makes merging
the lock_dlm module into GFS2 a bit easier.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Bob Peterson [Mon, 9 Jun 2008 17:08:23 +0000 (12:08 -0500)]
[GFS2] kernel panic mounting volume
This patch fixes Red Hat bugzilla bug 450156.
This started with a not-too-improbable mount failure because the
locking protocol was never set back to its proper "lock_dlm" after the
system was rebooted in the middle of a gfs2_fsck. That left a
(purposely) invalid locking protocol in the superblock, which caused an
error when the file system was mounted the next time.
When there's an error mounting, vfs calls DQUOT_OFF, which calls
vfs_quota_off which calls gfs2_sync_fs. Next, gfs2_sync_fs calls
gfs2_log_flush passing s_fs_info. But due to the error, s_fs_info
had been previously set to NULL, and so we have the kernel oops.
My solution in this patch is to test for the NULL value before passing
it. I tested this patch and it fixes the problem.
Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The previous attempt to fix the locking in readpage failed due
to the use of a "try lock" which resulted in occasional high
cpu usage during testing (due to repeated tries) and also it
did not resolve all the ordering problems wrt the transaction
lock (although it did solve all the inode lock ordering problems).
This patch avoids the problem by unlocking the page and getting the
locks in the correct order. This means that we have to retest the
page to ensure that it hasn't changed when we relock the page.
This now passes the tests which were previously failing.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch merges the lock_nolock module into GFS2 itself. As well as removing
some of the overhead of the module, it also means that its now impossible to
build GFS2 without a lock module (which would be a pointless thing to do
anyway).
We also plan to merge lock_dlm into GFS2 in the future, but that is a more
tricky task, and will therefore be a separate patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: David Teigland <teigland@redhat.com>
This looks like a lot of change, but in fact its not. Mostly its
things moving from one file to another. The change is just that
instead of queuing lock completions and callbacks from the DLM
we now pass them directly to GFS2.
This gives us a net loss of two list heads per glock (a fair
saving in memory) plus a reduction in the latency of delivering
the messages to GFS2, plus we now have one thread fewer as well.
There was a bug where callbacks and completions could be delivered
in the wrong order due to this unnecessary queuing which is fixed
by this patch.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com>
This patch implements a number of cleanups to the core of the
GFS2 glock code. As a result a lot of code is removed. It looks
like a really big change, but actually a large part of this patch
is either removing or moving existing code.
There are some new bits too though, such as the new run_queue()
function which is considerably streamlined. Highlights of this
patch include:
o Fixes a cluster coherency bug during SH -> EX lock conversions
o Removes the "glmutex" code in favour of a single bit lock
o Removes the ->go_xmote_bh() for inodes since it was duplicating
->go_lock()
o We now only use the ->lm_lock() function for both locks and
unlocks (i.e. unlock is a lock with target mode LM_ST_UNLOCKED)
o The fast path is considerably shortly, giving performance gains
especially with lock_nolock
o The glock_workqueue is now used for all the callbacks from the DLM
which allows us to simplify the lock_dlm module (see following patch)
o The way is now open to make further changes such as eliminating the two
threads (gfs2_glockd and gfs2_scand) in favour of a more efficient
scheme.
This patch has undergone extensive testing with various test suites
so it should be pretty stable by now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Bob Peterson <rpeterso@redhat.com>
Joerg Roedel [Thu, 26 Jun 2008 19:28:07 +0000 (21:28 +0200)]
x86, AMD IOMMU: initialize dma_ops from IOMMU initialization and enable IOMMUs
This patch adds a call to the driver specific dma_ops initialization routine
from the AMD IOMMU hardware initialization. Further it adds the necessary code
to finally enable AMD IOMMU hardware.
Joerg Roedel [Thu, 26 Jun 2008 19:28:04 +0000 (21:28 +0200)]
x86, AMD IOMMU: add pre-allocation of protection domains
This patch adds a function to pre-allocate protection domains. So we don't have
to allocate it on the first request for a device (which can happen in atomic
mode).
Joerg Roedel [Thu, 26 Jun 2008 19:27:41 +0000 (21:27 +0200)]
x86, AMD IOMMU: add functions to find last possible PCI device for IOMMU
This patch adds functions to find the last PCI bus/device/function the IOMMU
driver has to handle. This information is used later to allocate the memory for
the data structures.
Tejun Heo [Sat, 21 Jun 2008 07:07:32 +0000 (16:07 +0900)]
sata_uli: hardreset is broken
sata_uli can't do hardresets reliably and lock up. This went
unnoticed till now as softreset was the default and hardreset was only
used after softreset failed.
Reported by Christian Casteyde in bz#10860.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Takashi Iwai [Wed, 25 Jun 2008 15:17:00 +0000 (17:17 +0200)]
ALSA: ymfpci - fix initial volume for 44.1kHz output
YDSXGR_BUF441OUTVOL register isn't initialized properly. This may lead to
a silent output at full volume of unchanged "Wave Playback Volume".
http://bugzilla.kernel.org/show_bug.cgi?id=10963
Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jaroslav Kysela <perex@perex.cz>
Andi Kleen [Wed, 18 Jun 2008 11:58:36 +0000 (13:58 +0200)]
[netdrvr] Fix IOMMU overflow checking in s2io.c
s2io has IOMMU overflow checking, but unfortunately it is wrong.
It didn't use the standard macros, which meant that it only worked
on POWER and SPARC because only those define DMA_ERROR_CODE. Convert it to
use the standard macros instead.
I also commented two more bugs in the IOMMU handling. It assumes
that 0 DMA addresses cannot happen, but that's not true in all IOMMU setups.
The information if a buffer has been already mapped needs to be stored
elsewhere.
Didn't fix those because it needs careful checking of the buffer handling
by the maintainers.
Andy Gospodarek [Thu, 19 Jun 2008 21:19:02 +0000 (17:19 -0400)]
e1000: only enable TSO6 via ethtool when using correct hardware
When enabling TSO via ethool on e1000, it is possible to set
NETIF_F_TSO6 on hardware that does not support it. Setting TSO via
ethtool now matches the settings used when the hardware is probed.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Al Viro [Mon, 23 Jun 2008 01:04:50 +0000 (02:04 +0100)]
[netdrvr] netxen: fix netxen_pci_tbl[] breakage
PCI_DEVICE_CLASS sets .device and .vendor to PCI_ANY_DEV,
which overrides the effect of preceding PCI_DEVICE() and makes
all elements of netxen_pci_tbl[] identical. Introduced in the
commit dcd56fdbaeae1008044687b973c4a3e852e8a726.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Ingo Molnar [Mon, 23 Jun 2008 08:41:23 +0000 (10:41 +0200)]
[netdrvr] 3c59x: remove irqs_disabled warning from local_bh_enable
Original Author: Michael Buesch <mb@bu3sch.de>
net, vortex: fix lockup
Ingo Molnar reported:
-tip testing found that Johannes Berg's "softirq: remove irqs_disabled
warning from local_bh_enable" enhancement to lockdep triggers a new
warning on an old testbox that uses 3c59x vortex and netlogging:
----->
calling vortex_init+0x0/0xb0
PCI: Found IRQ 10 for device 0000:00:0b.0
PCI: Sharing IRQ 10 with 0000:00:0a.0
PCI: Sharing IRQ 10 with 0000:00:0b.1
3c59x: Donald Becker and others.
0000:00:0b.0: 3Com PCI 3c556 Laptop Tornado at e0800400.
PCI: Enabling bus mastering for device 0000:00:0b.0
initcall vortex_init+0x0/0xb0 returned 0 after 47 msecs
...
calling init_netconsole+0x0/0x1b0
netconsole: local port 4444
netconsole: local IP 10.0.1.9
netconsole: interface eth0
netconsole: remote port 4444
netconsole: remote IP 10.0.1.16
netconsole: remote ethernet address 00:19:xx:xx:xx:xx
netconsole: device eth0 not up yet, forcing it
eth0: setting half-duplex.
eth0: setting full-duplex.
------------[ cut here ]------------
WARNING: at kernel/softirq.c:137 local_bh_enable_ip+0xd1/0xe0()
Pid: 1, comm: swapper Not tainted 2.6.26-rc6-tip #2091
[<c0125ecf>] warn_on_slowpath+0x4f/0x70
[<c0126834>] ? release_console_sem+0x1b4/0x1d0
[<c0126d00>] ? vprintk+0x2a0/0x450
[<c012fde5>] ? __mod_timer+0xa5/0xc0
[<c046f7fd>] ? mdio_sync+0x3d/0x50
[<c0160ef6>] ? marker_probe_cb+0x46/0xa0
[<c0126ed7>] ? printk+0x27/0x50
[<c046f4c3>] ? vortex_set_duplex+0x43/0xc0
[<c046f521>] ? vortex_set_duplex+0xa1/0xc0
[<c0471b92>] ? vortex_timer+0xe2/0x3e0
[<c012b361>] local_bh_enable_ip+0xd1/0xe0
[<c08d9f9f>] _spin_unlock_bh+0x2f/0x40
[<c0471b92>] vortex_timer+0xe2/0x3e0
[<c014743b>] ? trace_hardirqs_on+0xb/0x10
[<c0147358>] ? trace_hardirqs_on_caller+0x88/0x160
[<c012f8b2>] run_timer_softirq+0x162/0x1c0
[<c0471ab0>] ? vortex_timer+0x0/0x3e0
[<c012b361>] local_bh_enable_ip+0xd1/0xe0
[<c08d9f9f>] _spin_unlock_bh+0x2f/0x40
[<c0471b92>] vortex_timer+0xe2/0x3e0
[<c014743b>] ? trace_hardirqs_on+0xb/0x10
[<c0147358>] ? trace_hardirqs_on_caller+0x88/0x160
[<c012f8b2>] run_timer_softirq+0x162/0x1c0
[<c0471ab0>] ? vortex_timer+0x0/0x3e0
[<c0471ab0>] ? vortex_timer+0x0/0x3e0
[<c012b60a>] __do_softirq+0x9a/0x160
[<c012b570>] ? __do_softirq+0x0/0x160
[<c0106775>] call_on_stack+0x15/0x30
[<c012b4f5>] ? irq_exit+0x55/0x60
[<c0106e85>] ? do_IRQ+0x85/0xd0
[<c0147391>] ? trace_hardirqs_on_caller+0xc1/0x160
[<c0104888>] ? common_interrupt+0x28/0x30
[<c08d8ac8>] ? mutex_unlock+0x8/0x10
[<c08d8180>] ? _cond_resched+0x10/0x30
[<c07a3be7>] ? netpoll_setup+0x117/0x390
[<c0cbfcfe>] ? init_netconsole+0x14e/0x1b0
[<c013d539>] ? ktime_get+0x19/0x40
[<c0c9bab2>] ? kernel_init+0x1b2/0x2c0
[<c0cbfbb0>] ? init_netconsole+0x0/0x1b0
[<c0396aa4>] ? trace_hardirqs_on_thunk+0xc/0x10
[<c0103f12>] ? restore_nocheck_notrace+0x0/0xe
[<c0c9b900>] ? kernel_init+0x0/0x2c0
[<c0c9b900>] ? kernel_init+0x0/0x2c0
[<c0104aa7>] ? kernel_thread_helper+0x7/0x10
=======================
---[ end trace 37f9c502aff112e0 ]---
console [netcon0] enabled
netconsole: network logging started
initcall init_netconsole+0x0/0x1b0 returned 0 after 2914 msecs
looking at the driver I think the bug is real and the fix actually
is trivial.
vp->lock is also taken in hardware IRQ context, so we _have_ to always
use irqsafe locking. As we run in a timer with IRQs disabled,
we can simply use spin_lock.
Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Atsushi Nemoto [Thu, 26 Jun 2008 08:14:15 +0000 (17:14 +0900)]
tc35815: Fix receiver hangup on Rx FIFO overflow
On Rx FIFO overflow error, the controller consume a buffer descriptor
but currently the driver does not give it back to the controller.
This results unrecoverable 'Buffer List Exhausted' condition. This
patch fix this problem by moving a "fbl_count--" line to proper place.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Michal Schmidt [Thu, 26 Jun 2008 14:06:19 +0000 (16:06 +0200)]
s2io: fix documentation about intr_type
The documentation for intr_type module parameter of the s2io driver is
not consistent with the code. The comments in drivers/net/s2io.c are
OK, but Documentation/networking/s2io.txt is wrong.
Pointed out by Andrew Hecox.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Michael Buesch [Thu, 19 Jun 2008 23:01:50 +0000 (01:01 +0200)]
b43: Remove "shm" and "ucode_regs" debugfs files
We don't need these two dump-files anymore, as we can easily do this
in userspace now.
Use b43-fwdump from the b43-tools repository to dump microcode registers.
Use "b43-fwdump -s" to dump SHM (or use -S to do a binary dump)
Signed-off-by: Michael Buesch <mb@bu3sch.de> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Michael Buesch [Thu, 19 Jun 2008 17:38:32 +0000 (19:38 +0200)]
b43: Add mask/set capability to debugfs MMIO interface
This adds an atomic mask/set capability to the debugfs MMIO interface.
This is needed to support mask and/or set operations from the userspace
debugging tools.
Signed-off-by: Michael Buesch <mb@bu3sch.de> Signed-off-by: John W. Linville <linville@tuxdriver.com>