Shaohua Li [Mon, 4 Aug 2008 06:51:30 +0000 (14:51 +0800)]
reduce tlb/cache flush times of agpgart memory allocation
To reduce tlb/cache flushes, make agp memory allocation do one flush
after all pages in a region are changed to uc.
All agp drivers except agp-sgi use agp_generic_alloc_page()
for .agp_alloc_page, so the patch should work for them. agp-sgi is only
for ia64, so it is not a problem either.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: airlied@linux.ie Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Shaohua Li [Mon, 4 Aug 2008 06:51:24 +0000 (14:51 +0800)]
introduce two APIs for page attribute
Introduce two APIs for page attributes. Flushing the tlb/cache on every page
attribute change is expensive. AGP gart usually does a lot of operations to
change pages to uc; the new APIs can reduce the number of flushes.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: airlied@linux.ie Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
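The saving described in the two agpgart entries above can be sketched in plain C. This is a userspace model only: the names demo_set_page_uc() and demo_set_pages_array_uc() are hypothetical and do not claim to match the APIs the patch adds, and the "flush" is just a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: they model the cost, not the real x86 code. */
static unsigned long flush_count;   /* number of simulated tlb/cache flushes */

enum demo_attr { DEMO_WB, DEMO_UC };

struct demo_page {
    enum demo_attr attr;
};

static void demo_flush_all(void)
{
    flush_count++;
}

/* Per-page variant: every attribute change pays for its own flush. */
static void demo_set_page_uc(struct demo_page *page)
{
    assert(page != NULL);
    page->attr = DEMO_UC;
    demo_flush_all();
}

/* Array variant: change every page first, then flush once for the region. */
static void demo_set_pages_array_uc(struct demo_page *pages, size_t count)
{
    size_t i;

    assert(pages != NULL || count == 0);
    for (i = 0; i < count; i++)
        pages[i].attr = DEMO_UC;
    if (count > 0)
        demo_flush_all();
}
```

In this model a 64-page region costs 64 flushes with the per-page call and one flush with the array call.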
Tim Bird [Tue, 12 Aug 2008 19:52:36 +0000 (12:52 -0700)]
x86, bootup: add built-in kernel command line for x86 (v2)
Allow x86 to support a built-in kernel command line. The built-in
command line can override the one provided by the boot loader, for
those cases where the boot loader is broken or it is difficult
to change the command line in the boot loader.
H. Peter Anvin wrote:
> Ingo Molnar wrote:
>> Best would be to make it really apparent in the code that nothing
>> changes if this config option is not set. Preferably there should be
>> no extra code at all in that case.
>>
>
> I would like to see this:
[...Nested ifdefs...]
OK. This version changes absolutely nothing if CONFIG_CMDLINE_BOOL is not
set (the default). Also, no space is appended even when CONFIG_CMDLINE_BOOL
is set, but the builtin string is empty. This is less sloppy all the way
around, IMHO.
Note that I use the same option names as on other arches for
this feature.
[ mingo@elte.hu: build fix ]
Signed-off-by: Tim Bird <tim.bird@am.sony.com> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
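A minimal userspace sketch of the merge policy described in the entry above. The function name, the "override" flag and the ordering (built-in string first, boot-loader string appended) are assumptions made for illustration; only the "no stray space when the built-in string is empty" and "built-in can override" behaviours come from the commit text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * "builtin" plays the role of the built-in string, "boot" the string
 * handed over by the boot loader. Hypothetical helper, not kernel code.
 */
static void demo_merge_cmdline(char *out, size_t out_size,
                               const char *builtin, const char *boot,
                               int override)
{
    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    if (builtin[0] == '\0') {
        /* Empty built-in string: pass the boot line through, no extra space. */
        strncat(out, boot, out_size - 1);
        return;
    }

    strncat(out, builtin, out_size - 1);
    if (override)
        return;     /* the built-in string replaces the boot-loader one */

    if (boot[0] != '\0') {
        /* strncat() always terminates, so leave room for the NUL byte. */
        strncat(out, " ", out_size - 1 - strlen(out));
        strncat(out, boot, out_size - 1 - strlen(out));
    }
}
```

With a built-in string of "console=ttyS0" and a boot-loader string of "root=/dev/sda1", the sketch yields "console=ttyS0 root=/dev/sda1" without override and "console=ttyS0" with it.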
rcu: classic RCU locking and memory-barrier cleanups
This patch simplifies the locking and memory-barrier usage in the Classic
RCU grace-period-detection mechanism, incorporating Lai Jiangshan's
feedback from the earlier version (http://lkml.org/lkml/2008/8/1/400
and http://lkml.org/lkml/2008/8/3/43). Passed 10 hours of
rcutorture concurrent with CPUs being put online and taken offline on
a 128-hardware-thread Power machine. My apologies to whoever in the
Eastern Hemisphere was planning to use this machine over the Western
Hemisphere night, but it was sitting idle and...
So this is ready for tip/core/rcu.
This patch is in preparation for moving to a hierarchical
algorithm to allow the very large SMP machines -- requested by some
people at OLS, and there seem to have been a few recent patches in the
4096-CPU direction as well. The general idea is to move to a much more
conservative concurrency design, then apply a hierarchy to reduce
contention on the global lock by a few orders of magnitude (larger
machines would see greater reductions). The reason for taking a
conservative approach is that this code isn't on any fast path.
Prototype in progress.
This patch is against the linux-tip git tree (tip/core/rcu). If you
wish to test this against 2.6.26, use the following set of patches:
Hugh Dickins [Fri, 15 Aug 2008 12:58:32 +0000 (13:58 +0100)]
x86: fix /proc/meminfo DirectMap
Do we actually want these DirectMap lines in the x86 /proc/meminfo?
I can see they're interesting to CPA developers and TLB optimizers,
but they don't fit its usual "where has all my memory gone?" usage.
If they are to stay, here are some fixes.
1. On x86_32 without PAE, they're not 2M but 4M pages: no need to
mess with the internal enum, but show the right name to users.
2. Many machines can never show anything but 0 for DirectMap1G,
so suppress that line unless direct_gbpages are really enabled.
3. The unit in /proc/meminfo is kB not number of pages: HugePages
messed that up, but they're an example to regret not to follow.
4. Once we use kB, it's easy to see that 1GB has gone missing (which
explains why CONFIG_CPA_DEBUG=y soon wraps DirectMap2M negative):
because head_64.S's level2_ident_pgt entries were not counted.
My fix is not ideal, but works for more and for less than 1G,
and avoids interfering with early bootup pagetable contortions.
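Point 3 above (report kB, not a page count) is simple arithmetic, sketched here. The helper name and the enum are hypothetical; the page sizes are the standard x86 ones (4 KiB, 2 MiB, 4 MiB without PAE, 1 GiB).

```c
#include <assert.h>

/* Page-size shifts: 4 KiB = 2^12, 2 MiB = 2^21, 4 MiB = 2^22, 1 GiB = 2^30. */
enum demo_page_shift {
    DEMO_SHIFT_4K = 12,
    DEMO_SHIFT_2M = 21,
    DEMO_SHIFT_4M = 22,
    DEMO_SHIFT_1G = 30
};

/* Convert a count of pages of a given size into kB (1 kB = 2^10 bytes). */
static unsigned long long demo_pages_to_kb(unsigned long long pages,
                                           unsigned int shift)
{
    assert(shift >= 10);
    return pages << (shift - 10);
}
```

For example, 512 pages of 2M come out as 1048576 kB, which is exactly the 1GB that item 4 says went missing from the count.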
Paul E. McKenney [Wed, 13 Aug 2008 00:25:03 +0000 (17:25 -0700)]
rcu: prevent console flood when one CPU sees another AWOL via RCU
One small change needed to keep from flooding the console when one
CPU notices that another is AWOL. Unless I am missing something subtle.
Otherwise the cleanups look good!
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86: fix readb() et al compile error with gcc-3.2.3
Building 2.6.27-rc1 on x86 with gcc-3.2.3 fails with:
In file included from include/asm/dma.h:12,
from include/linux/bootmem.h:8,
from init/main.c:26:
include/asm/io.h: In function `readb':
include/asm/io.h:32: syntax error before string constant
include/asm/io.h: In function `readw':
include/asm/io.h:33: syntax error before string constant
include/asm/io.h: In function `readl':
include/asm/io.h:34: syntax error before string constant
include/asm/io.h: In function `__readb':
include/asm/io.h:36: syntax error before string constant
include/asm/io.h: In function `__readw':
include/asm/io.h:37: syntax error before string constant
include/asm/io.h: In function `__readl':
include/asm/io.h:38: syntax error before string constant
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2
Starting with 2.6.27-rc1 readb() et al are generated by a
build_mmio_read() macro, which generates asm() statements with
output register constraints like "=" "q", i.e. as two adjacent
string literals. This doesn't work with gcc-3.2.3.
Fixed by moving the "=" part into the callers' reg parameter
(as suggested by Ingo).
Build and boot-tested with gcc-3.2.3 on 32 and 64-bit x86.
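The shape of the fix can be illustrated with a small macro. This is not the kernel's build_mmio_read(); demo_build_read() is a hypothetical, architecture-neutral stand-in that uses an empty asm template and a generic "=r" constraint (GCC/Clang extended asm assumed). The point is only that the caller supplies the whole constraint as one string literal, so the macro never relies on "=" "q" being concatenated.

```c
#include <assert.h>

/*
 * The caller passes the complete output constraint (for example "=r")
 * in the reg parameter. The "0" input ties the loaded value to the
 * output register, so the empty template just hands it back.
 */
#define demo_build_read(name, type, reg)                                \
static inline type name(const volatile type *addr)                      \
{                                                                       \
    type ret;                                                           \
    assert(addr != 0);                                                  \
    __asm__ __volatile__("" : reg (ret) : "0" (*addr) : "memory");      \
    return ret;                                                         \
}

/* One string literal for the constraint, not "=" followed by "r". */
demo_build_read(demo_read32, unsigned int, "=r")
```

A caller then writes demo_read32(&value) and gets the stored value back.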
Mark Langsdorf [Thu, 14 Aug 2008 14:11:26 +0000 (09:11 -0500)]
x86: invalidate caches before going into suspend
When a CPU core is shut down, all of its caches need to be flushed
to prevent stale data from causing errors if the core is resumed.
Current Linux suspend code performs an assignment after the flush,
which can add dirty data back to the cache. On some AMD platforms,
additional speculative reads have caused crashes on resume because
of this dirty data.
Relocate the cache flush to be the very last thing done before
halting. Tie into an assembly line so the compiler will not
reorder it. Add some documentation explaining what is going
on and why we're doing this.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com> Acked-by: Mark Borden <mark.borden@amd.com> Acked-by: Michael Hohmuth <michael.hohmuth@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Aristeu Rozanski [Thu, 14 Aug 2008 20:32:15 +0000 (16:32 -0400)]
x86, perfctr: don't use CCCR_OVF_PMI1 on Pentium 4Ds
Currently, setup_p4_watchdog() uses CCCR_OVF_PMI1 to enable the counter
overflow interrupts to the second logical core. But this bit doesn't work
on Pentium 4 Ds (model 4, stepping 4) and this patch avoids its use on
these processors. Tested on 4 different machines that have this
specific model with success.
Joerg Roedel [Thu, 14 Aug 2008 17:55:18 +0000 (19:55 +0200)]
x86, AMD IOMMU: initialize dma_ops after sysfs registration
If sysfs registration fails all memory used by IOMMU is freed. This
happens after dma_ops initialization and the functions will access the
freed memory then.
Fix this by initializing dma_ops after the sysfs registration.
Dave Jones [Thu, 14 Aug 2008 19:07:03 +0000 (15:07 -0400)]
x86: silence mmconfig printk
There's so much broken mmconfig hardware and so many broken BIOSes out there
that classing this as an error seems a little extreme.
Lower its priority to KERN_INFO so that it isn't so noisy
when booting with 'quiet'.
Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cyrill Gorcunov [Fri, 15 Aug 2008 11:51:21 +0000 (13:51 +0200)]
x86: apic - unify __setup_APIC_LVTT
To be able to unify this function we reintroduce
APIC_DIVISOR for 64-bit mode. This snippet was eliminated
some time ago for the sake of cleanup, but now we need it
again since it allows us to get rid of #ifdef(s).
A lapic_is_integrated() call is also added in apic_64.c, but
since the APIC is always integrated on 64-bit CPUs the compiler
will optimize this call away.
Darrick J. Wong [Thu, 14 Aug 2008 22:43:33 +0000 (15:43 -0700)]
x86, msr: fix NULL pointer deref due to msr_open on nonexistent CPUs
msr_open tests for someone trying to open a device for a nonexistent CPU.
However, the function always returns 0, not ret like it should, hence
userspace can BUG the kernel trivially. This bug was introduced by the
cdev lock_kernel pushdown patch last May.
The BUG can be reproduced with these commands:
# mknod fubar c 202 8 <-- pick a number less than NR_CPUS that is not
the number of an online CPU
# cat fubar
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
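The bug pattern in the msr_open entry above, reduced to a userspace sketch. All names, the CPU count and the choice of -ENXIO are hypothetical; the real function and its error code are not reproduced here. demo_access() stands in for the later per-CPU access that blows up when it is handed a CPU that does not exist.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_NR_CPUS 16

/* Hypothetical online map: only CPUs 0..3 are online in this sketch. */
static int demo_cpu_online(unsigned int cpu)
{
    return cpu < 4;
}

/* Buggy shape: the error is computed, but success is reported anyway. */
static int demo_open_buggy(unsigned int cpu)
{
    int ret = 0;

    if (cpu >= DEMO_NR_CPUS || !demo_cpu_online(cpu))
        ret = -ENXIO;

    (void)ret;
    return 0;       /* bug: the caller proceeds as if the CPU existed */
}

/* Fixed shape: propagate ret so the open fails for a nonexistent CPU. */
static int demo_open_fixed(unsigned int cpu)
{
    int ret = 0;

    if (cpu >= DEMO_NR_CPUS || !demo_cpu_online(cpu))
        ret = -ENXIO;

    return ret;
}

/* Later access; it must only ever see CPUs that passed the open check. */
static int demo_access(unsigned int cpu)
{
    assert(demo_cpu_online(cpu));
    return 0;
}
```

With the buggy open, a request for CPU 8 succeeds and a following demo_access(8) would trip the assertion; with the fixed open the request is rejected first.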
Takashi Iwai [Wed, 13 Aug 2008 13:40:53 +0000 (15:40 +0200)]
ALSA: usb-audio - Add ignore_ctl_error parameter
Added the ignore_ctl_error parameter to enable/disable the control-error
handling for mixer interfaces. It was a hard-coded ifdef, and now you
can change it more easily.
Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jaroslav Kysela <perex@perex.cz>
Zhao Yakui [Mon, 11 Aug 2008 02:33:31 +0000 (10:33 +0800)]
ACPI: Avoid bogus EC timeout when EC is in Polling mode
When EC is in Polling mode, OS will check the EC status continually by using
the following source code:
	clear_bit(EC_FLAGS_WAIT_GPE, &ec->flags);
	while (time_before(jiffies, delay)) {
		if (acpi_ec_check_status(ec, event))
			return 0;
		msleep(1);
	}
But msleep() is implemented with schedule_timeout(), and a process that has
been woken up by some event is not necessarily scheduled immediately. So the
following situations can occur:
a. The current jiffies value is already past the predefined deadline, so
the loop exits before the OS has had a chance to check the EC
status again.
b. If preemption is enabled, the process may be preempted
before the checking loop. When the process resumes, the timeout
may already have expired, which means that the OS has no chance to check
the EC status.
In such cases the EC status may already be what the OS expects when the timeout
expires, but the OS never checks it and reports AE_TIME.
So it is more appropriate for the OS to check the EC status once more
when the timeout expires. If the EC status is what we expect, it is not treated
as a timeout. Only when the EC status is not what we expect is it treated
as a timeout, which means that the EC controller can't respond in time.
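The effect of the extra check can be modelled in userspace with a fake clock. Everything below is hypothetical (the names, the tick-based clock, the return values) and it ignores jiffies wraparound; it only shows why one last status check after the deadline avoids a bogus timeout when a sleep overshoots.

```c
#include <assert.h>

/* Simulated clock and device state. */
static unsigned long demo_now;          /* stand-in for jiffies */
static unsigned long demo_ready_at;     /* tick at which the status is ready */
static unsigned long demo_sleep_ticks;  /* how long one sleep really takes */

static int demo_check_status(void)
{
    return demo_now >= demo_ready_at;
}

static void demo_sleep(void)
{
    demo_now += demo_sleep_ticks;       /* the scheduler may oversleep */
}

/* Returns 0 on success, -1 on timeout. */
static int demo_poll(unsigned long timeout, int recheck_after_timeout)
{
    unsigned long deadline = demo_now + timeout;

    assert(demo_sleep_ticks > 0);       /* otherwise the loop never ends */
    while (demo_now < deadline) {
        if (demo_check_status())
            return 0;
        demo_sleep();
    }
    /* The fix: look once more before declaring a timeout. */
    if (recheck_after_timeout && demo_check_status())
        return 0;
    return -1;
}
```

If the device becomes ready at tick 3 but a single sleep lasts 10 ticks against a 5-tick timeout, the loop alone reports a timeout while the version with the final check succeeds.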
Zhao Yakui [Tue, 12 Aug 2008 02:40:10 +0000 (10:40 +0800)]
ACPI : Add the EC dmi table to fix the incorrect ECDT table
On some ASUS laptops the ECDT gives incorrect command/status and data I/O
register addresses.
AK: it seems like the command/data addresses are exchanged.
In such cases the EC device can't be
initialized correctly.
Add an EC dmi table to fix this issue. If the laptop matches an entry in the
EC dmi table, the EC command/data I/O addresses are corrected.
Holger Macht [Wed, 6 Aug 2008 15:56:01 +0000 (17:56 +0200)]
ACPI: Properly clear flags on false-positives and send uevent on sudden unplug
Some devices emit an ACPI_NOTIFY_DEVICE_CHECK while physically unplugging
even if the software undock has already been done and dock_present() check
fails. However, the internal flags need to be cleared (complete_undock()).
Also, notify userspace even if the dock station suddenly went away
without proper software undocking.
Carlos Corbacho [Wed, 6 Aug 2008 18:13:56 +0000 (19:13 +0100)]
acer-wmi: Fix wireless and bluetooth on early AMW0 v2 laptops
In the old acer_acpi, I discovered that on some of the newer AMW0 laptops
that supported the WMID methods, they don't work properly for setting the
wireless and bluetooth values.
So for the AMW0 V2 laptops, we want to use both the 'old' AMW0 and the
'new' WMID methods for setting wireless & bluetooth to guarantee we always
enable it.
This was fixed in acer_acpi some time ago, but I forgot to port the patch
over to acer-wmi when it was merged.
(Without this patch, early AMW0 V2 laptops such as the Aspire 5040 won't
work with acer-wmi, whereas they did with the old acer_acpi).
AK: fix compilation
Signed-off-by: Carlos Corbacho <carlos@strangeworlds.co.uk> CC: stable@kernel.org Signed-off-by: Andi Kleen <ak@linux.intel.com>
Bob Moore [Mon, 4 Aug 2008 03:13:01 +0000 (11:13 +0800)]
ACPICA: Additional error checking for pathname utilities
Add error check after all calls to acpi_ns_get_pathname_length.
Add status return from acpi_ns_build_external_path and check after
all calls. Add parameter validation to acpi_ut_initialize_buffer.
Reported by and initial patch by Ingo Molnar.
http://lkml.org/lkml/2008/7/21/176
Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Bob Moore [Fri, 4 Jul 2008 02:41:41 +0000 (10:41 +0800)]
ACPICA: Fix memory leak when deleting thermal/processor objects
Fixes a possible memory leak when thermal and processor objects
are deleted. Any associated notify handlers (and objects) were
not being deleted. Fiodor Suietov. BZ 506
Signed-off-by: Fiodor Suietov <fiodor.f.suietov@intel.com> Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Jarek Poplawski [Fri, 15 Aug 2008 00:01:10 +0000 (17:01 -0700)]
pkt_sched: Fix unlocking in tc_ctl_tfilter()
Fix a bug with spin_lock_bh() inserted instead of spin_unlock_bh() by
some recent patch.
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sven Wegener [Wed, 13 Aug 2008 22:47:16 +0000 (00:47 +0200)]
ipvs: Create init functions for estimator code
Commit 8ab19ea36c5c5340ff598e4d15fc084eb65671dc ("ipvs: Fix possible deadlock
in estimator code") fixed a deadlock condition, but that condition can only
happen during unload of IPVS, because during normal operation there is at least
our global stats structure in the estimator list. The mod_timer() and
del_timer_sync() calls are actually initialization and cleanup code in
disguise. Let's make it explicit and move them to their own init and cleanup
function.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>
Sven Wegener [Mon, 11 Aug 2008 19:36:06 +0000 (19:36 +0000)]
ipvs: Only call init_service, update_service and done_service for schedulers if defined
There are schedulers that only schedule based on data available in the service
or destination structures and they don't need any persistent storage or
initialization routine. These schedulers currently provide dummy functions for
the init_service, update_service and/or done_service functions. For the
init_service and done_service cases we already have code that only calls these
functions, if the scheduler provides them. Do the same for the update_service
case and remove the dummy functions from all schedulers.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>
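The "call it only if the scheduler defines it" pattern from the entry above, as a standalone sketch. The struct layouts and function names are hypothetical and much smaller than the real IPVS ones.

```c
#include <assert.h>
#include <stddef.h>

struct demo_service {
    int updates;    /* how many times a scheduler hook touched this service */
};

/* Hypothetical ops table; a stateless scheduler leaves unused hooks NULL. */
struct demo_scheduler {
    int (*init_service)(struct demo_service *svc);
    int (*update_service)(struct demo_service *svc);
    int (*done_service)(struct demo_service *svc);
};

/* Example hook for a scheduler that does keep per-service state. */
static int demo_count_update(struct demo_service *svc)
{
    svc->updates++;
    return 0;
}

/* Call the hook only if it is defined; a missing hook counts as success. */
static int demo_update_service(const struct demo_scheduler *sched,
                               struct demo_service *svc)
{
    assert(sched != NULL && svc != NULL);
    if (sched->update_service)
        return sched->update_service(svc);
    return 0;
}
```

A scheduler that needs no persistent storage can then drop its dummy functions and simply leave the pointers unset.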
Brian Haley [Thu, 14 Aug 2008 22:33:21 +0000 (15:33 -0700)]
netns: Add network namespace argument to rt6_fill_node() and ipv6_dev_get_saddr()
ipv6_dev_get_saddr() blindly de-references dst_dev to get the network
namespace, but some callers might pass NULL. Change callers to pass a
namespace pointer instead.
Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 14 Aug 2008 22:30:14 +0000 (15:30 -0700)]
bnx2: Reinsert VLAN tag when necessary.
In certain cases when ASF or other management firmware is running, the
chip may be configured to always strip out the VLAN tag even when
VLAN acceleration is not enabled. This causes some VLAN tagged
packets to be received by the host stack without any knowledge that
the original packet was VLAN tagged.
We fix this by re-inserting the VLAN tag into the packet when necessary.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
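Re-inserting a tag means opening a 4-byte gap after the two MAC addresses and writing the 802.1Q TPID (0x8100) followed by the tag control information. The sketch below does this on a flat byte buffer; the driver itself works on socket buffers, so the function name, the buffer handling and the way the tag value reaches the function are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN   6       /* bytes in one MAC address */
#define DEMO_VLAN_HLEN  4       /* bytes in one 802.1Q tag */
#define DEMO_TPID       0x8100  /* 802.1Q tag protocol identifier */

/*
 * Re-insert a VLAN tag into an untagged Ethernet frame held in buf.
 * buf must have room for len + DEMO_VLAN_HLEN bytes.
 * Returns the new frame length.
 */
static size_t demo_reinsert_vlan_tag(uint8_t *buf, size_t len, uint16_t tci)
{
    const size_t off = 2 * DEMO_ETH_ALEN;   /* tag goes after dst + src MAC */

    assert(buf != NULL && len >= off + 2);  /* need MACs plus EtherType */

    /* Shift EtherType and payload right to open a 4-byte gap. */
    memmove(buf + off + DEMO_VLAN_HLEN, buf + off, len - off);

    buf[off]     = (uint8_t)(DEMO_TPID >> 8);
    buf[off + 1] = (uint8_t)(DEMO_TPID & 0xff);
    buf[off + 2] = (uint8_t)(tci >> 8);     /* network byte order */
    buf[off + 3] = (uint8_t)(tci & 0xff);

    return len + DEMO_VLAN_HLEN;
}
```

After the call the original EtherType sits directly behind the new tag, which is what a stack that parses VLAN headers in software expects to see.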
Michael Chan [Thu, 14 Aug 2008 22:29:39 +0000 (15:29 -0700)]
bnx2: Use proper CONFIG_VLAN_8021Q to compile the VLAN code.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 14 Aug 2008 22:29:09 +0000 (15:29 -0700)]
bnx2: Fix logic to setup VLAN rx tagging.
We should now be checking BNX2_FLAG_CAN_KEEP_VLAN to determine how
to set the VLAN rx tagging in the RX_MODE register.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: Benjamin Li <benli@broadcom.com> Signed-off-by: Matt Carlson <mcarlson@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peter Zijlstra [Thu, 14 Aug 2008 13:49:00 +0000 (15:49 +0200)]
sched: fix rt-bandwidth hotplug race
When we hot-unplug a cpu and rebuild the sched-domain, all cpus will be
detached. Alex observed the case where a runqueue was stealing bandwidth
from an already disabled runqueue to satisfy its own needs.
Stop this by skipping over already disabled runqueues.
Reported-by: Alex Nixon <alex.nixon@citrix.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Alex Nixon <alex.nixon@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
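A toy model of the fix: when one runqueue borrows bandwidth from its neighbours, it skips any that are already disabled. The struct, the "disabled" flag and the borrowing policy are hypothetical simplifications; the scheduler's real bookkeeping and locking are not shown.

```c
#include <assert.h>
#include <stddef.h>

struct demo_rq {
    int disabled;   /* set once the runqueue has been taken out of service */
    long spare;     /* bandwidth this runqueue could lend */
};

/* Borrow up to "want" units for rqs[self]; returns the amount obtained. */
static long demo_borrow(struct demo_rq *rqs, size_t n, size_t self, long want)
{
    long got = 0;
    size_t i;

    assert(rqs != NULL && self < n && want >= 0);
    for (i = 0; i < n && got < want; i++) {
        long take;

        /* Never steal from ourselves or from a disabled runqueue. */
        if (i == self || rqs[i].disabled)
            continue;

        take = rqs[i].spare < want - got ? rqs[i].spare : want - got;
        if (take <= 0)
            continue;
        rqs[i].spare -= take;
        got += take;
    }
    return got;
}
```

In this model a disabled runqueue keeps whatever figure it holds, so its accounting cannot be disturbed after it has been shut down.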
Sanjeev Premi [Wed, 13 Aug 2008 16:52:13 +0000 (22:22 +0530)]
Updates to omap3_evm_defconfig.
This patch updates the defconfig to 2.6.27-rc2 tag.
The updates are based on omap_3430sdp_defconfig.
Fixes the build issue reported earlier:
LD init/built-in.o
LD .tmp_vmlinux1
arm-none-linux-gnueabi-ld: no machine record defined
arm-none-linux-gnueabi-ld: no machine record defined
make: *** [.tmp_vmlinux1] Error 1
Signed-off-by: Sanjeev Premi <premi@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com>
David Howells [Thu, 14 Aug 2008 10:37:28 +0000 (11:37 +0100)]
security: Fix setting of PF_SUPERPRIV by __capable()
Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags
of the target process if that is not the current process and it is trying to
change its own flags in a different way at the same time.
__capable() is using neither atomic ops nor locking to protect t->flags. This
patch removes __capable() and introduces has_capability() that doesn't set
PF_SUPERPRIV on the process being queried.
This patch further splits security_ptrace() in two:
(1) security_ptrace_may_access(). This passes judgement on whether one
process may access another only (PTRACE_MODE_ATTACH for ptrace() and
PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
current is the parent.
(2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
and takes only a pointer to the parent process. current is the child.
In Smack and commoncap, this uses has_capability() to determine whether
the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
This does not set PF_SUPERPRIV.
Two of the instances of __capable() actually only act on current, and so have
been changed to calls to capable().
Of the places that were using __capable():
(1) The OOM killer calls __capable() thrice when weighing the killability of a
process. All of these now use has_capability().
(2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
whether the parent was allowed to trace any process. As mentioned above,
these have been split. For PTRACE_ATTACH and /proc, capable() is now
used, and for PTRACE_TRACEME, has_capability() is used.
(3) cap_safe_nice() only ever saw current, so now uses capable().
(4) smack_setprocattr() rejected accesses to tasks other than current just
after calling __capable(), so the order of these two tests has been
switched and capable() is used instead.
(5) In smack_file_send_sigiotask(), we need to allow privileged processes to
receive SIGIO on files they're manipulating.
(6) In smack_task_wait(), we let a process wait for a privileged process,
whether or not the process doing the waiting is privileged.
I've tested this with the LTP SELinux and syscalls testscripts.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>
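The split between the two kinds of check can be sketched as follows. This is a userspace model with hypothetical types and a placeholder flag value: the query on an arbitrary task takes a const pointer and cannot write to it, while the check on the current task records the use of privilege in that task's own flags.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PF_SUPERPRIV 0x00000100u   /* placeholder flag value */

struct demo_task {
    unsigned int flags;
    unsigned int caps;  /* bitmask of held capabilities */
};

static struct demo_task *demo_current;  /* stand-in for the running task */

/* Pure query on any task: never writes to the task being inspected. */
static int demo_has_capability(const struct demo_task *t, unsigned int cap)
{
    assert(t != NULL && cap < 32);
    return (int)((t->caps >> cap) & 1u);
}

/* Check on current only: marks current as having used privilege. */
static int demo_capable(unsigned int cap)
{
    if (!demo_has_capability(demo_current, cap))
        return 0;
    demo_current->flags |= DEMO_PF_SUPERPRIV;
    return 1;
}
```

Because only the running task ever modifies its own flags in this model, no other task can race with it on that word.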
Thomas Gleixner [Thu, 14 Aug 2008 10:17:06 +0000 (12:17 +0200)]
x86: hpet: workaround SB700 BIOS
AMD SB700 based systems with spread spectrum enabled use a SMM based
HPET emulation to provide proper frequency setting. The SMM code is
initialized with the first HPET register access and takes some time to
complete. During this time the config register reads 0xffffffff. We
check for max. 1000 loops whether the config register reads a non
0xffffffff value to make sure that HPET is up and running before we go
further. A counting loop is safe, as the HPET access takes thousands
of CPU cycles. On non SB700 based machines this check is only done
once and has no side effects.
Based on a quirk patch from: crane cai <crane.cai@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
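The bounded wait described above, modelled in userspace. demo_read_cfg() is a hypothetical stand-in for the config register read; it returns 0xffffffff for a configurable number of reads to imitate the SMM code still initializing.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated SMM init delay, counted in register reads. */
static unsigned int demo_reads_until_ready;

static uint32_t demo_read_cfg(void)
{
    if (demo_reads_until_ready > 0) {
        demo_reads_until_ready--;
        return 0xFFFFFFFFu;     /* emulation not ready yet */
    }
    return 0;                   /* any value other than 0xffffffff */
}

/*
 * Returns 1 once the register reads something other than 0xffffffff,
 * 0 if it still reads 0xffffffff after max_loops attempts.
 */
static int demo_wait_ready(unsigned int max_loops)
{
    unsigned int i;

    assert(max_loops > 0);
    for (i = 0; i < max_loops; i++) {
        if (demo_read_cfg() != 0xFFFFFFFFu)
            return 1;
    }
    return 0;
}
```

On a machine where the register is valid immediately, the loop body runs once, which matches the note that the check has no side effects elsewhere.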
Artem Bityutskiy [Tue, 12 Aug 2008 13:30:12 +0000 (16:30 +0300)]
UBIFS: xattr bugfixes
Xattr code has not been tested for a while and there were
several bugs. One of them is using wrong inode in
'ubifs_jnl_change_xattr()'. The other is a deadlock in
'ubifs_setxattr()': the i_mutex is locked in
'cap_inode_need_killpriv()' path, so deadlock happens when
'ubifs_setxattr()' tries to lock it again.
Yinghai Lu [Thu, 14 Aug 2008 09:16:30 +0000 (02:16 -0700)]
x86: check bigsmp in smp_sanity_check instead of cpu_up
clear bits for cpu nr > 8.
This allows us to boot the full range of possible CPUs that the
supported APIC model will allow. Previously we'd hang or boot up
with fewer than 8 CPUs.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Tested-by: Jeff Chua <jeff.chua.linux@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Max Krasnyansky [Mon, 11 Aug 2008 21:33:53 +0000 (14:33 -0700)]
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all misformatted comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested it with the suspend/resume path and everything is working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: menage@google.com Cc: a.p.zijlstra@chello.nl Cc: vegard.nossum@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
Max Krasnyansky [Mon, 11 Aug 2008 21:55:31 +0000 (14:55 -0700)]
x86: resurrect proper handling of maxcpus= kernel option (v2)
For some reason we had two parsers registered for maxcpus=. One in init/main.c
and another in arch/x86/smpboot.c. So I nuked the one in arch/x86.
Also 64-bit kernels used to handle maxcpus= as documented in
Documentation/cpu-hotplug.txt. CPUs with 'id > maxcpus' are initialized
but not booted. 32-bit version for some reason ignored them even though
all the infrastructure for booting them later is there.
In the current mainline both 64 and 32 bit versions are broken.
This patch restores the correct behaviour. I've tested x86_64 version on
4- and 8- way Core2 and 2-way Opteron based machines. Various config
combinations SMP, !SMP, CPU_HOTPLUG, !CPU_HOTPLUG.
Booted with maxcpus=1 and maxcpus=4, etc. Everything is working as expected.
So far we've received two reports from different people confirming that 32-bit
version also works fine, both on dual core laptops and 16way server machines.
[v2: This version fixes visws breakage pointed out by Ingo.]
Zhang, Yanmin [Wed, 14 Aug 2030 07:56:40 +0000 (15:56 +0800)]
sched: fix the race between walk_tg_tree and sched_create_group
With 2.6.27-rc3, I hit a kernel panic when running volanoMark on my
new x86_64 machine. I also hit it with other 2.6.27-rc kernels.
See below log.
Basically, the functions walk_tg_tree and sched_create_group race
between accessing and initializing tg->children. The patch below fixes it
by moving the tg->children initialization ahead of linking tg->siblings
to parent->children.
Suresh Siddha [Wed, 13 Aug 2008 18:38:14 +0000 (11:38 -0700)]
x86, xsave: clear the user buffer before doing fxsave/xsave
fxsave/xsave instructions will not touch all the bytes in the
fxsave/xsave frame. Clear the user buffer before doing fxsave/xsave
directly to user buffer during the sigcontext setup.
This is essentially needed in the context of xsave (for example,
some of the fields in the xsave header are not touched by xsave
and are defined as must-be-zero).
This will also present uniform and clean context to the user (from
which user can safely do fxrstor/xrstor).
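Why clearing first matters can be shown with a save routine that writes only part of its frame. demo_partial_save() is a hypothetical stand-in for the save instruction, and the frame size and fill patterns are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_FRAME_SIZE 64

/* Stand-in for the save step: it writes only the first 16 bytes. */
static void demo_partial_save(unsigned char *frame)
{
    memset(frame, 0xAA, 16);    /* bytes 16..63 are left untouched */
}

/* Clear the whole frame first so untouched bytes read back as zero. */
static void demo_save_to_user(unsigned char *frame, size_t size)
{
    assert(frame != NULL && size >= DEMO_FRAME_SIZE);
    memset(frame, 0, size);
    demo_partial_save(frame);
}
```

Without the memset() the caller would see whatever the buffer held before in the untouched bytes, including fields that are required to be zero.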
David S. Miller [Thu, 14 Aug 2008 08:45:41 +0000 (01:45 -0700)]
sparc64: Fix cmdline_memory_size handling bugs.
First, lmb_enforce_memory_limit() interprets its argument
(mostly, heh) as a size limit not an address limit. So pass
the raw cmdline_memory_size value into it. And we don't
need to check it against zero, lmb_enforce_memory_limit() does
that for us.
Next, free_initmem() needs special handling when the kernel
command line trims the available memory. The problem case is
if the trimmed out memory is where the kernel image itself
resides.
When that memory is trimmed out, we don't add those physical
ram areas to the sparsemem active ranges, amongst other things.
Which means that this free_initmem() code will free up invalid
page structs, resulting in either crashes or hangs.
Just quick fix this by not freeing initmem at all if "mem="
was given on the boot command line.
Signed-off-by: David S. Miller <davem@davemloft.net>