pilppa.org Git - linux-2.6-omap-h63xx.git/log

pkt_sched: ERR_PTR() usually encodes a negative errno, not positive.

Note, in the following patch, 'err' is initialized as:

int err = -ENOBUFS;
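
Why the sign matters can be seen in a small user-space sketch of the
error-pointer helpers. The definitions below are modelled on the kernel's
err.h but are re-declared here for illustration only; err_ptr_demo() is a
hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* largest errno an error pointer can carry */

/* Encode a (negative) errno in a pointer value. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for the top MAX_ERRNO addresses, i.e. values -MAX_ERRNO..-1. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Self-check: only the negative encoding is recognised as an error. */
static inline void err_ptr_demo(void)
{
	int err = -ENOBUFS;

	assert(IS_ERR(ERR_PTR(err)));
	assert(PTR_ERR(ERR_PTR(err)) == -ENOBUFS);
	/* A positive errno lands in low memory and passes for a valid pointer. */
	assert(!IS_ERR(ERR_PTR(-err)));
}
```

With 'err' already negative, ERR_PTR(err) is the correct form; ERR_PTR(-err)
would produce a value that IS_ERR() does not detect.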

Signed-off-by: WANG Cong <wcong@critical-links.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdevice: Fix typo of dev_unicast_add() comment

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af_unix: fix 'poll for write'/connected DGRAM sockets

For n:1 'datagram connections' (e.g. /dev/log), the unix_dgram_sendmsg
routine implements a form of receiver-imposed flow control by
comparing the length of the receive queue of the 'peer socket' with
the max_ack_backlog value stored in the corresponding sock structure,
either blocking the thread which caused the send-routine to be called
or returning EAGAIN. This routine is used by both SOCK_DGRAM and
SOCK_SEQPACKET sockets. The poll-implementation for these socket types
is datagram_poll from core/datagram.c. A socket is deemed to be
writeable by this routine when the memory presently consumed by
datagrams owned by it is less than the configured socket send buffer
size. This is always wrong for PF_UNIX non-stream sockets connected to
server sockets dealing with (potentially) multiple clients if the
abovementioned receive queue is currently considered to be full.
'poll' will then return, indicating that the socket is writeable, but
a subsequent write results in EAGAIN, effectively causing a (usual)
application to 'poll for writeability by repeated send request with
O_NONBLOCK set' until it has consumed its time quantum.

The change below uses a suitably modified variant of the datagram_poll
routines for both types of PF_UNIX sockets, which tests if the
recv-queue of the peer a socket is connected to is presently
considered to be 'full' as part of the 'is this socket
writeable'-checking code. The socket being polled is additionally
put onto the peer_wait wait queue associated with its peer, because the
unix_dgram_recvmsg routine does a wake up on this queue after a
datagram was received and the 'other wakeup call' is done implicitly
as part of skb destruction, meaning, a process blocked in poll
because of a full peer receive queue could otherwise sleep forever
if no datagram owned by its socket was already sitting on this queue.
Among this change is a small (inline) helper routine named
'unix_recvq_full', which consolidates the actual testing code (in three
different places) into a single location.
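
The difference between the old and the new writeability test can be sketched
in user space as follows. struct sock_sketch and its fields are hypothetical,
heavily reduced stand-ins for the kernel structures, and the comparison in
unix_recvq_full() is an assumption based on the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct sock. */
struct sock_sketch {
	unsigned int recv_queue_len;	/* datagrams waiting in the receive queue */
	unsigned int max_ack_backlog;	/* receiver-imposed limit */
	unsigned int wmem_alloc;	/* send memory currently consumed */
	unsigned int sndbuf;		/* configured send buffer size */
	struct sock_sketch *peer;	/* connected peer, or NULL */
};

/* Single place for the 'receive queue is full' test. */
static inline bool unix_recvq_full(const struct sock_sketch *sk)
{
	assert(sk != NULL);
	return sk->recv_queue_len > sk->max_ack_backlog;
}

/* Old test: only looks at the sender's own buffer usage. */
static inline bool writable_old(const struct sock_sketch *sk)
{
	return sk->wmem_alloc < sk->sndbuf;
}

/* New test: additionally require that the peer can accept a datagram. */
static inline bool writable_new(const struct sock_sketch *sk)
{
	if (!writable_old(sk))
		return false;
	return !(sk->peer != NULL && unix_recvq_full(sk->peer));
}
```

A client whose send buffer is empty but whose peer queue is over the backlog
is reported writeable by writable_old() and not writeable by writable_new(),
which is the case that previously led to the busy loop.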

Signed-off-by: Rainer Weikusat <rweikusat@mssgmbh.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix for splice receive when used with software LRO

If an skb has nr_frags set to zero but its frag_list is not empty (as
it can happen if software LRO is enabled), and a previous
tcp_read_sock has consumed the linear part of the skb, then
__skb_splice_bits:

(a) incorrectly reports an error and

(b) forgets to update the offset to account for the linear part

Either of the two problems will cause the subsequent __skb_splice_bits
call (the one that handles the frag_list skbs) to either skip data,
or, if the unadjusted offset is greater than the size of the next skb
in the frag_list, make tcp_splice_read loop forever.
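
The offset bookkeeping in (b) can be illustrated with a simplified model in
which segment 0 stands for the linear part and the remaining segments stand
for the frag_list skbs. This is not the kernel code; struct seg_pos and
locate_offset() are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Result of mapping a whole-skb offset onto one segment. */
struct seg_pos {
	size_t index;	/* segment the offset falls into */
	size_t offset;	/* offset within that segment */
	int found;	/* 0 if the offset lies past the end of the data */
};

static inline struct seg_pos locate_offset(const size_t *seg_len, size_t nsegs,
					   size_t offset)
{
	struct seg_pos pos = { 0, 0, 0 };

	assert(seg_len != NULL || nsegs == 0);
	for (size_t i = 0; i < nsegs; i++) {
		if (offset < seg_len[i]) {
			pos.index = i;
			pos.offset = offset;
			pos.found = 1;
			return pos;
		}
		/* Segment already consumed: account for it before moving on. */
		offset -= seg_len[i];
	}
	return pos;
}
```

Dropping the subtraction for the linear part reproduces the symptoms
described above: the next segment is entered with an offset that is too
large, so data is skipped or the offset never falls inside any segment.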

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: calculate tcp_mem based on low memory instead of all memory

The tcp_mem array which contains limits on the total amount of memory
used by TCP sockets is calculated based on nr_all_pages. On a 32-bit
x86 system, we should base this on the number of lowmem pages.

Signed-off-by: Miquel van Smoorenburg <miquels@cistron.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>

hamradio: remove unused variable

Signed-off-by: Andre Haupt <andre@bitwigglers.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

md: rationalize raid5 function names

From: Dan Williams <dan.j.williams@intel.com>

Commit a4456856 refactored some of the deep code paths in raid5.c into separate
functions. The names chosen at the time do not consistently indicate what is
going to happen to the stripe. So, update the names, and since a stripe is a
cache element use cache semantics like fill, dirty, and clean.

(also, fix up the indentation in fetch_block5)

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: handle operation chaining in raid5_run_ops

From: Dan Williams <dan.j.williams@intel.com>

Neil said:
> At the end of ops_run_compute5 you have:
>         /* ack now if postxor is not set to be run */
>         if (tx && !test_bit(STRIPE_OP_POSTXOR, &s->ops_run))
>                 async_tx_ack(tx);
>
> It looks odd having that test there.  Would it fit in raid5_run_ops
> better?

The intended global interpretation is that raid5_run_ops can build a chain
of xor and memcpy operations.  When MD registers the compute-xor it tells
async_tx to keep the operation handle around so that another item in the
dependency chain can be submitted. If we are just computing a block to
satisfy a read then we can terminate the chain immediately.  raid5_run_ops
gives a better context for this test since it cares about the entire chain.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: replace R5_WantPrexor with R5_WantDrain, add 'prexor' reconstruct_states

From: Dan Williams <dan.j.williams@intel.com>

Currently ops_run_biodrain and other locations have extra logic to determine
which blocks are processed in the prexor and non-prexor cases. This can be
eliminated if handle_write_operations5 flags the blocks to be processed in all
cases via R5_WantDrain. The presence of the prexor operation is tracked in
sh->reconstruct_state.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states'

From: Dan Williams <dan.j.williams@intel.com>

Track the state of reconstruct operations (recalculating the parity block
usually due to incoming writes, or as part of array expansion). Reduces the
scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether
a reconstruct operation has been requested via the ops_request field of struct
stripe_head_state.

This is the final step in the removal of ops.{pending,ack,complete,count}, i.e.
the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do
not track the state of the operation.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: replace STRIPE_OP_COMPUTE_BLK with STRIPE_COMPUTE_RUN

From: Dan Williams <dan.j.williams@intel.com>

Track the state of compute operations (recalculating a block from all the other
blocks in a stripe) with a state flag. Reduces the scope of the
STRIPE_OP_COMPUTE_BLK flag to only tracking whether a compute operation has
been requested via the ops_request field of struct stripe_head_state.

Note, the compute operation that is performed in the course of doing a 'repair'
operation (check the parity block, recalculate it and write it back if the
check result is not zero) is tracked separately with the 'check_state'
variable. Compute operations are held off while a 'check' is in progress, and
by moving this check out to handle_issuing_new_read_requests5, the helper routine
__handle_issuing_new_read_requests5 can be simplified.

This is another step towards the removal of ops.{pending,ack,complete,count},
i.e. STRIPE_OP_COMPUTE_BLK only requests an operation and does not track the
state of the operation.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: replace STRIPE_OP_BIOFILL with STRIPE_BIOFILL_RUN

From: Dan Williams <dan.j.williams@intel.com>

Track the state of read operations (copying data from the stripe cache to bio
buffers outside the lock) with a state flag. Reduce the scope of the
STRIPE_OP_BIOFILL flag to only tracking whether a biofill operation has been
requested via the ops_request field of struct stripe_head_state.

This is another step towards the removal of ops.{pending,ack,complete,count},
i.e. STRIPE_OP_BIOFILL only requests an operation and does not track the state
of the operation.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: replace STRIPE_OP_CHECK with 'check_states'

From: Dan Williams <dan.j.williams@intel.com>

The STRIPE_OP_* flags record the state of stripe operations which are
performed outside the stripe lock.  Their use in indicating which
operations need to be run is straightforward; however, interpolating what
the next state of the stripe should be based on a given combination of
these flags is not straightforward, and has led to bugs.  An easier to read
implementation with minimal degrees of freedom is needed.

Towards this goal, this patch introduces explicit states to replace what was
previously interpolated from the STRIPE_OP_* flags.  For now this only converts
the handle_parity_checks5 path, removing a user of the
ops.{pending,ack,complete,count} fields of struct stripe_operations.
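
As an illustration of what 'explicit states' means here, the sketch below
models a check/repair cycle as a small state machine. The state and event
names and the transitions are assumptions made for this example; they are
not copied from raid5.c.

```c
#include <assert.h>

/* Hypothetical states of a parity check with optional repair. */
enum check_state_sketch {
	CHECK_IDLE,
	CHECK_RUN,		/* parity check submitted */
	CHECK_RESULT,		/* check finished, result available */
	CHECK_COMPUTE_RUN,	/* parity recomputation submitted (repair) */
	CHECK_COMPUTE_RESULT,	/* recomputed parity ready for write back */
};

/* Hypothetical events that drive the machine. */
enum check_event_sketch {
	EV_START, EV_OP_DONE, EV_PARITY_OK, EV_PARITY_BAD, EV_WRITTEN
};

/* One transition per (state, event); anything else leaves the state alone. */
static inline enum check_state_sketch
check_next(enum check_state_sketch s, enum check_event_sketch ev)
{
	switch (s) {
	case CHECK_IDLE:
		return ev == EV_START ? CHECK_RUN : s;
	case CHECK_RUN:
		return ev == EV_OP_DONE ? CHECK_RESULT : s;
	case CHECK_RESULT:
		if (ev == EV_PARITY_OK)
			return CHECK_IDLE;
		if (ev == EV_PARITY_BAD)
			return CHECK_COMPUTE_RUN;
		return s;
	case CHECK_COMPUTE_RUN:
		return ev == EV_OP_DONE ? CHECK_COMPUTE_RESULT : s;
	case CHECK_COMPUTE_RESULT:
		return ev == EV_WRITTEN ? CHECK_IDLE : s;
	}
	assert(0 && "unhandled state");
	return s;
}
```

Because the next state is a function of one current state and one event,
there is no combination of flags left to interpret.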

This conversion also found a remaining issue with the current code.  There is
a small window for a drive to fail between when we schedule a repair and when
the parity calculation for that repair completes.  When this happens we will
writeback to 'failed_num' when we really want to write back to 'pd_idx'.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: unify raid5/6 i/o submission

From: Dan Williams <dan.j.williams@intel.com>

Let the raid6 path call ops_run_io to get pending i/o submitted.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: use stripe_head_state in ops_run_io()

From: Dan Williams <dan.j.williams@intel.com>

In handle_stripe after taking sh->lock we sample some bits into 's' (struct
stripe_head_state):

s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);

Use these values from 's' in ops_run_io() rather than re-sampling the bits.
This ensures a consistent snapshot (as seen under sh->lock) is used.
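
A user-space sketch of the snapshot idea follows. The bit numbers, the
*_sketch types and any_sync_activity() are hypothetical; only the three flag
names come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers; the kernel's values differ. */
enum { STRIPE_SYNCING, STRIPE_EXPAND_SOURCE, STRIPE_EXPAND_READY };

struct stripe_head_state_sketch {
	bool syncing, expanding, expanded;
};

static inline bool test_bit_sketch(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* Sample the bits once (in the kernel: while holding sh->lock). */
static inline struct stripe_head_state_sketch
sample_state(const unsigned long *state)
{
	assert(state != NULL);
	struct stripe_head_state_sketch s = {
		.syncing   = test_bit_sketch(STRIPE_SYNCING, state),
		.expanding = test_bit_sketch(STRIPE_EXPAND_SOURCE, state),
		.expanded  = test_bit_sketch(STRIPE_EXPAND_READY, state),
	};
	return s;
}

/* Later consumers decide from the snapshot, never from the live word. */
static inline bool any_sync_activity(const struct stripe_head_state_sketch *s)
{
	return s->syncing || s->expanding || s->expanded;
}
```

If the live state word changes after sampling, every consumer of 's' still
sees the same answer, which is the consistency this patch is after.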

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: kill STRIPE_OP_IO flag

From: Dan Williams <dan.j.williams@intel.com>

The R5_Want{Read,Write} flags already gate i/o. So, this flag is
superfluous and we can unconditionally call ops_run_io().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

md: kill STRIPE_OP_MOD_DMA in raid5 offload

From: Dan Williams <dan.j.williams@intel.com>

This micro-optimization allowed the raid code to skip a re-read of the
parity block after checking parity.  It took advantage of the fact that
xor-offload-engines have their own internal result buffer and can check
parity without writing to memory.  Remove it for the following reasons:

1/ It is a layering violation for MD to need to manage the DMA and
   non-DMA paths within async_xor_zero_sum
2/ Bad precedent to toggle the 'ops' flags outside the lock
3/ Hard to realize a performance gain as reads will not need an updated
   parity block and writes will dirty it anyways.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>

Support changing rdev size on running arrays.

From: Chris Webb <chris@arachsys.com>

Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
superblock if necessary for this metadata version. We prevent the available
space from shrinking to less than the used size, and allow it to be set to zero
to fill all the available space on the underlying device.

Signed-off-by: Chris Webb <chris@arachsys.com>
Signed-off-by: Neil Brown <neilb@suse.de>

Make sure all changes to md/dev-XX/state are notified

The important state change happens during an interrupt
in md_error. So just set a flag there and call sysfs_notify
later in process context.
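
The pattern can be sketched in user space with a flag that is set on the
'interrupt' side and consumed on the 'process' side. struct rdev_sketch and
both function names are hypothetical, and the counter merely stands in for
the sysfs_notify call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct rdev_sketch {
	atomic_bool state_changed;	/* set where notifying is not allowed */
	int notifications;		/* stands in for sysfs_notify calls */
};

/* Interrupt-context side: only record that something changed. */
static inline void md_error_sketch(struct rdev_sketch *rdev)
{
	assert(rdev != NULL);
	atomic_store(&rdev->state_changed, true);
}

/* Process-context side: notify once if a change is pending. */
static inline void process_context_sketch(struct rdev_sketch *rdev)
{
	assert(rdev != NULL);
	if (atomic_exchange(&rdev->state_changed, false))
		rdev->notifications++;
}
```

Several errors before the next process-context pass collapse into a single
notification, and a pass with nothing pending does nothing.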

Signed-off-by: Neil Brown <neilb@suse.de>

Make sure all changes to md/degraded are notified.

When a device fails, when a spare is activated, when
an array is reshaped, or when an array is started,
the extent to which the array is degraded can change.

Signed-off-by: Neil Brown <neilb@suse.de>

Make sure all changes to md/sync_action are notified.

When the 'resync' thread starts or stops, when we explicitly
set sync_action, or when we determine that there is definitely nothing
to do, we notify sync_action.

To stop "sync_action" from occasionally showing the wrong value,
we introduce a new flag - MD_RECOVERY_RECOVER - to say that a
recovery is probably needed or happening, and we make sure
that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED.

Signed-off-by: Neil Brown <neilb@suse.de>

Make sure all changes to md/array_state are notified.

Changes in md/array_state could be of interest to a monitoring
program.  So make sure all changes trigger a notification.

Exceptions:
   changing active_idle to active is not reported because it
      is frequent and not interesting.
   changing active to active_idle is only reported on arrays
      with externally managed metadata, as it is not interesting
      otherwise.

Signed-off-by: Neil Brown <neilb@suse.de>

Don't reject HOT_REMOVE_DISK request for an array that is not yet started.

There is really no need for this test here, and there are valid
cases for selectively removing devices from an array that
is not actually active.

Signed-off-by: Neil Brown <neilb@suse.de>

rationalise return value for ->hot_add_disk method.

For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.
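
A minimal sketch of the unified convention and a matching call site. The
function names and the choice of -EEXIST are assumptions for illustration,
not the actual md code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical personality method: 0 on success, -errno on failure. */
static inline int hot_add_disk_sketch(int free_slots)
{
	if (free_slots <= 0)
		return -EEXIST;	/* illustrative errno */
	return 0;
}

/* Call site: non-zero now always means failure and is passed on as is. */
static inline int add_spare_sketch(int free_slots, int *spares)
{
	int err;

	assert(spares != NULL);
	err = hot_add_disk_sketch(free_slots);
	if (err)
		return err;
	(*spares)++;
	return 0;
}
```

Under the old non-linear convention the same call site would have had to
treat 0 as the failure case, which is the confusion being removed.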

Signed-off-by: Neil Brown <neilb@suse.de>

Support adding a spare to a live md array with external metadata.

i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill a vacant slot in an md array.

Signed-off-by: Neil Brown <neilb@suse.de>

Enable setting of 'offset' and 'size' of a hot-added spare.

offset_store and rdev_size_store allow control of the region of a
device which is to be used in an md/raid array.
They only allow these values to be set when an array is being assembled,
as changing them on an active array could be dangerous.
However when adding a spare device to an array, we might need to
set the offset and size before starting recovery. So allow
these values to be set also if "->raid_disk < 0" which indicates that
the device is still a spare.

Signed-off-by: Neil Brown <neilb@suse.de>

Don't try to make md arrays dirty if that is not meaningful.

Array personalities such as 'raid0' and 'linear' have no redundancy,
and so marking them as 'clean' or 'dirty' is not meaningful.
So always allow write requests without requiring a superblock update.

Such array types are detected by ->sync_request being NULL. If it is
not possible to send a sync request we don't need a 'dirty' flag because
all a dirty flag does is trigger some sync_requests.
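
The detection amounts to a NULL test on a method pointer. The sketch below
uses a hypothetical, reduced personality structure; only the ->sync_request
member name is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced personality descriptor. */
struct personality_sketch {
	const char *name;
	int (*sync_request)(void);	/* NULL for personalities without redundancy */
};

/* Placeholder resync method for a redundant personality. */
static inline int stub_sync_request(void)
{
	return 0;
}

/* A 'dirty' flag is only meaningful if a resync could ever be issued. */
static inline bool needs_dirty_tracking(const struct personality_sketch *p)
{
	assert(p != NULL);
	return p->sync_request != NULL;
}
```

For a personality without sync_request the write path can proceed without
first marking the array dirty.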

Signed-off-by: Neil Brown <neilb@suse.de>

Close race in md_probe

There is a possible race in md_probe. If two threads call md_probe
for the same device, then one could exit (having checked that
->gendisk exists) before the other has called kobject_init_and_add,
thus returning an incomplete kobj which will cause problems when
we try to add children to it.

So extend the range of protection of disks_mutex slightly to
avoid this possibility.

Signed-off-by: Neil Brown <neilb@suse.de>

Allow setting start point for requested check/repair

This makes it possible to just resync a small part of an array.
e.g. if a drive reports that it has questionable sectors,
a 'repair' of just the region covering those sectors will
cause them to be read and, if there is an error, re-written
with correct data.

Signed-off-by: Neil Brown <neilb@suse.de>

Improve setting of "events_cleared" for write-intent bitmaps.

When an array is degraded, bits in the write-intent bitmap are not
cleared, so that if the missing device is re-added, it can be synced
by only updating those parts of the device that have changed since
it was removed.

To enable this, an 'events_cleared' value is stored. It is the event
counter for the array the last time that any bits were cleared.

Sometimes - if a device disappears from an array while it is 'clean' -
the events_cleared value gets updated incorrectly (there are subtle
ordering issues between updating events in the main metadata and the
bitmap metadata) resulting in the missing device appearing to require
a full resync when it is re-added.

With this patch, we update events_cleared precisely when we are about
to clear a bit in the bitmap. We record events_cleared when we clear
the bit internally, and copy that to the superblock which is written
out before the bit on storage. This makes it more "obviously correct".

We also need to update events_cleared when the event_count is going
backwards (as happens on a dirty->clean transition of a non-degraded
array).

Thanks to Mike Snitzer for identifying this problem and testing early
"fixes".

Cc: "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>

use bio_endio instead of a call to bi_end_io

Turn calls to bi->bi_end_io() into bio_endio(). Apparently bio_endio does
exactly the same error processing as is hardcoded at these places.

bio_endio() avoids recursion (or will soon), so it should be used.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>

linear: correct disk numbering error check

From: "Nikanth Karthikesan" <knikanth@novell.com>

Correct disk numbering problem check.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Neil Brown <neilb@suse.de>

Fix error paths if md_probe fails.

md_probe can fail (e.g. alloc_disk could fail) without
returning an error (as it always returns NULL).
So when we call mddev_find immediately afterwards, we need
to check that md_probe actually succeeded. This means checking
that mddev->gendisk is non-NULL.

cc: <stable@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>

Don't acknowledge that stripe-expand is complete until it really is.

We shouldn't acknowledge that a stripe has been expanded (when
reshaping a raid5 by adding a device) until the moved data has
actually been written out. However we are currently
acknowledging (by calling md_done_sync) when the POST_XOR
is complete and before the write.

So track in s.locked whether there are pending writes, and don't
call md_done_sync yet if there are.

Note: we also set R5_LOCKED on devices which we are about to
read from. This probably isn't technically necessary, but is
usually done when writing a block, and justifies the use of
s.locked here.

This bug can lead to a crash if an array is stopped while a reshape
is in progress.

Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>

Ensure interrupted recovery completed properly (v1 metadata plus bitmap)

If, while assembling an array, we find a device which is not fully
in-sync with the array, it is important to set the "fullsync" flags.
This is an exact analog to the setting of this flag in hot_add_disk
methods.

Currently, only v1.x metadata supports having devices in an array
which are not fully in-sync (it keeps track of how in sync they are).
The 'fullsync' flag only makes a difference when a write-intent bitmap
is being used. In this case it tells recovery to ignore the bitmap
and recover all blocks.

This fix is already in place for raid1, but not raid5/6 or raid10.

So without this fix, a raid1 or raid4/5/6 array with version 1.x
metadata and a write-intent bitmap, that is stopped in the middle
of a recovery, will appear to complete the recovery instantly
after it is reassembled, but the recovery will not be correct.

If you might have an array like that, issuing
echo repair > /sys/block/mdXX/md/sync_action

will make sure recovery completes properly.

Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>

kbuild: fix a.out.h export to userspace with O= build.

We need to check for existence of the a.out.h header in the source tree,
not the object tree, if we want it to get the right answer with O=.

Signed-off-by: David Woodhouse <david.woodhouse@intel.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>

powerpc: Add dma nodes to 83xx, 85xx and 86xx boards

Added DMA nodes for the elo/elo-plus DMA engines.

Renamed the interrupt controller alias in mpc832x_rdb.dts to ipic so that
it's the same as all the other boards.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>

x86: setup: issue a null command after enabling A20 via KBC

Apparently, DOS and possibly other legacy operating systems issued a
null command to the keyboard controller after toggling A20,
specifically "pulse output pins" with no output pins specified.  This
was presumably done for synchronization reasons.  This has made it
into at least the UHCI spec, and it has been found to cause
compatibility problems when "legacy USB" is enabled (which it almost
always is) to not have this byte sent.

It is *NOT* clear if any of these compatibility problems has any
effect on Linux.  However, for maximum compatibility, issue this null
command after toggling A20 through the KBC.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>

PCI: remove unused arch pcibios_update_resource() functions

Russell King did the following back in 2003:

<--  snip  -->

    [PCI] pci-9: Kill per-architecture pcibios_update_resource()

    Kill pcibios_update_resource(), replacing it with pci_update_resource().
    pci_update_resource() uses pcibios_resource_to_bus() to convert a
    resource to a device BAR - the transformation should be exactly the
    same as the transformation used for the PCI bridges.

    pci_update_resource "knows" about 64-bit BARs, but doesn't attempt to
    set the high 32-bits to anything non-zero - currently no architecture
    attempts to do something different.  If anyone cares, please fix; I'm
    going to reflect current behaviour for the time being.

    Ivan pointed out the following architectures need to examine their
    pcibios_update_resource() implementation - they should make sure that
    this new implementation does the right thing.  #warning's have been
    added where appropriate.

        ia64
        mips
        mips64

    This cset also includes a fix for the problem reported by AKPM where
    64-bit arch compilers complain about the resource mask being placed
    in a u32.

<--  snip  -->

This patch removes the unused pcibios_update_resource() functions the
kernel gained since, from FRV, m68k, mips & sh architectures.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

PCI: fix pci_setup_device()'s sprinting into a const buffer

Make pci_setup_device() write the bus ID directly into the allotted storage,
rather than using pci_name() as the address as that now returns a const
pointer.
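
The idea can be sketched in user space as follows. struct device_sketch,
BUS_ID_SIZE_SKETCH and the function names are hypothetical; name_sketch()
plays the role of an accessor that returns a const pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUS_ID_SIZE_SKETCH 20	/* assumed size of the name storage */

struct device_sketch {
	char bus_id[BUS_ID_SIZE_SKETCH];
};

/* Read-only accessor: callers must not write through the result. */
static inline const char *name_sketch(const struct device_sketch *dev)
{
	return dev->bus_id;
}

/* Format the bus ID into the storage itself, not via the accessor. */
static inline void setup_device_sketch(struct device_sketch *dev,
				       unsigned int domain, unsigned int bus,
				       unsigned int slot, unsigned int func)
{
	assert(dev != NULL);
	snprintf(dev->bus_id, sizeof(dev->bus_id), "%04x:%02x:%02x.%u",
		 domain, bus, slot, func);
}
```

Writing through the member keeps the accessor free to return a const
pointer without forcing a cast at the one place that fills the buffer.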

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

PCI: Fix comment of pci_dynids

struct pci_driver has no field of driver_data.
It's in pci_device_id.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

pciehp: use get_service_data

Current pciehp driver saves its private data pointer into pci_dev
structure using pci_set_drvdata()/pci_get_drvdata(). But because
pciehp is not a pci device driver but a PCI Express service driver, it
should save its private data pointer into pcie_device structure using
set_service_data()/get_service_data().

Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

pciehp: remove needless command completed interrupt setting

Currently, pciehp driver enables command completed interrupt as follows.

(1) Don't enable at initialization.
(2) Enable command completed interrupt whenever pciehp issues a
command, if the command doesn't attempt to disable the interrupt.
(3) Disable command completed interrupt at driver unloading.

Once we enable command completed interrupt, we don't need to re-enable
it for every command. So we can simplify above steps as follows:

(1) Enable command completed interrupt at initialization.
(2) No special sequence for command completed interrupt.
(3) Disable command completed interrupt at driver unloading.

Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

pciehp: fix interrupt initialization

Current pciehp driver's initialization sequence is as follows:

(1) initialize controller data structure
(2) install interrupt handler
(3) enable software notification
(4) initialize controller specific slot data structure
(5) initialize generic slot data structure and register it to pci hotplug core

The interrupt handler of pciehp assumes that controller specific slot
data structure is already initialized. However, it is installed at (2)
before initializing controller specific slot data structure at
(4). Because of this, pciehp driver cannot handle the following cases
properly.

- If devices that share IRQ with pciehp raise interrupts between (2) and (4).
- If hotplug events (e.g. MRL open) happen between (3) and (4).

We already have a workaround for this problem ("pciehp: fix NULL
dereference in interrupt handler": dbd79aed1aea2bece0bf43cc2ff3b2f9baf48a08).
But we still need a fundamental fix.

This patch fixes the problem by changing the initialization sequence as follows:

(1) initialize controller data structure
(2) initialize controller specific slot data structure
(3) install interrupt handler
(4) enable software notification
(5) initialize generic slot data structure and register it to pci hotplug core
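
The ordering constraint can be sketched with hypothetical types in user
space. None of the names below exist in the pciehp driver; the point is only
that the handler's precondition is established before the handler can run.

```c
#include <assert.h>
#include <stddef.h>

struct slot_sketch {
	int events;
};

struct controller_sketch {
	struct slot_sketch *slot;	/* controller specific slot data */
	int irq_installed;
};

/* The handler assumes ctrl->slot is already valid. */
static inline void irq_handler_sketch(struct controller_sketch *ctrl)
{
	assert(ctrl->slot != NULL);
	ctrl->slot->events++;
}

/* Step (2) of the new sequence: set up the slot data first. */
static inline void init_slot_sketch(struct controller_sketch *ctrl,
				    struct slot_sketch *slot)
{
	slot->events = 0;
	ctrl->slot = slot;
}

/* Step (3): only now may the handler become reachable. */
static inline void install_irq_sketch(struct controller_sketch *ctrl)
{
	assert(ctrl->slot != NULL);
	ctrl->irq_installed = 1;
}
```

With the old order, an interrupt from a device sharing the IRQ could reach
irq_handler_sketch() while ctrl->slot was still NULL.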

Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Acked-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

firewire: fw-sbp2: fix parsing of logical unit directories

There is a small off-by-one bug in firewire-sbp2. This causes problems
when a device exports multiple LUN Directories. I found it when trying
to talk to a SONY DVD Jukebox.

Signed-off-by: Richard Sharpe <realrichardsharpe@gmail.com>
Acked-by: Kristian Høgsberg <krh@redhat.com>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> (op. order, changelog)

mac80211: fix an oops in several failure paths in key allocation

This patch fixes an oops in several failure paths in key allocation. This
Oops occurs when freeing a key that has not been linked yet, so the
key->sdata is not set.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

prism: islpci_eth.c endianness fix

clock is already cpu-endian (see le32_to_cpu slightly before), so
le64_to_cpu doesn't make much sense.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

rt2x00: Fix lock dependency error

This fixes a circular locking dependency in the workqueue handling.
The interface work task uses the mac80211 function
ieee80211_iterate_active_interfaces() which grabs the RTNL lock.

However when the interface is brought down, this happens under the RTNL
lock as well. This causes problems because mac80211 will flush the workqueue
during the ifdown event. This causes mac80211 to wait until the driver has
completed all work which can't finish because it is waiting on the RTNL lock.

This is fixed by moving rt2x00 workqueue tasks to a different workqueue.
This workqueue can be flushed when the ieee80211_hw structure is removed
by the driver (when the driver is unloaded) which does not happen under the
RTNL lock.

Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

backtrace: replace timer with tasklet + completions

On qemu, the backtrace would show up _after_ the "end of backtrace
testing" message.

This patch changes it to use completions instead, which will guarantee
that no such race exists.

Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

stacktrace: add saved stack traces to backtrace self-test

This patch adds saved stack-traces to the backtrace suite of self-tests.

Note that we don't depend on or unconditionally enable CONFIG_STACKTRACE
because not all architectures may have it (and we still want to enable the
other tests for those architectures).

Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

avr32: Kill special exception handler sections

Kill the special exception handler sections .tlbx.ex.text,
.tlbr.ex.text, .tlbw.ex.text and .scall.text. Use .org instead to place
the handlers at the required offsets from EVBA.

Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: Kill unneeded #include <asm/pgalloc.h> from asm/mmu_context.h

Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: Clean up time.c #includes

Remove lots of unneeded #includes, add #include <linux/kernel.h> and
sort alphabetically.

Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

x86: don't destroy %rbp on kernel-mode faults

From the code:

    "B stepping K8s sometimes report an truncated RIP for IRET exceptions
    returning to compat mode. Check for these here too."

The code then proceeds to truncate the upper 32 bits of %rbp. This means
that when do_page_fault() is finally called, its prologue,

    do_page_fault:
        push %rbp
        movq %rsp, %rbp

will put the truncated base pointer on the stack. This means that the
stack tracer will not be able to follow the base-pointer changes and
will see all subsequent stack frames as unreliable.

This patch changes the code to use a different register (%rcx) for the
checking and leaves %rbp untouched.

Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

wireless: remove RFKILL_STATE_HARD_BLOCKED warnings

  CC [M]  drivers/net/wireless/b43/rfkill.o
drivers/net/wireless/b43/rfkill.c: In function ‘b43_rfkill_soft_toggle’:
drivers/net/wireless/b43/rfkill.c:90: warning: enumeration value ‘RFKILL_STATE_HARD_BLOCKED’ not handled in switch

  CC [M]  drivers/net/wireless/b43legacy/rfkill.o
drivers/net/wireless/b43legacy/rfkill.c: In function ‘b43legacy_rfkill_soft_toggle’:
drivers/net/wireless/b43legacy/rfkill.c:92: warning: enumeration value ‘RFKILL_STATE_HARD_BLOCKED’ not handled in switch

  CC [M]  drivers/net/wireless/iwlwifi/iwl-rfkill.o
drivers/net/wireless/iwlwifi/iwl-rfkill.c: In function ‘iwl_rfkill_soft_rf_kill’:
drivers/net/wireless/iwlwifi/iwl-rfkill.c:56: warning: enumeration value ‘RFKILL_STATE_HARD_BLOCKED’ not handled in switch

Also handle RFKILL_STATE_{ON,OFF} -> RFKILL_STATE_{UNBLOCKED,SOFT_BLOCKED}
conversion since I'm already here...
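
The warnings above come from a switch over an enum that does not list
every enumerator. A minimal sketch of the usual fix follows; the enum,
the function and its return values are hypothetical stand-ins, not the
rfkill API or the drivers' actual code.

```c
/* Hypothetical three-state enum modelled on the names in the warnings. */
enum demo_rfkill_state {
	DEMO_STATE_SOFT_BLOCKED,
	DEMO_STATE_UNBLOCKED,
	DEMO_STATE_HARD_BLOCKED,
};

/* Every enumerator has its own case, so the compiler has nothing left
 * to warn about. Returns 0 on success, -1 if the request is refused. */
int demo_soft_toggle(enum demo_rfkill_state state, int *radio_on)
{
	switch (state) {
	case DEMO_STATE_UNBLOCKED:
		*radio_on = 1;
		return 0;
	case DEMO_STATE_SOFT_BLOCKED:
		*radio_on = 0;
		return 0;
	case DEMO_STATE_HARD_BLOCKED:
		/* Assumption: software may not request a hard block. */
		return -1;
	}
	return -1;
}
```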

Signed-off-by: John W. Linville <linville@tuxdriver.com>

atmel_pwm: Rename the "mck" clock to "pwm_clk"

The name "mck" causes a conflict on AT91. Call it "pwm_clk" instead.

Signed-off-by: Sedji Gaouaou <sedji.gaouaou@atmel.com>
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

at32ap700x spi: enable pullups on MISO

This is a minor tweak to the at32ap700x pin configuration for the SPI
input pin (MISO), enabling the on-chip weak pullup (typical 190K) to

  (a) ensure a fixed data value for missing or input-only slaves;

  (b) prevent power waste associated with inputs floating near VDDIO/2.

Atmel's boards have no external pullup or pulldown on these pins, so
it's unlikely other boards would address these issues with external
pulldowns.  Were there trouble, board-specific code could turn off
the relevant pullup(s).

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: improve NGW100 I2C/PMBus setup

Basic I2C initialization for the NGW100 board:

  - Provide empty i2c device table. Daughtercards may add devices,
    and the ATtiny24 could do stuff too.

  - Set up EXTINT(3) so the ATtiny24 can interrupt the AP7000.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: Add PSIF platform devices

This patch adds the PS/2 interface (PSIF) to the device code, split into
two platform devices, one for each port.

The function for adding the PSIF platform device is also added to the
board header file.

Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: Add pin configuration choice to LCDC peripheral

This patch lets the board code choose which pin out to use for the LCD
interface.

On AT32AP7000 the LCDC is wired to two sets of pins, which lets the user
choose between dual ethernet and 32-bit EBI. For the ATNGW100 board it
is vital to have the choice to select the alternative pinout since this
pinout is routed to the external headers.

Update ATSTK1002 and ATSTK1004 to use the new interface.

Signed-off-by: Hans-Christian Egtvedt <hcegtvedt@atmel.com>
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: minor GPIO handling updates

On the odd chance some code uses a pin as a GPIO IRQ without calling
gpio_request() or gpio_direction_input(), the debug dump should still
show its pin status.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

ath5k: remove now unused variable declared in ath5k_tx

CC [M] drivers/net/wireless/ath5k/base.o
drivers/net/wireless/ath5k/base.c: In function ‘ath5k_tx’:
drivers/net/wireless/ath5k/base.c:2598: warning: unused variable ‘info’

Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: fix tx fragmentation

This patch fixes TX fragmentation, which was broken by the
tx handler reordering and 'tx info to cb' patches.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: make workqueue freezable

This patch makes the mac80211 workqueue freezable making it
interact a bit better with system suspend and not try to ping
the AP while the hardware is down.

This doesn't really help with implementing proper suspend in
any way but makes some bad things trigger less.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

wireless: Small cleanups

Small whitespace cleanups for wireless drivers

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

iwlwifi: fix build for CONFIG_INPUT=n

Fix iwlwifi so that it builds cleanly with CONFIG_INPUT=n.
Also free the input device on exit.

drivers/built-in.o: In function `iwl_rfkill_unregister':
(.text+0xbf430): undefined reference to `input_unregister_device'
drivers/built-in.o: In function `iwl_rfkill_init':
(.text+0xbf51c): undefined reference to `input_allocate_device'
drivers/built-in.o: In function `iwl_rfkill_init':
(.text+0xbf5bf): undefined reference to `input_register_device'
drivers/built-in.o: In function `iwl_rfkill_init':
(.text+0xbf5e9): undefined reference to `input_free_device'
net/built-in.o: In function `rfkill_disconnect':
rfkill-input.c:(.text+0xe71e1): undefined reference to `input_close_device'
rfkill-input.c:(.text+0xe71e9): undefined reference to `input_unregister_handle'
net/built-in.o: In function `rfkill_connect':
rfkill-input.c:(.text+0xe723e): undefined reference to `input_register_handle'
rfkill-input.c:(.text+0xe724d): undefined reference to `input_open_device'
rfkill-input.c:(.text+0xe725c): undefined reference to `input_unregister_handle'
net/built-in.o: In function `rfkill_handler_init':
rfkill-input.c:(.init.text+0x36ec): undefined reference to `input_register_handler'
net/built-in.o: In function `rfkill_handler_exit':
rfkill-input.c:(.exit.text+0x112c): undefined reference to `input_unregister_handler'
make[1]: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: Add RTNL warning for workqueue

The workqueue provided by mac80211 should not be used for
scheduled tasks that acquire the RTNL lock. This could be done
when the driver uses the function ieee80211_iterate_active_interfaces()
within the scheduled work. Such behavior will end in lock
dependency problems when an interface is being removed.

This patch will add a notification about the RTNL locking and
the mac80211 workqueue to prevent driver developers from
blindly using it.

Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: add phy information to giwname

This patch adds phy information to giwname.

Quoting:
It's not useless, it's supposed to tell you about the protocol
capability of the device, like "IEEE 802.11b" or "IEEE 802.11abg"

Jean

Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: update the authentication method

This patch updates the authentication method upon giwencode ioctl.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: don't return -EINVAL upon iwconfig wlan0 rts auto

This patch avoids returning -EINVAL upon iwconfig wlan0 rts auto. If
rts->fixed is 0, then we should choose a default value instead of failing.
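
A minimal sketch of the behaviour described, assuming a request with a
'value' and a 'fixed' field as in the text. The struct, the function and
both threshold constants are hypothetical placeholders and are not
mac80211's definitions.

```c
#include <errno.h>

#define DEMO_RTS_DEFAULT 2347	/* placeholder default, an assumption */
#define DEMO_RTS_MAX     2347	/* placeholder upper bound, an assumption */

struct demo_rts_request {
	int value;	/* requested threshold */
	int fixed;	/* 0 means "auto" */
};

/* Returns 0 and stores the threshold, or -EINVAL for a bad fixed value. */
int demo_set_rts(const struct demo_rts_request *req, int *threshold)
{
	if (!req->fixed) {
		/* "rts auto": fall back to a default instead of failing. */
		*threshold = DEMO_RTS_DEFAULT;
		return 0;
	}
	if (req->value < 0 || req->value > DEMO_RTS_MAX)
		return -EINVAL;
	*threshold = req->value;
	return 0;
}
```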

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

b43: Fix PIO skb clobber

This fixes a clobber of the skb that was introduced by the
tx_control->cb conversion patches.
This bug causes a crash when the skb destructor is invoked. That happens
on skb_orphan() or kfree_skb().

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

rt2x00: kill URB for all TX queues during disable_radio()

During rt2x00usb_disable_radio() all pending urb's should
be killed and not only those from the RX queue.

Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: mlme.c use new frame control helpers

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: rx.c use new frame control helpers

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: tx.c use new frame control helpers

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: wep.c use new frame control helpers

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

ath5k: convert LED code to use mac80211 triggers

This change cleans up the ath5k LED code and converts it to use
the standard LED device class along with the rx/tx LED triggers
provided by mac80211.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: Let drivers have access to TKIP key offsets for TX and RX MIC

Some drivers may want to use the TKIP key offsets for TX and RX
MIC, so let's move this out. Let's also clear up a bit how this is used
internally in mac80211.

Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

rt2x00: Remove duplicate deinitialization

When rt2x00queue_alloc_rxskbs() fails, rt2x00queue_uninitialize()
will be called, which will free all rxskbs. So we don't need
to do this in the rt2x00queue_alloc_rxskb() function as well.

rt2x00queue_free_skb() unmaps the DMA but doesn't clear the
allocation flag. Since the code is copied from rt2x00queue_unmap_skb()
anyway (and that function does clear the flag) we might as well
use that function directly.

Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

b43: Add debugfs firmware debugging knob

This adds a firmware debugging knob to debugfs.
With this knob it's possible to enable advanced runtime firmware
checks.
For now it only implements one sanity check for the mac-suspend.
In future there'll probably be more.
If CONFIG_B43_DEBUG is disabled, these checks will collapse to nothing.

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

b43: Add simple firmware watchdog

This adds a simple firmware watchdog for the opensource firmware.
Every 15 seconds this checks whether the firmware has zeroed out the
watchdog register. The firmware will do this in its eventloop.
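
The check can be pictured roughly as below. This is a sketch under
assumptions: the struct and function are hypothetical, the register is
modelled as a plain variable, and re-arming it with a non-zero value
after each check is a guess at how such a handshake would work, not a
description of the b43 code. The 15 second timer is left out.

```c
/* Hypothetical model of the shared watchdog register. */
struct demo_fw {
	unsigned int watchdog_reg;
};

/* Called periodically by the driver. Returns 1 if the firmware cleared
 * the register since the last check (it is alive), 0 otherwise. */
int demo_watchdog_check(struct demo_fw *fw)
{
	int alive = (fw->watchdog_reg == 0);

	/* Assumption: re-arm so the next check sees fresh evidence. */
	fw->watchdog_reg = 1;
	return alive;
}
```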

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

mac80211: rename TKIP debugging Kconfig symbol

... to MAC80211_TKIP_DEBUG rather than TKIP_DEBUG.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

ssb, b43, b43legacy, b44: Rewrite SSB DMA API

This is a rewrite of the DMA API for SSB devices.
This is needed, because the old (non-existing) "API" made too many bad
assumptions on the API of the host-bus (PCI).
This introduces an almost complete SSB-DMA-API that maps to the lowlevel
bus-API based on the bustype.

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

avr32: Fix wrong I/O access size in __raw_readsb

__raw_readsb() should always use byte accesses, never halfword accesses,
to I/O memory.
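
For illustration, a byte-wise string read from a single I/O location can
be sketched as follows. The function name is hypothetical and this is
portable C, not the avr32 implementation; it only shows that every
access is one byte wide and goes to the same address.

```c
#include <stddef.h>
#include <stdint.h>

/* Read 'count' bytes from one I/O address into 'data', one byte-sized
 * access at a time. The source address is never advanced, as for a
 * FIFO-style device register. */
void demo_readsb(const volatile void *addr, void *data, size_t count)
{
	const volatile uint8_t *port = addr;
	uint8_t *dst = data;

	while (count--)
		*dst++ = *port;
}
```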

Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: Fix sigaltstack behaviour

A signal handler should be able to change the signal stack used for the
next signal by altering the ucontext_t passed as a parameter to the
handler. This does not currently work on avr32 since it doesn't update
the in-kernel signal context from the ucontext_t upon signal handler
return.

Fix it by adding a call to do_sigaltstack() from sys_rt_sigreturn(),
bringing it in line with most other architectures.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
[haavard.skinnemoen@atmel.com: changed patch description]
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: Allow board to define oscillator rates

Our custom board uses different oscillator rates from those on atngw100
and atstk100x.

Currently these rates are hardcoded in arch/avr32/mach-at32ap/at32ap700x.c.

This patch moves them into board specific code.

Signed-off-by: Alex Raimondi <raimondi@miromico.ch>
Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: export empty_zero_page

Fixes one of two ext4 build problems:
ERROR: "empty_zero_page" [fs/ext4/ext4dev.ko] undefined!

Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

avr32: Provide PCI DMA mapping API

Some non-PCI drivers need the PCI variant of the DMA mapping API.
Include <asm-generic/pci-dma-compat.h> to provide this through the
non-PCI DMA mapping API.

Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com>

sched: export cpu_clock

the rcutorture module relies on cpu_clock.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: make sched_{rt,fair}.c ifdefs more readable

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: bias effective_load() error towards failing wake_affine().

Measurement shows that the difference between cgroup:/ and cgroup:/foo
wake_affine() results is that the latter succeeds significantly more often.

Therefore bias the calculations towards failing the test.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: incremental effective_load()

Increase the accuracy of the effective_load values.

Not only consider the current increment (as per the attempted wakeup), but
also consider the delta between when we last adjusted the shares and the
current situation.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: correct wakeup weight calculations

rw_i = {2, 4, 1, 0}
s_i = {2/7, 4/7, 1/7, 0}

wakeup on cpu0, weight=1

rw'_i = {3, 4, 1, 0}
s'_i = {3/8, 4/8, 1/8, 0}

s_0 = S * rw_0 / \Sum rw_j ->
\Sum rw_j = S*rw_0/s_0 = 1*2*7/2 = 7 (correct)

s'_0 = S * (rw_0 + 1) / (\Sum rw_j + 1) =
1 * (2+1) / (7+1) = 3/8 (correct)

So we find that adding 1 to cpu0 gains it 3/8 - 2/7 = 5/56 in weight.
If the other cpu involved were, say, cpu1, we'd also have to calculate
its 4/7 - 4/8 = 4/56 loss.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: fix mult overflow

It was observed these mults can overflow.
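
The message does not say which multiplications are meant or how they
were fixed. As a generic illustration only, the sketch below shows a
32-bit product wrapping around and the common remedy of widening one
operand before multiplying. Both function names are hypothetical.

```c
#include <stdint.h>

/* Product computed in 32 bits: wraps modulo 2^32 when it does not fit. */
uint32_t demo_mul_narrow(uint32_t a, uint32_t b)
{
	return a * b;
}

/* Widen first, so the full 64-bit product is kept. */
uint64_t demo_mul_wide(uint32_t a, uint32_t b)
{
	return (uint64_t)a * b;
}
```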

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: update shares on wakeup

We found that the affine wakeup code needs rather accurate load figures
to be effective. The trouble is that updating the load figures is fairly
expensive with group scheduling. Therefore ratelimit the updating.
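
Ratelimiting an expensive update can be sketched as below. The struct,
the function and the notion of an abstract 'now' timestamp are
assumptions made for the example; the scheduler's real bookkeeping is
not shown in this log.

```c
#include <stdint.h>

/* Hypothetical ratelimit state: time of last update and minimum gap. */
struct demo_ratelimit {
	uint64_t last;
	uint64_t interval;
};

/* Returns 1 if enough time has passed to justify the expensive update
 * (and records 'now' as the new reference point), 0 otherwise. */
int demo_should_update(struct demo_ratelimit *rl, uint64_t now)
{
	if (now - rl->last < rl->interval)
		return 0;
	rl->last = now;
	return 1;
}
```

The caller would recompute the load figures only when this returns 1 and
otherwise keep using the slightly stale values.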

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: fix shares boost logic

In case the domain is empty, pretend there is a single task on each cpu, so
that together with the boost logic we end up giving 1/n shares to each
cpu.
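
A simplified sketch of the empty-domain case, assuming shares are
distributed in proportion to per-cpu runqueue weight. The function name
and the use of weight 1 for the pretended task are assumptions, and the
boost logic mentioned above is not modelled.

```c
#include <stddef.h>

/* Share of group weight S given to cpu i, where rw[] holds the per-cpu
 * runqueue weights. ncpus must be greater than zero. */
unsigned long demo_cpu_share(unsigned long S, const unsigned long *rw,
			     size_t ncpus, size_t i)
{
	unsigned long sum = 0;
	size_t j;

	for (j = 0; j < ncpus; j++)
		sum += rw[j];

	/* Empty domain: pretend each cpu runs one task of weight 1,
	 * so every cpu ends up with S / ncpus. */
	if (sum == 0)
		return S / ncpus;

	return S * rw[i] / sum;
}
```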

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: disable source/target_load bias

The bias given by source/target_load functions can be very large, disable
it by default to get faster convergence.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: optimize effective_load()

s_i = S * rw_i / \Sum_j rw_j

-> \Sum_j rw_j = S * rw_i / s_i

-> s'_i = S * (rw_i + w) / (\Sum_j rw_j + w)

delta s = s' - s = S * (rw + w) / ((S * rw / s) + w) - s
= s * (S * (rw + w) / (S * rw + s * w) - 1)

a = S*(rw+w), b = S*rw + s*w

delta s = s * (a-b) / b

IOW, trade one divide for two multiplies
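
The two forms of the derivation can be compared numerically with the
sketch below. It uses floating point for readability, which is an
assumption of the example rather than how the kernel computes it, and
the function names are hypothetical. With S = 1, rw = 2, s = 2/7 and
w = 1, both forms give 3/8 - 2/7 = 5/56.

```c
/* Direct form: recover \Sum_j rw_j as S * rw / s, compute s', subtract s. */
double demo_delta_direct(double S, double rw, double s, double w)
{
	double s_new = S * (rw + w) / ((S * rw / s) + w);

	return s_new - s;
}

/* Rearranged form: a = S*(rw+w), b = S*rw + s*w, delta = s*(a-b)/b. */
double demo_delta_rearranged(double S, double rw, double s, double w)
{
	double a = S * (rw + w);
	double b = S * rw + s * w;

	return s * (a - b) / b;
}
```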

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: remove prio preference from balance decisions

Priority loses much of its meaning in a hierarchical context. So don't
use it in balance decisions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: fix task_h_load()

Currently task_h_load() computes the load of a task and uses that to either
subtract it from the total, or add to it.

However, removing or adding a task need not have any effect on the total load
at all. Imagine adding a task to a group that is local to one cpu - in that
case the total load of that cpu is unaffected.

So properly compute addition/removal:

s_i = S * rw_i / \Sum_j rw_j
s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg)

then s'_i - s_i gives the change in load.

Where s_i is the shares for cpu i, S the group weight, rw_i the runqueue weight
for that cpu, wl the weight we add (subtract) and wg the weight contribution to
the runqueue.
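
As a sketch of the formulas above, the change in load s'_i - s_i can be
written as a small helper. Floating point and the function names are
assumptions of the example. When the group is local to one cpu
(rw_i equals \Sum_j rw_j) and wl equals wg, the result is zero, which
matches the case described in the text.

```c
/* s_i = S * rw_i / \Sum_j rw_j */
double demo_share(double S, double rw_i, double sum_rw)
{
	return S * rw_i / sum_rw;
}

/* s'_i - s_i, with wl added to this cpu's runqueue weight and wg added
 * to the sum over all runqueues. */
double demo_load_change(double S, double rw_i, double sum_rw,
			double wl, double wg)
{
	return demo_share(S, rw_i + wl, sum_rw + wg) -
	       demo_share(S, rw_i, sum_rw);
}
```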

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>