pilppa.org Git - linux-2.6-omap-h63xx.git/log

]> pilppa.org Git - linux-2.6-omap-h63xx.git/log

projects / linux-2.6-omap-h63xx.git / log

summary | shortlog | log | commit | commitdiff | tree
first ⋅ prev ⋅ next

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:45 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: generic irq migration support from process context

Generic infrastructure for migrating the irq from the process context in the
presence of CONFIG_GENERIC_PENDING_IRQ.

This will be used later for migrating irq in the presence of
interrupt-remapping.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:44 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: routines managing Interrupt remapping table entries.

Routines handling the management of interrupt remapping table entries.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:43 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: Interrupt remapping infrastructure

Interrupt remapping (part of Intel Virtualization Tech for directed I/O)
infrastructure.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:42 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: Queued invalidation infrastructure (part of VT-d)

Queued invalidation (part of Intel Virtualization Technology for
Directed I/O architecture) infrastructure.

This will be used for invalidating the interrupt entry cache in the
case of Interrupt-remapping and IOTLB invalidation in the case
of DMA-remapping.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:41 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: move IOMMU_WAIT_OP() macro to intel-iommu.h

move IOMMU_WAIT_OP() macro to header file.

This will be used by both DMA-remapping and Intr-remapping.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:40 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: parse ioapic scope under vt-d structures

Parse the vt-d device scope structures to find the mapping between IO-APICs
and the interrupt remapping hardware units.

This will be used later for enabling Interrupt-remapping for IOAPIC devices.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:39 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: Fix the need for RMRR in the DMA-remapping detection

Presence of RMRR structures is not compulsory for enabling DMA-remapping.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Yong Y Wang <yong.y.wang@intel.com>
Cc: Yong Y Wang <yong.y.wang@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:38 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: use CONFIG_DMAR for DMA-remapping specific code

DMA remapping specific code covered with CONFIG_DMAR in
the generic code which will also be used later for enabling Interrupt-remapping.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:37 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: code re-structuring, to be used by both DMA and Interrupt remapping

Allocate the iommu during the parse of DMA remapping hardware
definition structures. And also, introduce routines for device
scope initialization which will be explicitly called during
dma-remapping initialization.

These will be used for enabling interrupt remapping separately from the
existing DMA-remapping enabling sequence.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:36 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: fix the need for sequential array allocation of iommus

Clean up the intel-iommu code related to deferred iommu flush logic. There is
no need to allocate all the iommu's as a sequential array.

This will be used later in the interrupt-remapping patch series to
allocate iommu much early and individually for each device remapping
hardware unit.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Suresh Siddha [Thu, 10 Jul 2008 18:16:35 +0000 (11:16 -0700)]

x64, x2apic/intr-remap: Intel vt-d, IOMMU code reorganization

code reorganization of the generic Intel vt-d parsing related routines and linux
iommu routines specific to Intel vt-d.

drivers/pci/dmar.c now contains the generic vt-d parsing related routines
drivers/pci/intel_iommu.c contains the iommu routines specific to vt-d

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: akpm@linux-foundation.org
Cc: arjan@linux.intel.com
Cc: andi@firstfloor.org
Cc: ebiederm@xmission.com
Cc: jbarnes@virtuousgeek.org
Cc: steiner@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Ingo Molnar [Sat, 12 Jul 2008 05:30:05 +0000 (07:30 +0200)]

Merge branch 'x86/core' into x86/x2apic

commit | commitdiff | tree

Ingo Molnar [Sat, 12 Jul 2008 05:29:02 +0000 (07:29 +0200)]

Merge branch 'linus' into x86/core

Conflicts:

arch/x86/mm/ioremap.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Michael Karcher [Fri, 11 Jul 2008 16:04:46 +0000 (18:04 +0200)]

x86: fix ldt limit for 64 bit

Fix size of LDT entries. On x86-64, ldt_desc is a double-sized descriptor.

Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Roland McGrath [Thu, 10 Jul 2008 21:50:39 +0000 (14:50 -0700)]

x86_64: fix delayed signals

On three of the several paths in entry_64.S that call
do_notify_resume() on the way back to user mode, we fail to properly
check again for newly-arrived work that requires another call to
do_notify_resume() before going to user mode.  These paths set the
mask to check only _TIF_NEED_RESCHED, but this is wrong.  The other
paths that lead to do_notify_resume() do this correctly already, and
entry_32.S does it correctly in all cases.

All paths back to user mode have to check all the _TIF_WORK_MASK
flags at the last possible stage, with interrupts disabled.
Otherwise, we miss any flags (TIF_SIGPENDING for example) that were
set any time after we entered do_notify_resume().  More work flags
can be set (or left set) synchronously inside do_notify_resume(), as
TIF_SIGPENDING can be, or asynchronously by interrupts or other CPUs
(which then send an asynchronous interrupt).

There are many different scenarios that could hit this bug, most of
them races.  The simplest one to demonstrate does not require any
race: when one signal has done handler setup at the check before
returning from a syscall, and there is another signal pending that
should be handled.  The second signal's handler should interrupt the
first signal handler before it actually starts (so the interrupted PC
is still at the handler's entry point).  Instead, it runs away until
the next kernel entry (next syscall, tick, etc).

This test behaves correctly on 32-bit kernels, and fails on 64-bit
(either 32-bit or 64-bit test binary).  With this fix, it works.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/ucontext.h>

    #ifndef REG_RIP
    #define REG_RIP REG_EIP
    #endif

    static sig_atomic_t hit1, hit2;

    static void
    handler (int sig, siginfo_t *info, void *ctx)
    {
      ucontext_t *uc = ctx;

      if ((void *) uc->uc_mcontext.gregs[REG_RIP] == &handler)
        {
          if (sig == SIGUSR1)
            hit1 = 1;
          else
            hit2 = 1;
        }

      printf ("%s at %#lx\n", strsignal (sig),
              uc->uc_mcontext.gregs[REG_RIP]);
    }

    int
    main (void)
    {
      struct sigaction sa;
      sigset_t set;

      sigemptyset (&sa.sa_mask);
      sa.sa_flags = SA_SIGINFO;
      sa.sa_sigaction = &handler;

      if (sigaction (SIGUSR1, &sa, NULL)
          || sigaction (SIGUSR2, &sa, NULL))
        return 2;

      sigemptyset (&set);
      sigaddset (&set, SIGUSR1);
      sigaddset (&set, SIGUSR2);
      if (sigprocmask (SIG_BLOCK, &set, NULL))
        return 3;

      printf ("main at %p, handler at %p\n", &main, &handler);

      raise (SIGUSR1);
      raise (SIGUSR2);

      if (sigprocmask (SIG_UNBLOCK, &set, NULL))
        return 4;

      if (hit1 + hit2 == 1)
        {
          puts ("PASS");
          return 0;
        }

      puts ("FAIL");
      return 1;
    }

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Rafael J. Wysocki [Sat, 12 Jul 2008 00:50:15 +0000 (02:50 +0200)]

x86: remove conflicting nx6325 and nx6125 quirks

We have two conflicting DMA-based quirks in there for the same set of
boxes (HP nx6325 and nx6125) and one of them actually breaks my box.

So remove the extra code.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: =?iso-8859-1?q?T=F6r=F6k_Edwin?= <edwintorok@gmail.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Thomas Gleixner [Sat, 12 Jul 2008 03:33:30 +0000 (05:33 +0200)]

kernel-paramaters: document pmtmr= command line option

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

commit | commitdiff | tree

Randy Dunlap [Fri, 11 Jul 2008 19:57:31 +0000 (12:57 -0700)]

acpi_pm clccksource: fix printk format warning

Fix printk format warning in acpi_pm clocksource:

linux-next-20080711/drivers/clocksource/acpi_pm.c:231: warning: format '%04lx' expects type 'long unsigned int', but argument 2 has type 'u32'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: akpm <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

commit | commitdiff | tree

Linus Torvalds [Sat, 12 Jul 2008 00:00:17 +0000 (17:00 -0700)]

Merge git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog

* git://git.kernel.org/pub/scm/linux/kernel/git/wim/linux-2.6-watchdog:
[PATCH] IPMI: return correct value from ipmi_write

commit | commitdiff | tree

Mingming Cao [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Documention update for new ordered mode and delayed allocation

Adding some documentations for delayed allocation and new ordered mode.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Eric Sandeen [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: do not set extents feature from the kernel

We've talked for a while about getting rid of any feature-
setting from the kernel; this gets rid of the code which would
set the INCOMPAT_EXTENTS flag on the first file write when mounted
as ext4[dev].

With this patch, if the extents feature is not already set on disk,
then mounting as ext4 will fall back to noextents with a warning,
and if -o extents is explicitly requested, the mount will fail,
also with warning.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Don't allow nonextenst mount option for large filesystem

The block mapped inode format can address only blocks within 2**32. This
causes a number of issues, the biggest of which is that the block
allocator needs to be taught that certain inodes can not utilize block
numbers > 2**32. So until this is fixed, it is simplest to fail
mounting of file systems with more than 2**32 blocks if the -o noextents
option is given.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Enable delalloc by default.

Enable delalloc by default to ensure it gets sufficient testing and
because it makes the filesystem much more efficient. Add a nodealalloc
option to disable delayed allocation, and update ext4_show_options to
show delayed allocation off if it is disabled.

If the data=journal mount option is used, disable delayed allocation
since the delalloc code doesn't support data=journal yet.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

commit | commitdiff | tree

Mingming Cao [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: delayed allocation i_blocks fix for stat

Right now i_blocks is not getting updated until the blocks are actually
allocaed on disk. This means with delayed allocation, right after files
are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du"
also shows the files taking zero space, which is highly confusing to the
user.

Since delayed allocation already keeps track of per-inode total
number of blocks that are subject to delayed allocation, this patch fix
this by using that to adjust the value returned by stat(2). When real
block allocation is done, the i_blocks will get updated. Since the
reserved blocks for delayed allocation will be decreased, this will be
keep value returned by stat(2) consistent.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Mingming Cao [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: fix delalloc i_disksize early update issue

Ext4_da_write_end() used walk_page_buffers() with a callback function of
ext4_bh_unmapped_or_delay() to check if it extended the file size
without allocating any blocks (since in this case i_disksize needs to be
updated).  However, this is didn't work proprely because the buffer head
has not been marked dirty yet --- this is done later in
block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to
always return false.

In addition, walk_page_buffers() checks all of the buffer heads covering
the page, and the only buffer_head that should be checked is the one
covering the end of the write.  Otherwise, given a 1k blocksize
filesystem and a 4k page size, the buffer head covering the first 1k
stripe of the file could be unmapped (because it was a sparse file), and
the second or third buffer_head covering that page could be mapped, and
using walk_page_buffers() would fail in this case since it would stop at
the first unmapped buffer_head and return true.

The core problem is that walk_page_buffers() was intended to do work in
a callback function, and a non-zero return value indicated a failure,
which termined the walk of the buffer heads covering the page.  It was
not intended to be used with a boolean function, such as
ext4_bh_unmapped_or_delay().

Add addtional fix from Aneesh to protect i_disksize update rave with truncate.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Handle page without buffers in ext4_*_writepage()

It can happen that buffers are removed from the page before it gets
marked dirty and then is passed to writepage().  In writepage() we just
initialize the buffers and check whether they are mapped and non
delay. If they are mapped and non delay we write the page. Otherwise we
mark them dirty.  With this change we don't do block allocation at all
in ext4_*_write_page.

writepage() can get called under many condition and with a locking order
of journal_start -> lock_page, we should not try to allocate blocks in
writepage() which get called after taking page lock.  writepage() can
get called via shrink_page_list even with a journal handle which was
created for doing inode update.  For example when doing
ext4_da_write_begin we create a journal handle with credit 1 expecting a
i_disksize update for the inode. But ext4_da_write_begin can cause
shrink_page_list via _grab_page_cache. So having a valid handle via
ext4_journal_current_handle is not a guarantee that we can use the
handle for block allocation in writepage, since we shouldn't be using
credits that had been reserved for other updates.  That it could result
in we running out of credits when we update inodes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Add ordered mode support for delalloc

This provides a new ordered mode implementation which gets rid of using
buffer heads to enforce the ordering between metadata change with the
related data chage. Instead, in the new ordering mode, it keeps track
of all of the inodes touched by each transaction on a list, and when
that transaction is committed, it flushes all of the dirty pages for
those inodes. In addition, the new ordered mode reverses the lock
ordering of the page lock and transaction lock, which provides easier
support for delayed allocation.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Mingming Cao [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Invert lock ordering of page_lock and transaction start in delalloc

With the reverse locking, we need to start a transation before taking
the page lock, so in ext4_da_writepages() we need to break the write-out
into chunks, and restart the journal for each chunck to ensure the
write-out fits in a single transaction.

Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
which fixes delalloc sync hang with journal lock inversion, and address
the performance regression issue.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

mm: Add range_cont mode for writeback

Filesystems like ext4 needs to start a new transaction in
the writepages for block allocation. This happens with delayed
allocation and there is limit to how many credits we can request
from the journal layer. So we call write_cache_pages multiple
times with wbc->nr_to_write set to the maximum possible value
limitted by the max journal credits available.

Add a new mode to writeback that enables us to handle this
behaviour. In the new mode we update the wbc->range_start
to point to the new offset to be written. Next call to
call to write_cache_pages will start writeout from specified
range_start offset. In the new mode we also limit writing
to the specified wbc->range_end.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Mingming Cao [Mon, 14 Jul 2008 21:52:37 +0000 (17:52 -0400)]

ext4: delayed allocation ENOSPC handling

This patch does block reservation for delayed
allocation, to avoid ENOSPC later at page flush time.

Blocks(data and metadata) are reserved at da_write_begin()
time, the freeblocks counter is updated by then, and the number of
reserved blocks is store in per inode counter.

At the writepage time, the unused reserved meta blocks are returned
back. At unlink/truncate time, reserved blocks are properly released.

Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix the oldallocator block reservation accounting with delalloc, added
lock to guard the counters and also fix the reservation for meta blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

commit | commitdiff | tree

Mingming Cao [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

percpu_counter: new function percpu_counter_sum_and_set

Delayed allocation need to check free blocks at every write time.
percpu_counter_read_positive() is not quit accurate. delayed
allocation need a more accurate accounting, but using
percpu_counter_sum_positive() is frequently is quite expensive.

This patch added a new function to update center counter when sum
per-cpu counter, to increase the accurate rate for next
percpu_counter_read() and require less calling expensive
percpu_counter_sum().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Alex Tomas [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Add delayed allocation support in data=writeback mode

Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and
release the page from page cache if the delalloc write_begin failed, and
properly handle preallocated blocks. Also added a fix to clear
buffer_delay in block_write_full_page() after allocating a delayed
buffer.

Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to update i_disksize properly and to add bmap support for delayed
allocation.

Updated with a fix from Valerie Clement <valerie.clement@bull.net> to
avoid filesystem corruption when the filesystem is mounted with the
delalloc option and blocksize < pagesize.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

commit | commitdiff | tree

Alex Tomas [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

vfs: add hooks for ext4's delayed allocation support

Export mpage_bio_submit() and __mpage_writepage() for the benefit of
ext4's delayed allocation support. Also change __block_write_full_page
so that if buffers that have the BH_Delay flag set it will call
get_block() to get the physical block allocated, just as in the
!BH_Mapped case.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

jbd2: Remove data=ordered mode support using jbd buffer heads

Signed-off-by: Jan Kara <jack@suse.cz>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Use new framework for data=ordered mode in JBD2

This patch makes ext4 use inode-based implementation of data=ordered mode
in JBD2. It allows us to unify some data=ordered and data=writeback paths
(especially writepage since we don't have to start a transaction anymore)
and remove some buffer walking.

Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix file system hang due to corrupt jinode values.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

jbd2: Implement data=ordered mode handling via inodes

This patch adds necessary framework into JBD2 to be able to track inodes
with each transaction and write-out their dirty data during transaction
commit time.

This new ordered mode brings all sorts of advantages such as possibility
to get rid of journal heads and buffer heads for data buffers in ordered
mode, better ordering of writes on transaction commit, simplification of
some JBD code, no more anonymous pages when truncate of data being
committed happens. Also with this new ordered mode, delayed allocation
on ordered mode is much simpler.

Signed-off-by: Jan Kara <jack@suse.cz>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

vfs: export filemap_fdatawrite_range()

Make filemap_fdatawrite_range() function public, so that it can later
be used in ordered mode rewrite by JBD/JBD2.

Signed-off-by: Jan Kara <jack@suse.cz>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Fix lock inversion in ext4_ext_truncate()

We cannot call ext4_orphan_add() from under i_data_sem because that
causes a lock ordering violation between i_data_sem and and the
superblock lock.

Updated with Aneesh's locking order fix

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Invert the locking order of page_lock and transaction start

This changes are needed to support data=ordered mode handling via
inodes. This enables us to get rid of the journal heads and buffer
heads for data buffers in the ordered mode. With the changes, during
tranasaction commit we writeout the inode pages using the
writepages()/writepage(). That implies we take page lock during
transaction commit. This can cause a deadlock with the locking order
page_lock -> jbd2_journal_start, since the jbd2_journal_start can wait
for the journal_commit to happen and the journal_commit now needs to
take the page lock. To avoid this dead lock reverse the locking order.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

vfs: Move mark_inode_dirty() from under page lock in generic_write_end()

There's no need to call mark_inode_dirty() under page lock in
generic_write_end(). It unnecessarily makes hold time of page lock longer
and more importantly it forces locking order of page lock and transaction
start for journaling filesystems.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Use page_mkwrite vma_operations to get mmap write notification.

We would like to get notified when we are doing a write on mmap section.
This is needed with respect to preallocated area. We split the preallocated
area into initialzed extent and uninitialzed extent in the call back. This
let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and
that would result in data loss. The changes are also needed to handle ENOSPC
when writing to an mmap section of files with holes.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Jose R. Santos [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Documentation updates.

Some of the information in Documentation/filesystems/ext4.txt is out
of date and in need of an update.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Frederic Bohe [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: fix online resize with mballoc

Update group infos when updating a group's descriptor.
Add group infos when adding a group's descriptor.
Refresh cache pages used by mb_alloc when changes occur.
This will probably need modifications when META_BG resizing will be allowed.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

commit | commitdiff | tree

Eric Sandeen [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: use atomic functions to set bh_state

Use the BUFFER_FNS functions (set_buffer_foo) to set buffer
head state atomically instead of nonatomic __set_bit().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Jan Kara [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Set journal pointer to NULL when journal is released

Set sbi->s_journal to NULL after we call journal_destroy(). This
will be later needed because after journal_destroy() is called,
ext4_clear_inode() can still be called for some inodes (e.g. root
inode) and we'll need to detect there that journal doesn't exists
anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Mingming Cao [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: mballoc avoid use root reserved blocks for non root allocation

mballoc allocation missed check for blocks reserved for root users. Add
ext4_has_free_blocks() check before allocation. Also modified
ext4_has_free_blocks() to support multiple block allocation request.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Eric Sandeen [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: call blkdev_issue_flush on fsync

To ensure that bits are truly on-disk after an fsync,
we should call blkdev_issue_flush if barriers are supported.

Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: cleanup block allocator

Move the code related to block allocation to a single function and add helper
funtions to differient allocation for data and meta data blocks

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Use inode preallocation with -o noextents

When mballoc is enabled, block allocation for old block-based
files are allocated using mballoc allocator instead of old
block-based allocator. The old ext3 block reservation is turned
off when mballoc is turned on.

However, the in-core preallocation is not enabled for block-based/
non-extent based file block allocation. This result in performance
regression, as now we don't have "reservation" ore in-core preallocation
to prevent interleaved fragmentation in multiple writes workload.

This patch fix this by enable per inode in-core preallocation
for non extent files when mballoc is used.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Akinobu Mita [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: fix ext4_init_block_bitmap() for metablock block group

When meta_bg feature is enabled and s_first_meta_bg != 0,
ext4_init_block_bitmap() miscalculates the number of block used by
the group descriptor table (0 or 1 for metablock block group)

This patch fixes this by using ext4_bg_num_gdb()

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Andreas Dilger <adilger@sun.com>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Fix sparse warning

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Li Zefan [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: cleanup never-used magic numbers from htree code

dx_root_limit() will had some dead code which forced it to always return
20, and dx_node_limit to always return 22 for debugging purposes.
Remove it.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Fix ext4_ext_journal_restart() to reflect errors up to the caller

Fix ext4_ext_journal_restart() so it returns any errors reported by
ext4_journal_extend() and ext4_journal_restart().

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Make ext4_ext_find_extent fills ext_path completely

When pos=0 or depth, the fields of ext4_ext_path is are not
completely filled. This patch also removes some unnecessary code.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: return error when calling ext4_ext_split failed

ext4_ext_create_new_leaf must return error when its
calling to ext4_ext_split failed.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Update i_disksize properly when allocating from fallocate area.

When allocating unitialized space at the end of file which had been
preallocated with the FALLOC_FL_KEEP_SIZE option, the file size is not
updated at that time. But the later we are not updating the file size
when writing to that preallocated space.

These changes are for code correctness. This patch allows us to update
the i_disksize at the write_end() callback of filesystem properly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: remove quota allocation when ext4_mb_new_blocks fails

Quota allocation is not removed when ext4_mb_new_blocks calls
kmem_cache_alloc failed. Also make sure the allocation context is freed
on the error path.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Li Zefan [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: remove redundant code in ext4_fill_super()

The previous sb_min_blocksize() has already set the block size.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Mingming Cao [Mon, 14 Jul 2008 01:06:39 +0000 (21:06 -0400)]

jbd2: fix race between jbd2_journal_try_to_free_buffers() and jbd2 commit transaction

journal_try_to_free_buffers() could race with jbd commit transaction
when the later is holding the buffer reference while waiting for the
data buffer to flush to disk. If the caller of
journal_try_to_free_buffers() request tries hard to release the buffers,
it will treat the failure as error and return back to the caller. We
have seen the directo IO failed due to this race. Some of the caller of
releasepage() also expecting the buffer to be dropped when passed with
GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers().

With this patch, if the caller is passing the GFP_KERNEL to indicating
this call could wait, in case of try_to_free_buffers() failed, let's
waiting for journal_commit_transaction() to finish commit the current
committing transaction , then try to free those buffers again with
journal locked.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Jose R. Santos [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: New inode allocation for FLEX_BG meta-data groups.

This patch mostly controls the way inode are allocated in order to
make ialloc aware of flex_bg block group grouping.  It achieves this
by bypassing the Orlov allocator when block group meta-data are packed
toghether through mke2fs.  Since the impact on the block allocator is
minimal, this patch should have little or no effect on other block
allocation algorithms. By controlling the inode allocation, it can
basically control where the initial search for new block begins and
thus indirectly manipulate the block allocator.

This allocator favors data and meta-data locality so the disk will
gradually be filled from block group zero upward.  This helps improve
performance by reducing seek time.  Since the group of inode tables
within one flex_bg are treated as one giant inode table, uninitialized
block groups would not need to partially initialize as many inode
table as with Orlov which would help fsck time as the filesystem usage
goes up.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Theodore Ts'o [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

jbd2: Add commit time into the commit block

Carlo Wood has demonstrated that it's possible to recover deleted
files from the journal. Something that will make this easier is if we
can put the time of the commit into commit block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Stoyan Gaydarov [Mon, 14 Jul 2008 01:03:29 +0000 (21:03 -0400)]

ext4: replace __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__ instead

Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Mon, 14 Jul 2008 01:03:31 +0000 (21:03 -0400)]

ext4: fix error processing in mb_free_blocks

The error processing of the return value of mb_free_blocks is meanless
because it only returns 0. This fix includes

- make mb_free_blocks return void

- remove the error processing part in callers

- unlock group before calling ext4_error in mb_free_blocks

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Mon, 14 Jul 2008 01:03:31 +0000 (21:03 -0400)]

ext4: error proc entry creation when the fs/ext4 is not correctly created

When the directory fs/ext4 is not correctly created under proc, the entry
under this directory should not be created.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

commit | commitdiff | tree

Li Zefan [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: fix build failure if DX_DEBUG is enabled

ext4_next_entry() is used by the debugging function dx_show_leaf(), so
it must be defined before that function.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Theodore Ts'o [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Remove unused variable from ext4_show_options

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Theodore Ts'o [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Rename read_block_bitmap() to ext4_read_block_bitmap()

Since this a non-static function, make it be ext4 specific to avoid
conflicts with potentially other filesystems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: remove double definitions of xattr macros

remove the definitions of macros XATTR_TRUSTED_PREFIX and XATTR_USER_PREFIX
since they are defined in linux/xattr.h

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: miscellaneous error checks and coding cleanups for mballoc

ext4_mb_seq_history_open(): check if sbi->s_mb_history is NULL

ext4_mb_history_init(): replace kmalloc and memset with kzalloc

ext4_mb_init_backend(): remove memset since kzalloc is used

ext4_mb_init(): the return value of ext4_mb_init_backend is int,
but i is unsigned, replace it with a new int variable.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: add error processing when calling ext4_mb_init_cache in mballoc

Add error processing for ext4_mb_load_buddy when it calls
ext4_mb_init_cache.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Mingming Cao [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Fix ext4_mb_init_cache return error

ext4_mb_init_cache() incorrectly always return EIO on success. This
causes the caller of ext4_mb_init_cache() fail when it checks the return
value.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: improve some code in rb tree part of dir.c

* remove unnecessary code in free_rb_tree_fname

* rename free_rb_tree_fname to ext4_htree_create_dir_info
since it and ext4_htree_free_dir_info are a pair

* replace kmalloc with kzalloc in ext4_htree_free_dir_info

All these make the code more readable and simple.
PS: this patch is also suitable for ext3.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Alexey Dobriyan [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: switch to seq_files

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

commit | commitdiff | tree

Julia Lawall [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Use BUG_ON() instead of BUG()

if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
side-effects to allow a definition of BUG_ON that drops the code completely.

The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@ disable unlikely @ expression E,f; @@

(
if (<... f(...) ...>) { BUG(); }
|
- if (unlikely(E)) { BUG(); }
+ BUG_ON(E);
)

@@ expression E,f; @@

(
if (<... f(...) ...>) { BUG(); }
|
- if (E) { BUG(); }
+ BUG_ON(E);
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: start searching for the right extent from the goal group.

With mballoc we search for the best extent using different
criteria. We should always use the goal group when we are
starting with a new criteria.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Shen Feng [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: fix comments to say "ext4"

Change second/third to fourth.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Aneesh Kumar K.V [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: Fix mb_find_next_bit not to return larger than max

Some architectures implement ext4_find_next_bit and
ext4_find_next_zero_bit in such a way that they return
greater than max for some input values. Make sure
mb_find_next_bit and mb_find_next_zero_bit return the
right values.

On 2.6.25 we have include/asm-x86/bitops_32.h
static inline unsigned find_first_bit(const unsigned long *addr, unsigned size)
{
unsigned x = 0;

while (x < size) {
unsigned long val = *addr++;
if (val)
return __ffs(val) + x;
x += (sizeof(*addr)<<3);
}
return x;
}

This can return value greater than size.

Reported and fixed here for lustre

https://bugzilla.lustre.org/show_bug.cgi?id=15932
https://bugzilla.lustre.org/attachment.cgi?id=17205

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

commit | commitdiff | tree

Duane Griffin [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: validate directory entry data before use

ext4_dx_find_entry uses ext4_next_entry without verifying that the entry is
valid. If its rec_len == 0 this causes an infinite loop. Refactor the loop
to check the validity of entries before checking whether they match and
moving onto the next one.

There are other uses of ext4_next_entry in this file which also look
problematic. They should be reviewed and fixed if/when we have a test-case
that triggers them.

This patch fixes the first case (image hdb.25.softlockup.gz) reported in
http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

commit | commitdiff | tree

Duane Griffin [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: handle deleting corrupted indirect blocks

While freeing indirect blocks we attach a journal head to the parent buffer
head, free the blocks, then journal the parent. If the indirect block list
is corrupted and points to the parent the journal head will be detached
when the block is cleared, causing an OOPS.

Check for that explicitly and handle it gracefully.

This patch fixes the third case (image hdb.20000057.nullderef.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

commit | commitdiff | tree

Duane Griffin [Fri, 11 Jul 2008 23:27:31 +0000 (19:27 -0400)]

ext4: handle corrupted orphan list at mount

If the orphan node list includes valid, untruncatable nodes with nlink > 0
the ext4_orphan_cleanup loop which attempts to delete them will not do so,
causing it to loop forever. Fix by checking for such nodes in the
ext4_orphan_get function.

This patch fixes the second case (image hdb.20000009.softlockup.gz)
reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

commit | commitdiff | tree

Roland Dreier [Fri, 11 Jul 2008 20:54:40 +0000 (13:54 -0700)]

IB/umad: BKL is not needed for ib_umad_open()

Remove explicit lock_kernel() calls and document why the code is safe.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

commit | commitdiff | tree

Mark Rustad [Thu, 10 Jul 2008 19:27:11 +0000 (14:27 -0500)]

[PATCH] IPMI: return correct value from ipmi_write

This patch corrects the handling of write operations to the IPMI watchdog
to work as intended by returning the number of characters actually
processed. Without this patch, an "echo V >/dev/watchdog" enables the
watchdog if IPMI is providing the watchdog function.

Signed-off-by: Mark Rustad <MRustad@gmail.com>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

commit | commitdiff | tree

Robert Richter [Fri, 11 Jul 2008 10:26:59 +0000 (12:26 +0200)]

x86/pci: Changing subsystem init for visws

I don't know, if this new code boots, but at least it
compiles. Someone should really test it.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Robert Richter [Fri, 11 Jul 2008 10:18:41 +0000 (12:18 +0200)]

x86/pci: renaming numa into numaq

Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Robert Richter [Fri, 11 Jul 2008 10:18:40 +0000 (12:18 +0200)]

x86/pci: renamed: numa.c -> numaq_32.c

Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Robert Richter [Fri, 11 Jul 2008 10:14:27 +0000 (12:14 +0200)]

x86/pci: Changing subsystem initialization order for NUMA

Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Robert Richter [Fri, 11 Jul 2008 10:13:59 +0000 (12:13 +0200)]

x86/pci: Removing pci-y in Makefile

Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Ingo Molnar [Fri, 11 Jul 2008 19:22:18 +0000 (21:22 +0200)]

Merge branch 'x86/generalize-visws' into x86/core

commit | commitdiff | tree

Maciej W. Rozycki [Fri, 11 Jul 2008 18:47:15 +0000 (19:47 +0100)]

x86: Recover timer_ack lost in the merge of the NMI watchdog

In the course of the recent unification of the NMI watchdog an assignment
to timer_ack to switch off unnecesary POLL commands to the 8259A in the
case of a watchdog failure has been accidentally removed. The statement
used to be limited to the 32-bit variation as since the rewrite of the
timer code it has been relevant for the 82489DX only. This change brings
it back.

Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Maciej W. Rozycki [Fri, 11 Jul 2008 18:35:23 +0000 (19:35 +0100)]

x86: I/O APIC: Never configure IRQ2

There is no such entity as ISA IRQ2.  The ACPI spec does not make it
explicitly clear, but does not preclude it either -- all it says is ISA
legacy interrupts are identity mapped by default (subject to overrides),
but it does not state whether IRQ2 exists or not.  As a result if there is
no IRQ0 override, then IRQ2 is normally initialised as an ISA interrupt,
which implies an edge-triggered line, which is unmasked by default as this
is what we do for edge-triggered I/O APIC interrupts so as not to miss an
edge.

To the best of my knowledge it is useless, as IRQ2 has not been in use
since the PC/AT as back then it was taken by the 8259A cascade interrupt
to the slave, with the line position in the slot rerouted to newly-created
IRQ9.  No device could thus make use of this line with the pair of 8259A
chips.  Now in theory INTIN2 of the I/O APIC may be usable, but the
interrupt of the device wired to it would not be available in the PIC mode
at all, so I seriously doubt if anybody decided to reuse it for a regular
device.

However there are two common uses of INTIN2.  One is for IRQ0, with an
ACPI interrupt override (or its equivalent in the MP table).  But in this
case IRQ2 is gone entirely with INTIN0 left vacant.  The other one is for
an 8959A ExtINTA cascade.  In this case IRQ0 goes to INTIN0 and if ACPI is
used INTIN2 is assumed to be IRQ2 (there is no override and ACPI has no
way to report ExtINTA interrupts).  This is where a problem happens.

The problem is INTIN2 is configured as a native APIC interrupt, with a
vector assigned and the mask cleared.  And the line may indeed get active
and inject interrupts if the master 8959A has its timer interrupt enabled
(it might happen for other interrupts too, but they are normally masked in
the process of rerouting them to the I/O APIC).  There are two cases where
it will happen:

* When the I/O APIC NMI watchdog is enabled.  This is actually a misnomer
  as the watchdog pulses are delivered through the 8259A to the LINT0
  inputs of all the local APICs in the system.  The implication is the
  output of the master 8259A goes high and low repeatedly, signalling
  interrupts to INTIN2 which is enabled too!

  [The origin of the name is I think for a brief period during the
  development we had a capability in our code to configure the watchdog to
  use an I/O APIC input; that would be INTIN2 in this scenario.]

* When the native route of IRQ0 via INTIN0 fails for whatever reason -- as
  it happens with the system considered here.  In this scenario the timer
  pulse is delivered through the 8259A to LINT0 input of the local APIC of
  the bootstrap processor, quite similarly to how is done for the watchdog
  described above.  The result is, again, INTIN2 receives these pulses
  too.  Rafael's system used to escape this scenario, because an incorrect
  IRQ0 override would occupy INTIN2 and prevent it from being unmasked.

My conclusion is IRQ2 should be excluded from configuration in all the
cases and the current exception for ACPI systems should be lifted.  The
reason being the exception not only being useless, but harmful as well.

Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Maciej W. Rozycki [Fri, 11 Jul 2008 18:35:17 +0000 (19:35 +0100)]

x86: L-APIC: Always fully configure IRQ0

Unlike the 32-bit one, the 64-bit variation of the LVT0 setup code for
the "8259A Virtual Wire" through the local APIC timer configuration does
not fully configure the relevant irq_chip structure. Instead it relies on
the preceding I/O APIC code to have set it up, which does not happen if
the I/O APIC variants have not been tried.

The patch includes corresponding changes to the 32-bit variation too
which make them both the same, barring a small syntactic difference
involving sequence of functions in the source. That should work as an aid
with the upcoming merge.

Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Maciej W. Rozycki [Fri, 11 Jul 2008 18:34:36 +0000 (19:34 +0100)]

x86: L-APIC: Set IRQ0 as edge-triggered

IRQ0 is edge-triggered, but the "8259A Virtual Wire" through the local
APIC configuration in the 32-bit version uses the "fasteoi" handler
suitable for level-triggered APIC interrupt. Rewrite code so that the
"edge" handler is used. The 64-bit version uses different code and is
unaffected.

Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Glauber Costa [Fri, 11 Jul 2008 15:36:52 +0000 (12:36 -0300)]

x86: merge dwarf2 headers

Merge dwarf2_32.h and dwarf2_64.h into dwarf2.h.

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Glauber Costa [Thu, 10 Jul 2008 18:17:53 +0000 (15:17 -0300)]

x86: use AS_CFI instead of UNWIND_INFO

In dwarf2_32.h, test for CONFIG_AS_CFI instead of
CONFIG_UNWIND_INFO. Turns out that searching for UNWIND_INFO
returns no match in any Kconfig or Makefile, so we're really
just throwing everything away regarding dwarf frames for i386.

The test that generates CONFIG_AS_CFI does not have anything
x86_64-specific, and right now, checking V=1 builds shows me
that the flags is there anyway, although unused.

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Glauber Costa [Thu, 10 Jul 2008 18:14:57 +0000 (15:14 -0300)]

x86: use ignore macro instead of hash comment

In dwarf_64.h header, use the "ignore" macro the way
i386 does.

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Glauber Costa [Thu, 10 Jul 2008 18:09:20 +0000 (15:09 -0300)]

x86: use matching CFI_ENDPROC

The RING0_INT_FRAME macro defines a CFI_STARTPROC.
So we should really be using CFI_ENDPROC after it.

Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

commit | commitdiff | tree

Brian King [Fri, 11 Jul 2008 18:37:50 +0000 (13:37 -0500)]

[SCSI] ipr: Fix HDIO_GET_IDENTITY oops for SATA devices

Currently, ipr does not support HDIO_GET_IDENTITY to SATA devices.
An oops occurs if userspace attempts to send the command. Since hald
issues the command, ensure we fail the ioctl in ipr. This is a
temporary solution to the oops. Once the ipr libata EH conversion
is upstream, ipr will fully support HDIO_GET_IDENTITY.

Tested-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>

commit | commitdiff | tree

Linus Torvalds [Fri, 11 Jul 2008 18:37:55 +0000 (11:37 -0700)]

Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev

* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  libata-acpi: don't call sleeping function from invalid context
  Added Targa Visionary 1000 IDE adapter to pata_sis.c
  libata-acpi: filter out DIPM enable

commit | commitdiff | tree

Dave Chinner [Fri, 11 Jul 2008 07:43:55 +0000 (17:43 +1000)]

Fix reference counting race on log buffers

When we release the iclog, we do an atomic_dec_and_lock to determine if
we are the last reference and need to trigger update of log headers and
writeout.  However, in xlog_state_get_iclog_space() we also need to
check if we have the last reference count there.  If we do, we release
the log buffer, otherwise we decrement the reference count.

But the compare and decrement in xlog_state_get_iclog_space() is not
atomic, so both places can see a reference count of 2 and neither will
release the iclog.  That leads to a filesystem hang.

Close the race by replacing the atomic_read() and atomic_dec() pair with
atomic_add_unless() to ensure that they are executed atomically.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Tim Shimmin <tes@sgi.com>
Tested-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Arjan van de Ven [Fri, 11 Jul 2008 12:09:55 +0000 (05:09 -0700)]

stackprotector: better self-test

check stackprotector functionality by manipulating the canary briefly
during bootup.

far more robust than trying to overflow the stack. (which is architecture
dependent, etc.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>

Unnamed repository; edit this file 'description' to name the repository.