aboutsummaryrefslogtreecommitdiff
path: root/arch/powerpc/mm/hugetlbpage.c
AgeCommit message (Collapse)Author
2013-09-11mm: migrate: check movability of hugepage in unmap_and_move_huge_page()Naoya Horiguchi
Currently hugepage migration works well only for pmd-based hugepages (mainly due to lack of testing,) so we had better not enable migration of other levels of hugepages until we are ready for it. Some users of hugepage migration (mbind, move_pages, and migrate_pages) do page table walk and check pud/pmd_huge() there, so they are safe. But the other users (softoffline and memory hotremove) don't do this, so without this patch they can try to migrate unexpected types of hugepages. To prevent this, we introduce hugepage_migration_support() as an architecture dependent check of whether hugepage are implemented on a pmd basis or not. And on some architecture multiple sizes of hugepages are available, so hugepage_migration_support() also checks hugepage size. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-04Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "This is the powerpc changes for the 3.11 merge window. In addition to the usual bug fixes and small updates, the main highlights are: - Support for transparent huge pages by Aneesh Kumar for 64-bit server processors. This allows the use of 16M pages as transparent huge pages on kernels compiled with a 64K base page size. - Base VFIO support for KVM on power by Alexey Kardashevskiy - Wiring up of our nvram to the pstore infrastructure, including putting compressed oopses in there by Aruna Balakrishnaiah - Move, rework and improve our "EEH" (basically PCI error handling and recovery) infrastructure. It is no longer specific to pseries but is now usable by the new "powernv" platform as well (no hypervisor) by Gavin Shan. - I fixed some bugs in our math-emu instruction decoding and made it usable to emulate some optional FP instructions on processors with hard FP that lack them (such as fsqrt on Freescale embedded processors). - Support for Power8 "Event Based Branch" facility by Michael Ellerman. This facility allows what is basically "userspace interrupts" for performance monitor events. - A bunch of Transactional Memory vs. Signals bug fixes and HW breakpoint/watchpoint fixes by Michael Neuling. And more ... I appologize in advance if I've failed to highlight something that somebody deemed worth it." * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) pstore: Add hsize argument in write_buf call of pstore_ftrace_call powerpc/fsl: add MPIC timer wakeup support powerpc/mpic: create mpic subsystem object powerpc/mpic: add global timer support powerpc/mpic: add irq_set_wake support powerpc/85xx: enable coreint for all the 64bit boards powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig powerpc/mpic: Add get_version API both for internal and external use powerpc: Handle both new style and old style reserve maps powerpc/hw_brk: Fix off by one error when validating DAWR region end powerpc/pseries: Support compression of oops text via pstore powerpc/pseries: Re-organise the oops compression code pstore: Pass header size in the pstore write callback powerpc/powernv: Fix iommu initialization again powerpc/pseries: Inform the hypervisor we are using EBB regs powerpc/perf: Add power8 EBB support powerpc/perf: Core EBB support for 64-bit book3s powerpc/perf: Drop MMCRA from thread_struct powerpc/perf: Don't enable if we have zero events ...
2013-07-03mm/hugetlb: use already existing interface huge_page_shiftWanpeng Li
Use the already existing interface huge_page_shift instead of h->order + PAGE_SHIFT. Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-01Merge tag 'v3.10' into nextBenjamin Herrenschmidt
Merge 3.10 in order to get some of the last minute powerpc changes, resolve conflicts and add additional fixes on top of them.
2013-06-21powerpc: Prevent gcc to re-read the pagetablesAneesh Kumar K.V
GCC is very likely to read the pagetables just once and cache them in the local stack or in a register, but it is can also decide to re-read the pagetables. The problem is that the pagetable in those places can change from under gcc. With THP/hugetlbfs the pmd (and pgd for hugetlbfs giga pages) can change under gup_fast. The pages won't be freed untill we finish gup fast because we have irq disabled and we free these pages via rcu callback. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Make linux pagetable walk safe with THP enabledAneesh Kumar K.V
We need to have irqs disabled to handle all the possible parallel update for linux page table without holding locks. Events that we are intersted in while walking page tables are 1) Page fault 2) umap 3) THP split 4) THP collapse A) local_irq_disabled: ------------------------ 1) page fault: A none to valid transition via page fault is not an issue because we would either see a none or valid. If it is none, we would error out the page table walk. We may need to use on stack values when checking for type of page table elements, because if we do if (!is_hugepd()) { if (!pmd_none() { if (pmd_bad() { We could take that bad condition because the pmd got converted to a hugepd after the !is_hugepd check via a hugetlb fault. The right way would be to check for pmd_none higher up or use on stack value. 2) A valid to none conversion via unmap: We can safely walk the upper level table, because we don't remove the the page table entries until rcu grace period. So even if we followed a wrong pointer we still have the pointer valid till the grace period. A PTE pointer returned need to be atomically checked for _PAGE_PRESENT and _PAGE_BUSY. A valid pointer returned could becoming none later. To prevent pte_clear we take _PAGE_BUSY. 3) THP split: A valid transparent hugepage is converted to nomal page. Before we split we do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING So when walking page table we need to check for pmd_trans_splitting and handle that. The pte returned should also need to be checked for _PAGE_SPLITTING before setting _PAGE_BUSY similar to _PAGE_PRESENT. We save the value of PTE on stack and check for the flag in the local pte value. If we don't have the value set we can safely operate on the local pte value and we atomicaly set _PAGE_BUSY. 4) THP collapse: A normal page gets converted to hugepage. In the collapse path, we mark the pmd none early (pmdp_clear_flush). With irq disabled, if we are aleady walking page table we would see the pmd_none and won't continue. If we see a valid PMD, we should still check for _PAGE_PRESENT before setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Replace find_linux_pte with find_linux_pte_or_hugepteAneesh Kumar K.V
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly document why we don't need to handle transparent hugepages at callsites. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepagesAneesh Kumar K.V
Reviewed-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-21powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common codeAneesh Kumar K.V
We will use this in the later patch for handling THP pages Reviewed-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-20powerpc: Fix bad pmd error with book3E configAneesh Kumar K.V
Book3E uses the hugepd at PMD level and don't encode pte directly at the pmd level. So it will find the lower bits of pmd set and the pmd_bad check throws error. Infact the current code will never take the free_hugepd_range call at all because it will clear the pmd if it find a hugepd pointer. Fix this by clearing bad pmd only if it is not a hugepd pointer. This is regression introduced by e2b3d202d1dba8f3546ed28224ce485bc50010be "powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format" Reported-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: Switch 16GB and 16MB explicit hugepages to a different page table ↵Aneesh Kumar K.V
format We will be switching PMD_SHIFT to 24 bits to facilitate THP impmenetation. With PMD_SHIFT set to 24, we now have 16MB huge pages allocated at PGD level. That means with 32 bit process we cannot allocate normal pages at all, because we cover the entire address space with one pgd entry. Fix this by switching to a new page table format for hugepages. With the new page table format for 16GB and 16MB hugepages we won't allocate hugepage directory. Instead we encode the PTE information directly at the directory level. This forces 16MB hugepage at PMD level. This will also make the page take walk much simpler later when we add the THP support. With the new table format we have 4 cases for pgds and pmds: (1) invalid (all zeroes) (2) pointer to next table, as normal; bottom 6 bits == 0 (3) leaf pte for huge page, bottom two bits != 00 (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30powerpc: New hugepage directory formatAneesh Kumar K.V
Change the hugepage directory format so that we can have leaf ptes directly at page directory avoiding the allocation of hugepage directory. With the new table format we have 3 cases for pgds and pmds: (1) invalid (all zeroes) (2) pointer to next table, as normal; bottom 6 bits == 0 (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table Instead of storing shift value in hugepd pointer we use mmu_psize_def index so that we can fit all the supported hugepage size in 4 bits Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-30mm: remove free_area_cache use in powerpc architectureMichel Lespinasse
As all other architectures have been converted to use vm_unmapped_area(), we are about to retire the free_area_cache. This change simply removes the use of that cache in slice_get_unmapped_area(), which will most certainly have a performance cost. Next one will convert that function to use the vm_unmapped_area() infrastructure and regain the performance. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-05-07powerpc: fix compile fail in hugetlb cmdline parsingPaul Gortmaker
Commit 9fb48c744ba6a4bf58b666f4e6fdac3008ea1bd4 "params: add 3rd arg to option handler callback signature" added an extra arg to the function, but didn't catch all the use cases needing it, causing this compile fail in mpc85xx_defconfig: arch/powerpc/mm/hugetlbpage.c:316:4: error: passing argument 7 of 'parse_args' from incompatible pointer type [-Werror] include/linux/moduleparam.h:317:12: note: expected 'int (*)(char *, char *, const char *)' but argument is of type 'int (*)(char *, char *)' This function has no need to printk out the "doing" value, so just add the arg as an "unused". Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jim Cromie <jim.cromie@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Becky Bruce <beckyb@kernel.crashing.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-28Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Avi Kivity: "Changes include timekeeping improvements, support for assigning host PCI devices that share interrupt lines, s390 user-controlled guests, a large ppc update, and random fixes." This is with the sign-off's fixed, hopefully next merge window we won't have rebased commits. * 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: Convert intx_mask_lock to spin lock KVM: x86: fix kvm_write_tsc() TSC matching thinko x86: kvmclock: abstract save/restore sched_clock_state KVM: nVMX: Fix erroneous exception bitmap check KVM: Ignore the writes to MSR_K7_HWCR(3) KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask KVM: PMU: add proper support for fixed counter 2 KVM: PMU: Fix raw event check KVM: PMU: warn when pin control is set in eventsel msr KVM: VMX: Fix delayed load of shared MSRs KVM: use correct tlbs dirty type in cmpxchg KVM: Allow host IRQ sharing for assigned PCI 2.3 devices KVM: Ensure all vcpus are consistent with in-kernel irqchip settings KVM: x86 emulator: Allow PM/VM86 switch during task switch KVM: SVM: Fix CPL updates KVM: x86 emulator: VM86 segments must have DPL 3 KVM: x86 emulator: Fix task switch privilege checks arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation KVM: mmu_notifier: Flush TLBs before releasing mmu_lock ...
2012-03-26params: <level>_initcall-like kernel parametersPawel Moll
This patch adds a set of macros that can be used to declare kernel parameters to be parsed _before_ initcalls at a chosen level are executed. We rename the now-unused "flags" field of struct kernel_param as the level. It's signed, for when we use this for early params as well, in future. Linker macro collating init calls had to be modified in order to add additional symbols between levels that are later used by the init code to split the calls into blocks. Signed-off-by: Pawel Moll <pawel.moll@arm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-03-20powerpc: remove the second argument of k[un]map_atomic()Cong Wang
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-05KVM: PPC: Implement MMU notifiers for Book3S HV guestsPaul Mackerras
This adds the infrastructure to enable us to page out pages underneath a Book3S HV guest, on processors that support virtualized partition memory, that is, POWER7. Instead of pinning all the guest's pages, we now look in the host userspace Linux page tables to find the mapping for a given guest page. Then, if the userspace Linux PTE gets invalidated, kvm_unmap_hva() gets called for that address, and we replace all the guest HPTEs that refer to that page with absent HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit set, which will cause an HDSI when the guest tries to access them. Finally, the page fault handler is extended to reinstantiate the guest HPTE when the guest tries to access a page which has been paged out. Since we can't intercept the guest DSI and ISI interrupts on PPC970, we still have to pin all the guest pages on PPC970. We have a new flag, kvm->arch.using_mmu_notifiers, that indicates whether we can page guest pages out. If it is not set, the MMU notifier callbacks do nothing and everything operates as before. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-01-06Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (185 commits) powerpc: fix compile error with 85xx/p1010rdb.c powerpc: fix compile error with 85xx/p1023_rds.c powerpc/fsl: add MSI support for the Freescale hypervisor arch/powerpc/sysdev/fsl_rmu.c: introduce missing kfree powerpc/fsl: Add support for Integrated Flash Controller powerpc/fsl: update compatiable on fsl 16550 uart nodes powerpc/85xx: fix PCI and localbus properties in p1022ds.dts powerpc/85xx: re-enable ePAPR byte channel driver in corenet32_smp_defconfig powerpc/fsl: Update defconfigs to enable some standard FSL HW features powerpc: Add TBI PHY node to first MDIO bus sbc834x: put full compat string in board match check powerpc/fsl-pci: Allow 64-bit PCIe devices to DMA to any memory address powerpc: Fix unpaired probe_hcall_entry and probe_hcall_exit offb: Fix setting of the pseudo-palette for >8bpp offb: Add palette hack for qemu "standard vga" framebuffer offb: Fix bug in calculating requested vram size powerpc/boot: Change the WARN to INFO for boot wrapper overlap message powerpc/44x: Fix build error on currituck platform powerpc/boot: Change the load address for the wrapper to fit the kernel powerpc/44x: Enable CRASH_DUMP for 440x ... Fix up a trivial conflict in arch/powerpc/include/asm/cputime.h due to the additional sparse-checking code for cputime_t.
2011-12-07powerpc: Add gpages reservation code for 64-bit FSL BOOKEBecky Bruce
For 64-bit FSL_BOOKE implementations, gigantic pages need to be reserved at boot time by the memblock code based on the command line. This adds the call that handles the reservation, and fixes some code comments. It also removes the previous pr_err when reserve_hugetlb_gpages is called on a system without hugetlb enabled - the way the code is structured, the call is unconditional and the resulting error message spurious and confusing. Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07powerpc: hugetlb: modify include usage for FSL BookE codeBecky Bruce
The original 32-bit hugetlb implementation used PPC64 vs PPC32 to determine which code path to take. However, the final hugetlb implementation for 64-bit FSL ended up shared with the FSL 32-bit code so the actual check needs to be FSL_BOOK3E vs everything else. This patch changes the include protections to reflect this. There are also a couple of related comment fixes. Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07powerpc: Update hugetlb huge_pte_alloc and tablewalk code for FSL BOOKEBecky Bruce
This updates the hugetlb page table code to handle 64-bit FSL_BOOKE. The previous 32-bit work counted on the inner levels of the page table collapsing. Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07powerpc: Only define HAVE_ARCH_HUGETLB_UNMAPPED_AREA if PPC_MM_SLICESBecky Bruce
If we don't have slices, we should be able to use the generic hugetlb_get_unmapped_area() code Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25powerpc: Fix compiliation with hugetlbfs enabledKumar Gala
arch/powerpc/mm/hugetlbpage.c: In function 'reserve_hugetlb_gpages': arch/powerpc/mm/hugetlbpage.c:312:2: error: implicit declaration of function 'parse_args' Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-06Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits) powerpc/p3060qds: Add support for P3060QDS board powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX powerpc/85xx: Make kexec to interate over online cpus powerpc/fsl_booke: Fix comment in head_fsl_booke.S powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver powerpc/86xx: Correct Gianfar support for GE boards powerpc/cpm: Clear muram before it is in use. drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager powerpc/fsl_msi: add support for "msi-address-64" property powerpc/85xx: Setup secondary cores PIR with hard SMP id powerpc/fsl-booke: Fix settlbcam for 64-bit powerpc/85xx: Adding DCSR node to dtsi device trees powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards powerpc/85xx: fix PHYS_64BIT selection for P1022DS powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map powerpc: respect mem= setting for early memory limit setup powerpc: Update corenet64_smp_defconfig powerpc: Update mpc85xx/corenet 32-bit defconfigs ... Fix up trivial conflicts in: - arch/powerpc/configs/40x/hcu4_defconfig removed stale file, edited elsewhere - arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c: added opal and gelic drivers vs added ePAPR driver - drivers/tty/serial/8250.c moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support
2011-11-02thp: share get_huge_page_tail()Andrea Arcangeli
This avoids duplicating the function in every arch gup_fast. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02powerpc: gup_huge_pmd() return 0 if pte changesAndrea Arcangeli
powerpc didn't return 0 in that case, if it's rolling back the *nr pointer it should also return zero to avoid adding pages to the array at the wrong offset. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Gibson <david@gibson.dropbear.id.au> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02powerpc: gup_hugepte() support THP based tail recountingAndrea Arcangeli
Up to this point the code assumed old refcounting for hugepages (pre-thp). This updates the code directly to the thp mapcount tail page refcounting. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02powerpc: gup_hugepte() avoid freeing the head page too many timesAndrea Arcangeli
We only taken "refs" pins on the head page not "*nr" pins. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02powerpc: get_hugepte() don't put_page() the wrong pageAndrea Arcangeli
"page" may have changed to point to the next hugepage after the loop completed, The references have been taken on the head page, so the put_page must happen there too. This is a longstanding issue pre-thp inclusion. It's totally unclear how these page_cache_add_speculative and pte_val(pte) != pte_val(*ptep) checks are necessary across all the powerpc gup_fast code, when x86 doesn't need any of that: there's no way the page can be freed with irq disabled so we're guaranteed the atomic_inc will happen on a page with page_count > 0 (so not needing the speculative check). The pte check is also meaningless on x86: no need to rollback on x86 if the pte changed, because the pte can still change a CPU tick after the check succeeded and it won't be rolled back in that case. The important thing is we got a reference on a valid page that was mapped there a CPU tick ago. So not knowing the soft tlb refill code of ppc64 in great detail I'm not removing the "speculative" page_count increase and the pte checks across all the code, but unless there's a strong reason for it they should be later cleaned up too. If a pte can change from huge to non-huge (like it could happen with THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs only so I'm not altering that. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-23powerpc: Fix hugetlb with CONFIG_PPC_MM_SLICES=yPaul Mackerras
Commit 41151e77a4 ("powerpc: Hugetlb for BookE") added some #ifdef CONFIG_MM_SLICES conditionals to hugetlb_get_unmapped_area() and vma_mmu_pagesize(). Unfortunately this is not the correct config symbol; it should be CONFIG_PPC_MM_SLICES. The result is that attempting to use hugetlbfs on 64-bit Power server processors results in an infinite stack recursion between get_unmapped_area() and hugetlb_get_unmapped_area(). This fixes it by changing the #ifdef to use CONFIG_PPC_MM_SLICES in those functions and also in book3e_hugetlb_preload(). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20powerpc: Hugetlb for BookEBecky Bruce
Enable hugepages on Freescale BookE processors. This allows the kernel to use huge TLB entries to map pages, which can greatly reduce the number of TLB misses and the amount of TLB thrashing experienced by applications with large memory footprints. Care should be taken when using this on FSL processors, as the number of large TLB entries supported by the core is low (16-64) on current processors. The supported set of hugepage sizes include 4m, 16m, 64m, 256m, and 1g. Page sizes larger than the max zone size are called "gigantic" pages and must be allocated on the command line (and cannot be deallocated). This is currently only fully implemented for Freescale 32-bit BookE processors, but there is some infrastructure in the code for 64-bit BooKE. Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-27powerpc: Free up some CPU feature bits by moving out MMU-related featuresMatt Evans
Some of the 64bit PPC CPU features are MMU-related, so this patch moves them to MMU_FTR_ bits. All cpu_has_feature()-style tests are moved to mmu_has_feature(), and seven feature bits are freed as a result. Signed-off-by: Matt Evans <matt@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2009-11-27powerpc/mm: Fix bug in gup_hugepd()David Gibson
Commit a4fe3ce7699bfe1bd88f816b55d42d8fe1dac655 introduced a new get_user_pages() path for hugepages on powerpc. Unfortunately, there is a bug in it's loop logic, which can cause it to overrun the end of the intended region. This came about by copying the logic from the normal page path, which assumes the address and end parameters have been pagesize aligned at the top-level. Since they're not *hugepage* size aligned, the simplistic logic could step over the end of the gup region without triggering the loop end condition. This patch fixes the bug by using the technique that the normal page path uses in levels above the lowest to truncate the ending address to something we know we'll match with. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30powerpc/mm: Bring hugepage PTE accessor functions back into sync with normal ↵David Gibson
accessors The hugepage arch code provides a number of hook functions/macros which mirror the functionality of various normal page pte access functions. Various changes in the normal page accessors (in particular BenH's recent changes to the handling of lazy icache flushing and PAGE_EXEC) have caused the hugepage versions to get out of sync with the originals. In some cases, this is a bug, at least on some MMU types. One of the reasons that some hooks were not identical to the normal page versions, is that the fact we're dealing with a hugepage needed to be passed down do use the correct dcache-icache flush function. This patch makes the main flush_dcache_icache_page() function hugepage aware (by checking for the PageCompound flag). That in turn means we can make set_huge_pte_at() just a call to set_pte_at() bringing it back into sync. As a bonus, this lets us remove the hash_huge_page_do_lazy_icache() function, replacing it with a call to the hash_page_do_lazy_icache() function it was based on. Some other hugepage pte access hooks - huge_ptep_get_and_clear() and huge_ptep_clear_flush() - are not so easily unified, but this patch at least brings them back into sync with the current versions of the corresponding normal page functions. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30powerpc/mm: Split hash MMU specific hugepage code into a new fileDavid Gibson
This patch separates the parts of hugetlbpage.c which are inherently specific to the hash MMU into a new hugelbpage-hash64.c file. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30powerpc/mm: Cleanup initialization of hugepages on powerpcDavid Gibson
This patch simplifies the logic used to initialize hugepages on powerpc. The somewhat oddly named set_huge_psize() is renamed to add_huge_page_size() and now does all necessary verification of whether it's given a valid hugepage sizes (instead of just some) and instantiates the generic hstate structure (but no more). hugetlbpage_init() now steps through the available pagesizes, checks if they're valid for hugepages by calling add_huge_page_size() and initializes the kmem_caches for the hugepage pagetables. This means we can now eliminate the mmu_huge_psizes array, since we no longer need to pass the sizing information for the pagetable caches from set_huge_psize() into hugetlbpage_init() Determination of the default huge page size is also moved from the hash code into the general hugepage code. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30powerpc/mm: Allow more flexible layouts for hugepage pagetablesDavid Gibson
Currently each available hugepage size uses a slightly different pagetable layout: that is, the bottem level table of pointers to hugepages is a different size, and may branch off from the normal page tables at a different level. Every hugepage aware path that needs to walk the pagetables must therefore look up the hugepage size from the slice info first, and work out the correct way to walk the pagetables accordingly. Future hardware is likely to add more possible hugepage sizes, more layout options and more mess. This patch, therefore reworks the handling of hugepage pagetables to reduce this complexity. In the new scheme, instead of having to consult the slice mask, pagetable walking code can check a flag in the PGD/PUD/PMD entries to see where to branch off to hugepage pagetables, and the entry also contains the information (eseentially hugepage shift) necessary to then interpret that table without recourse to the slice mask. This scheme can be extended neatly to handle multiple levels of self-describing "special" hugepage pagetables, although for now we assume only one level exists. This approach means that only the pagetable allocation path needs to know how the pagetables should be set out. All other (hugepage) pagetable walking paths can just interpret the structure as they go. There already was a flag bit in PGD/PUD/PMD entries for hugepage directory pointers, but it was only used for debug. We alter that flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable pointer (normally it would be 1 since the pointer lies in the linear mapping). This means that asm pagetable walking can test for (and punt on) hugepage pointers with the same test that checks for unpopulated page directory entries (beq becomes bge), since hugepage pointers will always be positive, and normal pointers always negative. While we're at it, we get rid of the confusing (and grep defeating) #defining of hugepte_shift to be the same thing as mmu_huge_psizes. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30powerpc/mm: Cleanup management of kmem_caches for pagetablesDavid Gibson
Currently we have a fair bit of rather fiddly code to manage the various kmem_caches used to store page tables of various levels. We generally have two caches holding some combination of PGD, PUD and PMD tables, plus several more for the special hugepage pagetables. This patch cleans this all up by taking a different approach. Rather than the caches being designated as for PUDs or for hugeptes for 16M pages, the caches are simply allocated to be a specific size. Thus sharing of caches between different types/levels of pagetables happens naturally. The pagetable size, where needed, is passed around encoded in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the pagetable contains 2^n pointers. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30powerpc/mm: Make hpte_need_flush() correctly mask for multiple page sizesDavid Gibson
Currently, hpte_need_flush() only correctly flushes the given address for normal pages. Callers for hugepages are required to mask the address themselves. But hpte_need_flush() already looks up the page sizes for its own reasons, so this is a rather silly imposition on the callers. This patch alters it to mask based on the pagesize it has looked up itself, and removes the awkward masking code in the hugepage caller. Signed-off-by: David Gibson <dwg@au1.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-08-20powerpc: Add memory management headers for new 64-bit BookEBenjamin Herrenschmidt
This adds the PTE and pgtable format definitions, along with changes to the kernel memory map and other definitions related to implementing support for 64-bit Book3E. This also shields some asm-offset bits that are currently only relevant on 32-bit We also move the definition of the "linux" page size constants to the common mmu.h file and add a few sizes that are relevant to embedded processors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-07-27mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()Benjamin Herrenschmidt
mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() Upcoming paches to support the new 64-bit "BookE" powerpc architecture will need to have the virtual address corresponding to PTE page when freeing it, due to the way the HW table walker works. Basically, the TLB can be loaded with "large" pages that cover the whole virtual space (well, sort-of, half of it actually) represented by a PTE page, and which contain an "indirect" bit indicating that this TLB entry RPN points to an array of PTEs from which the TLB can then create direct entries. Thus, in order to invalidate those when PTE pages are deleted, we need the virtual address to pass to tlbilx or tlbivax instructions. The old trick of sticking it somewhere in the PTE page struct page sucks too much, the address is almost readily available in all call sites and almost everybody implemets these as macros, so we may as well add the argument everywhere. I added it to the pmd and pud variants for consistency. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV] Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: report the MMU pagesize in /proc/pid/smapsMel Gorman
The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the kernel to back a VMA. This matches the size used by the MMU in the majority of cases. However, one counter-example occurs on PPC64 kernels whereby a kernel using 64K as a base pagesize may still use 4K pages for the MMU on older processor. To distinguish, this patch reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-16Merge branch 'merge' into nextPaul Mackerras
2008-12-16powerpc: Check for valid hugepage size in hugetlb_get_unmapped_areaBrian King
It looks like most of the hugetlb code is doing the correct thing if hugepages are not supported, but the mmap code is not. If we get into the mmap code when hugepages are not supported, such as in an LPAR which is running Active Memory Sharing, we can oops the kernel. This fixes the oops being seen in this path. oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=1024 NUMA pSeries Modules linked in: nfs(N) lockd(N) nfs_acl(N) sunrpc(N) ipv6(N) fuse(N) loop(N) dm_mod(N) sg(N) ibmveth(N) sd_mod(N) crc_t10dif(N) ibmvscsic(N) scsi_transport_srp(N) scsi_tgt(N) scsi_mod(N) Supported: No NIP: c000000000038d60 LR: c00000000003945c CTR: c0000000000393f0 REGS: c000000077e7b830 TRAP: 0300 Tainted: G (2.6.27.5-bz50170-2-ppc64) MSR: 8000000000009032 <EE,ME,IR,DR> CR: 44000448 XER: 20000001 DAR: c000002000af90a8, DSISR: 0000000040000000 TASK = c00000007c1b8600[4019] 'hugemmap01' THREAD: c000000077e78000 CPU: 6 GPR00: 0000001fffffffe0 c000000077e7bab0 c0000000009a4e78 0000000000000000 GPR04: 0000000000010000 0000000000000001 00000000ffffffff 0000000000000001 GPR08: 0000000000000000 c000000000af90c8 0000000000000001 0000000000000000 GPR12: 000000000000003f c000000000a73880 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000010000 GPR20: 0000000000000000 0000000000000003 0000000000010000 0000000000000001 GPR24: 0000000000000003 0000000000000000 0000000000000001 ffffffffffffffb5 GPR28: c000000077ca2e80 0000000000000000 c00000000092af78 0000000000010000 NIP [c000000000038d60] .slice_get_unmapped_area+0x6c/0x4e0 LR [c00000000003945c] .hugetlb_get_unmapped_area+0x6c/0x80 Call Trace: [c000000077e7bbc0] [c00000000003945c] .hugetlb_get_unmapped_area+0x6c/0x80 [c000000077e7bc30] [c000000000107e30] .get_unmapped_area+0x64/0xd8 [c000000077e7bcb0] [c00000000010b140] .do_mmap_pgoff+0x140/0x420 [c000000077e7bd80] [c00000000000bf5c] .sys_mmap+0xc4/0x140 [c000000077e7be30] [c0000000000086b4] syscall_exit+0x0/0x40 Instruction dump: fac1ffb0 fae1ffb8 fb01ffc0 fb21ffc8 fb41ffd0 fb61ffd8 fb81ffe0 fbc1fff0 fbe1fff8 f821fef1 f8c10158 f8e10160 <7d49002e> f9010168 e92d01b0 eb4902b0 Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-12-03Merge branch 'merge'Paul Mackerras
2008-11-30powerpc set_huge_psize() false positiveAl Viro
called only from __init, calls __init. Incidentally, it ought to be static in file. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06powerpc: Hugetlb pgtable cache access cleanupJon Tollefson
Andrew Morton suggested that using a macro that makes an array reference look like a function call makes it harder to understand the code. This therefore removes the huge_pgtable_cache(psize) macro and replaces its uses with pgtable_cache[HUGE_PGTABLE_INDEX(psize)]. Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-09-15powerpc: Clean up hugepage pagetable allocation for powerpc with 16G pagesDavid Gibson
There is a small bug in the handling of 16G hugepages recently added to the kernel. This doesn't cause a crash or other user-visible problems, but it does mean that more levels of pagetable are allocated than makes sense for 16G pages. The hugepage pagetables for the 16G pages are allocated much lower in the pagetable tree than they should be, with the intervening levels allocated with full pmd and pud pages which will only ever have one entry filled in. This corrects this problem, at the same time cleaning up the handling of which level 64k versus 16M hugepage pagetables are allocated at. The new way of formatting the tests should be more robust against changes in pagetable structure, or any newly added hugepage sizes. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>