From: Baolin Wang
To: David Hildenbrand, linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, Andrew Morton, Muchun Song, Peter Xu, Oscar Salvador, stable@vger.kernel.org
Subject: Re: [PATCH v1 2/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking
Date: Mon, 29 Jul 2024 09:48:59 +0800
Message-ID: <2a9fa612-875e-4210-9cd5-a984e9e5cbf7@linux.alibaba.com>
In-Reply-To: <1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com>
References: <20240725183955.2268884-1-david@redhat.com> <20240725183955.2268884-3-david@redhat.com> <0067dfe6-b9a6-4e98-9eef-7219299bfe58@linux.alibaba.com> <4439d559-5acf-4688-a1ad-7626bf027027@linux.alibaba.com> <1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com>

On 2024/7/26 19:40, David Hildenbrand wrote:
> On 26.07.24 11:38, Baolin Wang wrote:
>>
>>
>> On 2024/7/26 16:04, David Hildenbrand wrote:
>>> On 26.07.24 04:33, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/7/26 02:39, David Hildenbrand wrote:
>>>>> We recently made GUP's common page table walking code also walk
>>>>> hugetlb VMAs without most hugetlb special-casing, preparing for the
>>>>> future of having less hugetlb-specific page table walking code in the
>>>>> codebase. Turns out that we missed one page table locking detail: page
>>>>> table locking for hugetlb folios that are not mapped using a single
>>>>> PMD/PUD.
>>>>>
>>>>> Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB
>>>>> hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks
>>>>> the page tables, will perform a pte_offset_map_lock() to grab the PTE
>>>>> table lock.
>>>>>
>>>>> However, hugetlb that concurrently modifies these page tables would
>>>>> actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the
>>>>> locks would differ. Something similar can happen right now with
>>>>> hugetlb folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.
>>>>>
>>>>> Let's make huge_pte_lockptr() effectively use the same PT locks as
>>>>> any core-mm page table walker would.
>>>>
>>>> Thanks for raising the issue again.
>>>> I remember fixing this issue 2 years
>>>> ago in commit fac35ba763ed ("mm/hugetlb: fix races when looking up a
>>>> CONT-PTE/PMD size hugetlb page"), but it seems to be broken again.
>>>>
>>>
>>> Ah, right! We fixed it by rerouting to hugetlb code that we then
>>> removed :D
>>>
>>> Did we have a reproducer back then that would make my life easier?
>>
>> I don't have any reproducers right now. I remember I added some ugly
>> hack code (adding delay() etc.) in the kernel to analyze this issue, and
>> it was not easy to reproduce. :(
>
> Hah!
>
> I tried with 32MB without luck -- migration simply takes too long. 64KB
> did the trick within seconds!
>
>
> On a VM with 2 vNUMA nodes, after reserving a bunch of 64KiB hugetlb pages:
>
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 0 size: 64439 MB
> node 0 free: 64097 MB
> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 1 size: 64205 MB
> node 1 free: 63809 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> # echo 100 > /sys/kernel/mm/hugepages/hugepages-64kB/nr_hugepages
> # gcc reproducer.c -o reproducer -O3 -lnuma -lpthread
> # ./reproducer
> [ 3105.936100] ------------[ cut here ]------------
> [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
> [ 3105.944634] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill
> ip_set nf_tables qrtr sunrpc vfat fat ext4 mbcache jbd2 virtio_net
> net_failover failover virtio_balloon dimlib loop dm_multipath nfnetlink
> zram xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk
> virtio_console virtio_mmio dm_mirror dm_region_hash dm_log dm_mod fuse
> [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
> [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
> [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 3105.991108] pc : try_grab_folio+0x11c/0x188
> [ 3105.994013] lr : follow_page_pte+0xd8/0x430
> [ 3105.996986] sp : ffff80008eafb8f0
> [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
> [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
> [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
> [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
> [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
> [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
> [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
> [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
> [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
> [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
> [ 3106.047957] Call trace:
> [ 3106.049522]  try_grab_folio+0x11c/0x188
> [ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
> [ 3106.055527]  follow_page_mask+0x1a0/0x2b8
> [ 3106.058118]  __get_user_pages+0xf0/0x348
> [ 3106.060647]  faultin_page_range+0xb0/0x360
> [ 3106.063651]  do_madvise+0x340/0x598
> ...
>
>
>
> # cat reproducer.c
> #include <stdio.h>
> #include <string.h>
> #include <errno.h>
> #include <pthread.h>
> #include <sys/mman.h>
> #include <linux/mman.h>
> #include <numaif.h>
>
> #define SIZE_64KB (64 * 1024ul)
>
> static void *thread_fn(void *arg)
> {
>         char *mem = arg;
>
>         /* Let GUP go crazy on the page without grabbing a reference. */
>         while (1) {
>                 if (madvise(mem, SIZE_64KB, MADV_POPULATE_WRITE)) {
>                         fprintf(stderr, "madvise() failed: %d\n", errno);
>                 }
>         }
> }
>
> int main(void)
> {
>         pthread_t thread;
>         unsigned int i;
>         char *mem;
>
>         mem = mmap(0, SIZE_64KB, PROT_READ|PROT_WRITE,
>                    MAP_ANON|MAP_PRIVATE|MAP_HUGETLB|MAP_HUGE_64KB, -1, 0);
>         if (mem == MAP_FAILED) {
>                 fprintf(stderr, "mmap() failed: %d\n", errno);
>                 return -1;
>         }
>
>         memset(mem, 0, SIZE_64KB);
>
>         pthread_create(&thread, NULL, thread_fn, mem);
>
>         /* Migrate it back and forth between two nodes. */
>         for (i = 0; ; i++) {
>                 void *pages[1] = { mem, };
>                 int nodes[1] = { i % 2, };
>                 int status[1] = { 0 };
>
>                 if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE_ALL))
>                         fprintf(stderr, "move_pages failed: %d\n", errno);
>                 if (status[0] != nodes[0])
>                         printf("migration failed\n");
>         }
>         return 0;
> }

Great! Thanks for creating the reproducer (I will also try it if I find some time).
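
For anyone following the thread, the locking mismatch described in the quoted commit message can be modelled in plain userspace C. This is only an analogy under stated assumptions: struct pt_table, struct mm_model and all function names below are hypothetical stand-ins, not kernel definitions. It shows that a walker taking the per-table lock and a hugetlb path taking the mm-wide lock never exclude each other, while routing both through the same lock selection does.

```c
/* Userspace model only: every type and function name here is a
 * hypothetical stand-in, not the kernel's actual definition. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct pt_table {                        /* stands in for one PTE table */
        pthread_mutex_t ptl;             /* per-table (split) lock */
};

struct mm_model {                        /* stands in for the mm */
        pthread_mutex_t page_table_lock; /* single mm-wide lock */
};

/* Generic walker with split PTE locks: uses the lock of the table. */
static pthread_mutex_t *walker_lockptr(struct mm_model *mm, struct pt_table *t)
{
        (void)mm;
        return &t->ptl;
}

/* Behaviour described as broken: for a folio spanning multiple PTEs,
 * the hugetlb side falls back to the mm-wide lock. */
static pthread_mutex_t *hugetlb_lockptr_old(struct mm_model *mm, struct pt_table *t)
{
        (void)t;
        return &mm->page_table_lock;
}

/* Idea of the fix: select the same lock the walker would select. */
static pthread_mutex_t *hugetlb_lockptr_fixed(struct mm_model *mm, struct pt_table *t)
{
        return walker_lockptr(mm, t);
}

/* Two code paths only exclude each other if they take the same mutex. */
static bool serialized(pthread_mutex_t *a, pthread_mutex_t *b)
{
        return a == b;
}

/* Small self-check of the model; only lock addresses are compared. */
static void check_model(void)
{
        struct mm_model mm;
        struct pt_table t;

        assert(!serialized(walker_lockptr(&mm, &t), hugetlb_lockptr_old(&mm, &t)));
        assert(serialized(walker_lockptr(&mm, &t), hugetlb_lockptr_fixed(&mm, &t)));
}
```

Calling check_model() from a test program passes silently; the first assertion corresponds to the window in which the reproducer above can observe a page table entry changing underneath GUP.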