From: Barry Song <21cnbao@gmail.com>
Date: Thu, 20 Feb 2025 22:31:38 +1300
Subject: Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
To: David Hildenbrand
Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm@kvack.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, zhengtangquan@oppo.com, Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen, Brian Geffon, Christian Brauner, Hugh Dickins, Jann Horn, Kalesh Singh, "Liam R . Howlett", Matthew Wilcox, Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao
References: <20250219112519.92853-1-21cnbao@gmail.com> <50566d42-7754-4017-b290-f29d92e69231@redhat.com>
In-Reply-To: <50566d42-7754-4017-b290-f29d92e69231@redhat.com>

On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand wrote:
>
> On 19.02.25 21:37, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan wrote:
> >>
> >> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>
> >>> From: Barry Song
> >>>
> >>> userfaultfd_move() checks whether the PTE entry is present or a
> >>> swap entry.
> >>>
> >>> - If the PTE entry is present, move_present_pte() handles folio
> >>>   migration by setting:
> >>>
> >>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
> >>>
> >>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >>>   the PTE to the new dst_addr.
> >>>
> >>> This approach is incorrect because even if the PTE is a swap
> >>> entry, it can still reference a folio that remains in the swap
> >>> cache.
> >>>
> >>> If do_swap_page() is triggered, it may locate the folio in the
> >>> swap cache. However, during add_rmap operations, a kernel panic
> >>> can occur due to:
> >>>  page_pgoff(folio, page) != linear_page_index(vma, address)
> >>
> >> Thanks for the report and reproducer!
> >>
> >>>
> >>> $./a.out > /dev/null
> >>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> >>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> >>> [   13.337716] memcg:ffff00000405f000
> >>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> >>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> >>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> >>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> >>> [   13.340190] ------------[ cut here ]------------
> >>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> >>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> >>> [   13.340969] Modules linked in:
> >>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> >>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> >>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> >>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> >>> [   13.342018] sp : ffff80008752bb20
> >>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> >>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> >>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> >>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> >>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> >>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> >>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> >>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> >>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> >>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> >>> [   13.343876] Call trace:
> >>> [   13.344045] __page_check_anon_rmap+0xa0/0xb0 (P)
> >>> [   13.344234] folio_add_anon_rmap_ptes+0x22c/0x320
> >>> [   13.344333] do_swap_page+0x1060/0x1400
> >>> [   13.344417] __handle_mm_fault+0x61c/0xbc8
> >>> [   13.344504] handle_mm_fault+0xd8/0x2e8
> >>> [   13.344586] do_page_fault+0x20c/0x770
> >>> [   13.344673] do_translation_fault+0xb4/0xf0
> >>> [   13.344759] do_mem_abort+0x48/0xa0
> >>> [   13.344842] el0_da+0x58/0x130
> >>> [   13.344914] el0t_64_sync_handler+0xc4/0x138
> >>> [   13.345002] el0t_64_sync+0x1ac/0x1b0
> >>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> >>> [   13.345504] ---[ end trace 0000000000000000 ]---
> >>> [   13.345715] note: a.out[107] exited with irqs disabled
> >>> [   13.345954] note: a.out[107] exited with preempt_count 2
> >>>
> >>> Fully fixing it would be quite complex, requiring similar handling
> >>> of folios as done in move_present_pte.
> >>
> >> How complex would that be? Is it a matter of adding
> >> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> >> folio->index = linear_page_index like in move_present_pte() or
> >> something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> >         if (folio_test_writeback(folio))
> >                 return -EBUSY;
> >
> > This is likely true for swapcache, right? However, even for move_present_pte(),
> > it simply returns -EBUSY:
> >
> > move_pages_pte()
> > {
> >                 /* at this point we have src_folio locked */
> >                 if (folio_test_large(src_folio)) {
> >                         /* split_folio() can block */
> >                         pte_unmap(&orig_src_pte);
> >                         pte_unmap(&orig_dst_pte);
> >                         src_pte = dst_pte = NULL;
> >                         err = split_folio(src_folio);
> >                         if (err)
> >                                 goto out;
> >
> >                         /* have to reacquire the folio after it got split */
> >                         folio_unlock(src_folio);
> >                         folio_put(src_folio);
> >                         src_folio = NULL;
> >                         goto retry;
> >                 }
> > }
> >
> > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > By the way, I have also reported that userfaultfd_move() has a fundamental
> > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > kernel. In this scenario, folios in the virtual zone won’t be split in
> > split_folio(). Instead, the large folio migrates into nr_pages small folios.
> >
> > Thus, the best-case scenario would be:
> >
> > mTHP -> migrate to small folios in split_folio() -> move small folios to
> > dst_addr
> >
> > While this works, it negates the performance benefits of
> > userfaultfd_move(), as it introduces two PTE operations (migration in
> > split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> > allocations, and still requires one memcpy(). This could end up
> > performing even worse than userfaultfd_copy(), I guess.
> >
> > The worst-case scenario would be failing to allocate small folios in
> > split_folio(), then userfaultfd_move() might return -ENOMEM?
>
> Although that's an Android problem and not an upstream problem, I'll
> note that there are other reasons why the split / move might fail, and
> user space either must retry or fallback to a COPY.
>
> Regarding mTHP, we could move the whole folio if the user space-provided
> range allows for batching over multiple PTEs (nr_ptes), they are in a
> single VMA, and folio_mapcount() == nr_ptes.
>
> There are corner cases to handle, such as moving mTHPs such that they
> suddenly cross two page tables I assume, that are harder to handle when
> not moving individual PTEs where that cannot happen.

This is a useful suggestion. I’ve heard that Lokesh is also interested in
modifying ART to perform moves at the mTHP granularity, which would require
kernel modifications as well. It’s likely the direction we’ll take after
fixing the current urgent bugs. The current split_folio() really isn’t ideal.

The corner cases you mentioned are definitely worth considering. However,
once we can perform batch UFFDIO_MOVE, I believe that in most cases,
the conflict between userfaultfd_move() and TAO will be resolved?

For those corner cases, ART will still need to be fully aware that falling
back to copy or retrying is necessary.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
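
For illustration only, the conditions quoted above for moving a whole mTHP
(the user range covers nr_ptes, the PTEs sit in a single VMA,
folio_mapcount() == nr_ptes) and the page-table corner case could be modelled
in userspace roughly as below. Everything here is an assumption made for the
sketch, not kernel code: struct batch_move, within_one_page_table() and
can_move_whole_folio() are hypothetical names, 4 KiB pages with one PTE table
spanning 2 MiB are assumed, only the source VMA is checked, and a range that
crosses a page table is conservatively treated as "not batchable" so that the
caller would fall back to per-PTE moves, a retry, or a copy.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12UL                    /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PMD_SHIFT  21UL                    /* assumption: one PTE table maps 2 MiB */
#define PMD_SIZE   (1UL << PMD_SHIFT)
#define PMD_MASK   (~(PMD_SIZE - 1))

/* Hypothetical description of one candidate batch; not a kernel structure. */
struct batch_move {
	unsigned long src_addr;      /* first source address, page aligned */
	unsigned long dst_addr;      /* first destination address, page aligned */
	unsigned long len;           /* bytes user space asked to move from src_addr */
	unsigned long vma_start;     /* bounds of the source VMA */
	unsigned long vma_end;
	unsigned int nr_ptes;        /* consecutive PTEs mapping the folio */
	unsigned int folio_mapcount; /* value folio_mapcount() would report */
};

/* True if nr_pages pages starting at addr are covered by one PTE table. */
static bool within_one_page_table(unsigned long addr, unsigned int nr_pages)
{
	unsigned long last = addr + (unsigned long)nr_pages * PAGE_SIZE - 1;

	return (addr & PMD_MASK) == (last & PMD_MASK);
}

static bool can_move_whole_folio(const struct batch_move *m)
{
	unsigned long span = (unsigned long)m->nr_ptes * PAGE_SIZE;

	if (m->nr_ptes < 2)
		return false;	/* nothing to batch */
	if (m->len < span)
		return false;	/* user range does not cover all nr_ptes */
	if (m->src_addr < m->vma_start || m->src_addr + span > m->vma_end)
		return false;	/* PTEs are not all inside a single VMA */
	if (m->folio_mapcount != m->nr_ptes)
		return false;	/* folio is also mapped somewhere else */

	/* Corner case: the batch must not straddle two page tables. */
	return within_one_page_table(m->src_addr, m->nr_ptes) &&
	       within_one_page_table(m->dst_addr, m->nr_ptes);
}
```

With these assumed constants, a 4-page batch whose destination starts at
0x5fe000 ends at 0x601fff and so crosses the 0x600000 boundary; the predicate
rejects it, which corresponds to the case where user space would still have to
retry or fall back to a copy.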