From: Zhaoyang Huang
Date: Wed, 4 Jun 2025 09:04:09 +0800
Subject: Re: reply: [RFC] pin_user_pages_fast failure count increased
To: David Hildenbrand
Cc: Hyesoo Yu, jaewon31.kim@samsung.com, John Hubbard, zhaoyang.huang@unisoc.com, surenb@google.com, Steve.Kang@unisoc.com, Jaewon Kim, linux-mm@kvack.org, Jang-Hyuck Kim
In-Reply-To: <223cf8fc-7743-497f-893c-37ac689af002@redhat.com>

On Tue, Jun 3, 2025 at 9:12 PM David Hildenbrand wrote:
>
> On 28.05.25 12:59, Zhaoyang Huang wrote:
> > On Wed, May 28, 2025 at 3:55 PM David Hildenbrand wrote:
> >>
> >> On 28.05.25 05:36, Hyesoo Yu wrote:
> >>> On Wed, May 28, 2025 at 10:49:36AM +0800, Zhaoyang Huang wrote:
> >>>> On Wed, May 28, 2025 at 9:25 AM Hyesoo Yu wrote:
> >>>>>
> >>>>> On Mon, May 26, 2025 at 07:49:57PM +0800, Zhaoyang Huang wrote:
> >>>>>
> >>>>> Hello, Zhaoyang.
> >>>>>
> >>>>> I don't believe commit 1aaf8c was just intended to prevent an infinite loop.
> >>>>> The commit was introduced to allow pinning CMA memory in the pKVM on AOSP.
> >>>>>
> >>>>> That leads me to question whether the assumption that CMA can be long-term pinned is actually valid.
> >>>> That depends on the user of CMA, yes for my scenario since it worked
> >>>> for the guest os. For common scenario such as the file/anon mapping,
> >>>> the page will be judged as unpinnable for long-term and be migrated
> >>>> out of CMA area.
> >>>
> >>> Your scenario and the common scenarios cannot be distinguished from the kernel API's perspective.
> >>> Even in common cases, the page may be in a non-LRU state temporarily, and in such situations,
> >>> pinning CMA can lead to bugs - we've encountered multiple issues because of this.
> >>>
> >>
> >> Right. We just disallow long-term pinning CMA pages, because we don't
> >> know who the real owner is that would be okay with long-term pinning them.
> >>
> >>>>>
> >>>>> In my opinion, it might be more appropriate to revert that commit 1aaf8c and instead ensure
> >>>>> that pKVM avoids using CMA for memory that requires long-term pinning through GUP?
> >>>> It is not a pkvm issue but a defect of applying FOLL_LONGTERM over
> >>>> non-LRU CMA pages.
> >>>
> >>> In include/linux/mm_types.h, the CMA should be migrated when FOLL_LONGTERM.
> >>>
> >>> * In the CMA case: long term pins in a CMA region would unnecessarily fragment
> >>> * that region. And so, CMA attempts to migrate the page before pinning, when
> >>> * FOLL_LONGTERM is specified.
> >>>
> >>> Given this, would it make sense to avoid using FOLL_LONGTERM in this code path?
> >>
> >> If something is unbounded in time, FOLL_LONGTERM is the right thing to use.
> >>
> >>>>>
> >>>>> Alternatively, instead of changing the current logic that prevents longterm GUP from pinning CMA,
> >>>>> it would be better to propose a new patch that specifically addresses the pKVM scenario, like adding new FOLL_flags?
> >>>> I don't think so. pin_user_pages is an exported API which can't make
> >>>> assumptions over the caller.
> >>>
> >>> My point is not to base the patch on assumptions about the caller,
> >>> but to define a clear mechanism that ensures safe behavior in the intended scenario.
> >>>
> >>> For example, you can add FOLL_NO_MIGRATION and skip migrating unpinnable pages.
> >>
> >> Not sure which exact semantics you have in mind. But failing if we would
> >> have to migrate might be ok. Not sure if the caller should worry about
> >> that, though: the caller should not have to worry about page placement
> >> in general.
> > After going over the whole thread, I think the root cause is that
> > collect_longterm_unpinnable_folios() hit the race window between
> > lru_add_drain_all() and folio_isolate_lru() by chance and returned
> > with ret=0, which finally left the CMA page pinned, right? However, I
> > find the proposed patch below will fail the PKVM
> > scenario (FOLL_LONGTERM set with non-LRU CMA pages) again, as the CMA
> > pages never go to LRU, which will have __gup_longterm_locked loop
> > in do while (ret == -EAGAIN) as it did before 1aaf8c. I think the key
> > point is to find a way to distinguish the temporary (on the way to LRU)
> > and permanent CMA pages within collect_longterm_unpinnable_folios.
> >
> > static long
> > check_and_migrate_movable_pages_or_folios(struct pages_or_folios *pofs)
> > {
> > +	bool any_unpinnable;
> >  	LIST_HEAD(movable_folio_list);
> >
> > -	collect_longterm_unpinnable_folios(&movable_folio_list, pofs);
> > -	if (list_empty(&movable_folio_list))
> > -		return 0;
> > +	any_unpinnable = collect_longterm_unpinnable_folios(&movable_folio_list, pofs);
> > +	if (list_empty(&movable_folio_list)) {
> > +		if (any_unpinnable)
> > +			pofs_unpin(pofs);
> > +		return any_unpinnable ? -EAGAIN : 0;
> > +	}
> >
>
> So, what's the status of that? We should fix it upstream (*not* caring
> about controversial out-of-tree pkvm issues).

Leaving aside the pkvm issue, we should also care about CMA pages that
a special driver maps into a VM and that are intended to be long-term
pinned (they are actually fetched by cma_alloc and then mapped into the
VM, instead of coming from alloc_pages during a normal page fault).
Could we distinguish them with the patch below, based on 1aaf8c122?
That is, pages of this kind are not in the page cache and have a
refcount equal to their mapcount.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf55206935c4..1ae251cd194a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2099,7 +2099,13 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
 #ifdef CONFIG_CMA
 	int mt = folio_migratetype(folio);

-	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
+	/*
+	 * CMA pages mapping to VM by special driver may not on page cache which has NULL folio->mapping and equal
+	 * refcnt to mapcount
+	 */
+	if ((mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE) &&
+	    (folio->mapping != NULL) &&
+	    (folio_ref_count(folio) != folio_mapcount(folio)))
 		return false;

>
> --
> Cheers,
>
> David / dhildenb
>
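
For anyone who wants to experiment with the condition outside the kernel,
here is a minimal user-space sketch of the CMA branch in the diff above.
It is only a model under stated assumptions: struct model_folio, the
MODEL_MIGRATE_* values and both function names are hypothetical stand-ins,
not kernel APIs, and the other checks in folio_is_longterm_pinnable() are
left out. It models two readings. Read literally, the condition in the
diff exempts a CMA folio when either test holds (NULL mapping, or refcount
equal to mapcount), while the description above asks for both to hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's migratetype values. */
enum model_migratetype {
	MODEL_MIGRATE_MOVABLE,
	MODEL_MIGRATE_CMA,
	MODEL_MIGRATE_ISOLATE,
};

/* Hypothetical, simplified folio: only the fields the proposed check reads. */
struct model_folio {
	enum model_migratetype mt;
	const void *mapping;	/* NULL models "not in the page cache" */
	int refcount;
	int mapcount;
};

static bool model_in_cma(const struct model_folio *folio)
{
	return folio->mt == MODEL_MIGRATE_CMA ||
	       folio->mt == MODEL_MIGRATE_ISOLATE;
}

/* Condition as written in the diff: exempt when either test holds. */
bool model_pinnable_as_written(const struct model_folio *folio)
{
	assert(folio != NULL);
	if (model_in_cma(folio) && folio->mapping != NULL &&
	    folio->refcount != folio->mapcount)
		return false;
	return true;
}

/* Condition as described in the prose: exempt only when both tests hold. */
bool model_pinnable_as_described(const struct model_folio *folio)
{
	assert(folio != NULL);
	if (model_in_cma(folio) &&
	    (folio->mapping != NULL || folio->refcount != folio->mapcount))
		return false;
	return true;
}
```

The two readings differ only for a CMA folio that passes exactly one of
the tests, for example a NULL mapping with an extra reference: the first
function reports it as pinnable and the second does not. Which behaviour
is intended is a question for the patch, not something this model decides.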