References: <20240304103757.235352-1-21cnbao@gmail.com>
 <706b7129-85f6-4470-9fd9-f955a8e6bd7c@arm.com>
 <0a644230-f7a8-4091-9d00-ded6c8c3fc19@arm.com>
In-Reply-To: <0a644230-f7a8-4091-9d00-ded6c8c3fc19@arm.com>
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 5 Mar 2024 22:15:05 +1300
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
To: Ryan Roberts
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, david@redhat.com,
 chrisl@kernel.org, yuzhao@google.com, hanchuanhua@oppo.com,
 linux-kernel@vger.kernel.org, willy@infradead.org, ying.huang@intel.com,
 xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com,
 wangkefeng.wang@huawei.com, Barry Song, Hugh Dickins

On Tue, Mar 5, 2024 at 10:11 PM Ryan Roberts wrote:
>
> On 05/03/2024 09:08, Barry Song wrote:
> > On Tue, Mar 5, 2024 at 9:54 PM Ryan Roberts wrote:
> >>
> >> On 04/03/2024 21:57, Barry Song wrote:
> >>> On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts wrote:
> >>>>
> >>>> Hi Barry,
> >>>>
> >>>> On 04/03/2024 10:37, Barry Song wrote:
> >>>>> From: Barry Song
> >>>>>
> >>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>>> PTEs modification such as break-before-make, while iterating PTEs
> >>>>> of a large folio, it will only begin to acquire PTL after it gets
> >>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>>>> in try_to_unmap_one().
> >>>>
> >>>> I just want to check my understanding here - I think the problem occurs for
> >>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>>> that I've had a look at the code and have a better understanding, I think that
> >>>> must be the case? And therefore this problem exists independently of my work to
> >>>> support swap-out of mTHP? (From your previous report I was under the impression
> >>>> that it only affected mTHP).
> >>>
> >>> I think this affects all large folios with PTEs entries more than 1. but hugeTLB
> >>> is handled as a whole in try_to_unmap_one and its rmap is removed all
> >>> together, i feel hugeTLB doesn't have this problem.
> >>>
> >>>>
> >>>> Its just that the problem is becoming more pronounced because with mTHP,
> >>>> PTE-mapped large folios are much more common?
> >>>
> >>> right. as now large folios become a more common case, and it is my case
> >>> running in millions of phones.
> >>>
> >>> BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim
> >>> all PTEs all together rather than iterating PTEs one by one. This will improve
> >>> performance. for example, a batched
> >>> set_ptes_to_swap_entries()
> >>> {
> >>> }
> >>> then we only need to loop once for a large folio, right now we are looping
> >>> nr_pages times.
> >>
> >> You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch
> >> implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes
> >> things a bit easier too, whereas in the regular mm, they are now a variable size.
> >>
> >> David and I introduced folio_pte_batch() to help gather batches of ptes, and it
> >> uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding
> >> swap_pte_batch() which does a similar thing for swap entry batching in v4 of my
> >> swap-out series.
> >>
> >> For your set_ptes_to_swap_entries() example, I'm not sure what it would do other
> >> than loop over the PTEs setting an incremented swap entry to each one? How is
> >> that more performant?
> >
> > right now, while (page_vma_mapped_walk(&pvmw)) will loop nr_pages for each
> > PTE, if each PTE, we do lots of checks within the loop.
> >
> > by implementing set_ptes_to_swap_entries(), we can iterate once for
> > page_vma_mapped_walk(), after folio_pte_batch() has confirmed
> > the large folio is completely mapped, we set nr_pages swap entries
> > all together.
> >
> > we are replacing
> >
> > for(i=0;i<nr_pages;i++)
> > {
> >         lots of checks;
> >         clear PTEn
> >         set PTEn to swap
> > }
>
> OK so you are effectively hoisting "lots of checks" out of the loop?

No. page_vma_mapped_walk() returns nr_pages times, and we do the same
checks each time. Each time, we also do a tlbi and set one PTE.

>
> >
> > by
> >
> > if (large folio && folio_pte_batch() == nr_pages)
> >         set_ptes_to_swap_entries().

For this, we do the checks only once, and we do far fewer tlbi.

>
> >
> >>
> >
> > Thanks,
> > Ryan

Thanks
Barry
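
The difference between the two loops above can be illustrated with a toy
userspace model. This is only a sketch under stated assumptions, not kernel
code: struct toy_pte, struct toy_stats, toy_pte_batch(), unmap_per_pte() and
unmap_batched() are hypothetical names invented for this example.
toy_pte_batch() merely counts leading present entries (the real
folio_pte_batch() checks considerably more), and the counters only stand in
for the cost of the checks and of the tlbi operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a "PTE" is either present or holds a swap offset. */
struct toy_pte {
	bool present;
	unsigned long swap_offset;	/* meaningful only when !present */
};

/* Counters standing in for the costs discussed in the thread. */
struct toy_stats {
	unsigned int checks;		/* rounds of "lots of checks" */
	unsigned int tlb_flushes;	/* stand-in for tlbi operations */
};

/* Per-PTE path: every PTE pays for its own checks and its own flush. */
static void unmap_per_pte(struct toy_pte *ptes, size_t nr_pages,
			  unsigned long first_swap, struct toy_stats *st)
{
	assert(ptes != NULL && st != NULL);
	for (size_t i = 0; i < nr_pages; i++) {
		st->checks++;				/* lots of checks */
		ptes[i].present = false;		/* clear PTEn */
		st->tlb_flushes++;			/* one flush per PTE */
		ptes[i].swap_offset = first_swap + i;	/* set PTEn to swap */
	}
}

/* Hypothetical stand-in for folio_pte_batch(): count leading present PTEs. */
static size_t toy_pte_batch(const struct toy_pte *ptes, size_t nr_pages)
{
	size_t n = 0;

	while (n < nr_pages && ptes[n].present)
		n++;
	return n;
}

/*
 * Batched path: checks run once per folio and one flush covers the range.
 * Returns false without touching anything if the folio is not fully mapped,
 * in which case a caller would fall back to the per-PTE path.
 */
static bool unmap_batched(struct toy_pte *ptes, size_t nr_pages,
			  unsigned long first_swap, struct toy_stats *st)
{
	assert(ptes != NULL && st != NULL);
	st->checks++;
	if (toy_pte_batch(ptes, nr_pages) != nr_pages)
		return false;

	for (size_t i = 0; i < nr_pages; i++)
		ptes[i].present = false;		/* clear all PTEs */
	st->tlb_flushes++;				/* single ranged flush */
	for (size_t i = 0; i < nr_pages; i++)
		ptes[i].swap_offset = first_swap + i;	/* incremented entries */
	return true;
}
```

Note that the batched path still loops over the PTEs to write the swap
entries, as Ryan points out. In this model, what is saved is the repeated
checks and the per-PTE flush, not the loop itself.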