From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C6994E743CA for ; Thu, 28 Sep 2023 22:49:08 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E1DC38D002C; Thu, 28 Sep 2023 18:49:07 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id DCDFD8D0002; Thu, 28 Sep 2023 18:49:07 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C6ECD8D002C; Thu, 28 Sep 2023 18:49:07 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id B4CED8D0002 for ; Thu, 28 Sep 2023 18:49:07 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 74D7F12026F for ; Thu, 28 Sep 2023 22:49:07 +0000 (UTC) X-FDA: 81287498334.23.79DEE1B Received: from mail-ed1-f51.google.com (mail-ed1-f51.google.com [209.85.208.51]) by imf19.hostedemail.com (Postfix) with ESMTP id 9195E1A0028 for ; Thu, 28 Sep 2023 22:49:05 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=yeZj2YEl; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf19.hostedemail.com: domain of zokeefe@google.com designates 209.85.208.51 as permitted sender) smtp.mailfrom=zokeefe@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1695941345; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=aH1OIK0gWyk1N4xaqASo2H5s560h9u2ewBbOcfS8pdg=; b=Rpxxqu+vuIuvp/nihI4Nq7uKMqjJg+bzt3YPO34GYzHNQ8Zb2pRQ3cYMeUZ+X/v708VrSF euwBYaYB3QxrKYiJ3tx3aoTupngDDVssvX8tIriXIa5qjLjA4WiR/zQds5+FPeYdp2pU4l F0UQ1YYL6vjMwOaEUCfa0KtR97rgeTc= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=yeZj2YEl; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf19.hostedemail.com: domain of zokeefe@google.com designates 209.85.208.51 as permitted sender) smtp.mailfrom=zokeefe@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1695941345; a=rsa-sha256; cv=none; b=mxXCp0lSpdsnWISOvjKp+H1voF/f52u97sUxZ6WUyzZvtb9z7LFq+wFjjHg4xixLAXpCPM oUdrzV5Wy4YWCAt/F2lgrHxkwlKfFdAmz8SYxBKeRSD+V167fvH0FabduUd7K3kbhh2uje 4dgpxgML4xvd2vk12IsRi62kCZSQ20g= Received: by mail-ed1-f51.google.com with SMTP id 4fb4d7f45d1cf-536ef8a7dcdso3744a12.0 for ; Thu, 28 Sep 2023 15:49:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1695941344; x=1696546144; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=aH1OIK0gWyk1N4xaqASo2H5s560h9u2ewBbOcfS8pdg=; b=yeZj2YElmc1bnXy212dicJea8QYdtF1jDLlYFlvaDvz88kQXIxxhS9V23T10YRzKog MuPCEesx5oRGjh2addgc50S4zree8JS3n7Ll6zPIIYl2qBenVoVz6NJy4uc2ISOn/yYg mf2LBb/KAnnLxLc1GtR5jQJcNR0VVlRNbFOig6FNi5p4VkA3p5I+hVT8zyw/0Qeg9jWU lhVPCIiJLlwlCPV6fAIof7rwyOVwy5EBrvBn171LxdHJCWlOzA96d63rjpvfoulaifUU yeFRnrq5WLIStnLq0sKI1C1B6U4uHscfGyuNwFZNhyu4e0mbMYoB3sBSdpWVoeMYwmKx NGMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1695941344; x=1696546144; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=aH1OIK0gWyk1N4xaqASo2H5s560h9u2ewBbOcfS8pdg=; b=wahYvolTOnC2+8KwcoAYCAZkbGeo4h+ZrDfWWVGzVheRQs4FvHVaTEiXxQyzB7TkN5 zbVzLs6DaMgkKMXXVsR1lXh79exXT77InkaqeGbqKpTkV2xKdCzthS5nIWotLIhQQyM9 QD2HdaHvdDTY7ZJPpWgGNKyPjyBgGU286J9g81T8k856JwPalFxoNaMLaPlKZDKJ9ixx Im+AwzPDuycwG5I+Ki0NrcDEHWxyCZic4UEkbl/hoHgV+RwZkCkhCPsw13/fa7VuGqzZ LP9QjmQbZUIFi1F5WgifWS8TUDkMRoZoj7SdpCoHJbMZGEJBw4YNOU5ChGc11SfAyh0z GUQw== X-Gm-Message-State: AOJu0YzgVPr5ImDcGmSAtSdvWzSNfQK5iSCpYg984HSu9vozKJzFBVBi 9+8VSv064yZaL/QBBo9lfNcCpEDBTL8dt2nDqgpyDQ== X-Google-Smtp-Source: AGHT+IEwgqOTiiLagyh/Eid9MJeL9VTSnkiJWAIbQ+ucpeAT1mLUqgG5yAYFQxQuyOWRkehH2swmIuDSqlUFnQMjTg8= X-Received: by 2002:a50:9fef:0:b0:523:b133:57fe with SMTP id c102-20020a509fef000000b00523b13357femr478875edf.1.1695941342775; Thu, 28 Sep 2023 15:49:02 -0700 (PDT) MIME-Version: 1.0 References: <4d6c9b19-cdbb-4a00-9a40-5ed5c36332e5@arm.com> <54e5accf-1a56-495a-a4f5-d57504bc2fc8@arm.com> <20230928210436.GG11456@frogsfrogsfrogs> In-Reply-To: <20230928210436.GG11456@frogsfrogsfrogs> From: "Zach O'Keefe" Date: Thu, 28 Sep 2023 15:48:25 -0700 Message-ID: Subject: Re: BUG: MADV_COLLAPSE doesn't work for XFS files] To: "Darrick J. Wong" Cc: Ryan Roberts , Bagas Sanjaya , Hugh Dickins , David Hildenbrand , Matthew Wilcox , Chandan Babu R , Linux Memory Management List , Linux XFS , Linux Kernel Mailing List , Yu Zhao Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 9195E1A0028 X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: 61mtuj5ro1rwxgqxa4mo5kmffjzgj6t5 X-HE-Tag: 1695941345-278348 X-HE-Meta: U2FsdGVkX182AIIjyDUOPjdL0qb6jct6Mkuab3agzj8yhortsYslRsZROw5oK0DH0lGWr/Go/JFD64VrllZcMiuBXcBSw35Db2GWXjR6cxdjIhrgp5Ngw8rqvFQIyGFgWUqr/64R9kzktN7hoevGlaqPVe9nwhoFvRLG2Cu5s9cJF8WVF3xh5TUnxIhHxLqNNuK1ppYO6iXLPbsHQADobEv3y4ux0FDKwy7YGRZ6DkpjrRU+UZbtScGObBYsmk2p1ou9ZRogk+ypUubgKpFBW9DAq76Gq3X5yHIpRW/DfhF+7iktjF1CCrhv4GgKNFQPo5GRDcrmWUHjBiAIIIs4xlQWQDPXpNRiQN/XREjRpGDiegjlzqimENM6cct7qYFQgzgLLjsyJ9xXkzzC3ukKBZe1MLkKASYi/KpQb+IKewM/43zUKffMYFdZqnrCuI6PRSALKO3HFPKNBLp6Y6z4te4CzbBJHa9RZALOzSWIFJOg4e9z8bdGzcSSI6TOBrHqYCSf6QWHHu2xvDg7wImUQiYjXe+BsVfwm4KocbPdbO9Zebu4YaMxsIAPbhWxKi5YO5XD0F1Y69ZHPodWxm433Xz3ku0uhywB9p2KzRSQ93ygUiCDhB1HuREeVrKC5R8X9gxuhtImBVowA89pvXVZNVMH0VhL4rI8zYAfC5+Ji8nmHkUTT5nDyIw9hbbdugZkCPj8Rg8RcY52Pc9mkqxQJcypaX0JaZ2EkTTftHhmj1yLqxFz/y6Mwy7qUum30c/u3wQF1snXzwf/f7C9qGFxTN5KWNVM6TKXLAQ0YhAY6/GG4veNfxXO8A2rShdMMffDQP0tWeO2icxe36m5lV46d3pLoJwNu3dUpZhOMAjamfrUtuAqAmZSqKDemBLhFRM+sYRXAYSFsMDnrcFqoGsMPwBqdnydlBKnEWYXzfVHIJs6x+YNhmO7ZuglTkdzMi/vhoJl+J7s+unOgQo445P 6m8i4Kv9 6qvS1WqMKEkkAQnTYSrU2BGOQ9FhuAB027ISTdwvmSkT6MspjTRs5Xhz/2PnvWaQn2bFlNuB+uuKzhDpJiChynXSdE+7GqgIwDgQu+IXWFkQjvv3kjO1dNhuv7igXRNEZGCQcVKq2KaawK0dMO5y8qbke83T0EjUiIpYE0tBKlcdc6gAzPiPRSAMl3DFgghR4U2GU6lDt8E0/N3LcttpxYqSAEryOpAotCVTn4sPjJoDmnDNio6W2neVjHPlJcmZbxBRlH+gabmc2EaXPW9j5RU2+iC6YqskSg/oQPKKFmSoYHqw0NNbPRnbJ76tNMsYALdLv3xqv+KRE/JnWeNPSrlX+BW0s3/vZOM7eSJCmjV6IyNwB5o9JrG678nL4Ee0SyaBIsdCjOQFUf2w= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Sep 28, 2023 at 2:04=E2=80=AFPM Darrick J. Wong = wrote: > > On Thu, Sep 28, 2023 at 12:43:57PM -0700, Zach O'Keefe wrote: > > Hey Ryan, > > > > Thanks for bringing this up. > > > > On Thu, Sep 28, 2023 at 4:59=E2=80=AFAM Ryan Roberts wrote: > > > > > > On 28/09/2023 11:54, Bagas Sanjaya wrote: > > > > On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote: > > > >> Hi all, > > > >> > > > >> I've just noticed that when applied to a file mapping for a file o= n xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the f= ile is on ext4. > > > >> > > > >> I think the root cause is that the implementation bails out if it = finds a (non-PMD-sized) large folio in the page cache for any part of the f= ile covered by the region. XFS does readahead into large folios so we hit t= his issue. See khugepaged.h:collapse_file(): > > > >> > > > >> if (PageTransCompound(page)) { > > > >> struct page *head =3D compound_head(page); > > > >> > > > >> result =3D compound_order(head) =3D=3D HPAGE_= PMD_ORDER && > > > >> head->index =3D=3D start > > > >> /* Maybe PMD-mapped */ > > > >> ? SCAN_PTE_MAPPED_HUGEPAGE > > > >> : SCAN_PAGE_COMPOUND; > > > >> goto out_unlock; > > > >> } > > > > > > > > Ya, non-PMD-sized THPs were just barely visible in my peripherals when > > writing this, and I'm still woefully behind on your work on them now > > (sorry!). > > > > I'd like to eventually make collapse (not just MADV_COLLAPSE, but > > khugepaged too) support arbitrary-sized large folios in general, but > > I'm very pressed for time right now. I think M. Wilcox is also > > interested in this, given he left the TODO to support it :P > > Is the point of MADV_COLLAPSE to replace base pages with PMD-sized pages > in the pagecache for faster lookups? Or merely to replace them with > something larger, even if it's not PMD sized? Might depend on who you ask, but IMHO, the principle purpose of collapse is saving TLB entries, with TLB coalescing complicating things a little in terms of PMD-sized things or not. M. Wilcox's work with descriptor-izing folios might make a nice case for memory savings as well, down the line. > As of 6.6, XFS asks for folios of size min(read/readahead/write_len, > ondisk_mapping_length), so in theory the folio size should roughly > follow the access patterns. If the goal is merely larger folios, then > we are done here and can move on to some other part of the collapse. > > OTOH if the goal is TLB savings, then I suppose you'd actually /want/ to > select a large (but not PMD) folio for collapsing into a PMD sized > folio, right? I suppose it might make some operations easier / faster during collapse if we have less folios to process. > e.g. > > if (PageTransCompound(page)) { > struct page *head =3D compound_head(page); > > if (head->index !=3D start) { > /* not sure what _COMPOUND means here... */ > result =3D SCAN_PAGE_COMPOUND; > goto out_unlock; > } > > if (compound_order(head) =3D=3D HPAGE_PMD_ORDER) { > result =3D SCAN_PTE_MAPPED_HUGEPAGE; > goto out_unlock; > } > > /* result is still SCAN_SUCCEED, keep going */ > } > > I /think/ that would work? If the largefolio is dirty or not fully > uptodate then collapse won't touch it; and I think fs/iomap handles this > in a compatible way because it won't mark the folio uptodate until all > the blocks have been read, and it marks the folio dirty if any of the > blocks are dirty. > > (says me, who doesn't really understand this part of the code.) I think there's a couple issues with this -- for example, the head->index !=3D start case is going to be the common-case for non-PMD-sized large folios. Regardless, there is some more code in hpage_collapse_scan_file() and her in collapse_file() that would need to be updated. I'm taking a cursory look, and naively it doesn't look too bad -- most things "should just work" in file/shmem collapse path. ac492b9c70cac ("mm/khugepaged: skip shmem with userfaultfd" was merged since the last I looked carefully at this path, so I would need to spend more time understanding some changes there. So, from correctness POV, maybe there's not anything drastic to be done for file/shmem. Maybe this is a good place to start. For anon, things are different, as we are coming at the pages from a different angle, and are operating over the pmd directly. I'm not immediately sure if it makes things easier or harder. Probably harder. Can we even get non-PMD-sized large anon folios right now, without Ryan's work? >From a khugepaged policy POV, there are some questions to be answered.. but I think these mostly boil down to: scale the present/swap/none checks by (1 << order). Anyways, this isn't to be taken with much weight as a thorough audit is required to understand any subtleties lurking around. Thanks, Zach > --D > > > Thank you for the reproducer though! I haven't run it, but I'll > > probably come back here to steal it when the time comes. > > > > > > I don't see any hint to -EINVAL above. Am I missing something? > > > > > > The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() wher= e it > > > eventually gets converted to -EINVAL by madvise_collapse_errno(). > > > > > > > > > > >> > > > >> I'm not sure if this is already a known issue? I don't have time t= o work on a fix for this right now, so thought I would highlight it at leas= t. I might get around to it at some point in the future if nobody else tack= les it. > > > > My guess is Q1 2024 is when I'd be able to look into this, at the > > current level of urgency. It doesn't sound like it's blocking anything > > for your work right now -- lmk if that changes though! > > > > Thanks, > > Zach > > > > > > > > > >> > > > >> Thanks, > > > >> Ryan > > > >> > > > >> > > > >> Test case I've been using: > > > >> > > > >> -->8-- > > > >> > > > >> #include > > > >> #include > > > >> #include > > > >> #include > > > >> #include > > > >> #include > > > >> #include > > > >> > > > >> #ifndef MADV_COLLAPSE > > > >> #define MADV_COLLAPSE 25 > > > >> #endif > > > >> > > > >> #define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE);= } while (0) > > > >> > > > >> #define SZ_1K 1024 > > > >> #define SZ_1M (SZ_1K * SZ_1K) > > > >> #define ALIGN(val, align) (((val) + ((align) - 1)) & ~((align) = - 1)) > > > >> > > > >> #if 1 > > > >> // ext4 > > > >> #define DATA_FILE "/home/ubuntu/data.txt" > > > >> #else > > > >> // xfs > > > >> #define DATA_FILE "/boot/data.txt" > > > >> #endif > > > >> > > > >> int main(void) > > > >> { > > > >> int fd; > > > >> char *mem; > > > >> int ret; > > > >> > > > >> fd =3D open(DATA_FILE, O_RDONLY); > > > >> if (fd =3D=3D -1) > > > >> handle_error("open"); > > > >> > > > >> mem =3D mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIV= ATE, fd, 0); > > > >> close(fd); > > > >> if (mem =3D=3D MAP_FAILED) > > > >> handle_error("mmap"); > > > >> > > > >> printf("1: pid=3D%d, mem=3D%p\n", getpid(), mem); > > > >> getchar(); > > > >> > > > >> mem =3D (char *)ALIGN((unsigned long)mem, SZ_1M * 2); > > > >> ret =3D madvise(mem, SZ_1M * 2, MADV_COLLAPSE); > > > >> if (ret) > > > >> handle_error("madvise"); > > > >> > > > >> printf("2: pid=3D%d, mem=3D%p\n", getpid(), mem); > > > >> getchar(); > > > >> > > > >> return 0; > > > >> } > > > >> > > > >> -->8-- > > > >> > > > > > > > > Confused... > > > > > > This is a user space test case that shows the problem; data.txt needs= to be at > > > least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '= #if 1' to > > > 0, you can see the different behaviours for ext4 and xfs - > > > handle_error("madvise") fires with EINVAL in the xfs case. The getcha= r()s are > > > leftovers from me looking at the smaps file. > > >