From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B9214E743C5 for ; Thu, 28 Sep 2023 21:04:41 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C6CF36B0087; Thu, 28 Sep 2023 17:04:40 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C1B126B0092; Thu, 28 Sep 2023 17:04:40 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id ABBF76B0095; Thu, 28 Sep 2023 17:04:40 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 95BD66B0087 for ; Thu, 28 Sep 2023 17:04:40 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 493E980EE6 for ; Thu, 28 Sep 2023 21:04:40 +0000 (UTC) X-FDA: 81287235120.05.4A72643 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf16.hostedemail.com (Postfix) with ESMTP id 7E1B818001B for ; Thu, 28 Sep 2023 21:04:38 +0000 (UTC) Authentication-Results: imf16.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=GifZoq+S; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf16.hostedemail.com: domain of djwong@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=djwong@kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1695935078; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=vvaGZz0S2mve5PqT9OEdKb9K0YGDOdRou1VXcHhq0gY=; b=nksaCkRol/n0LmqinW9S9NL4BP+s7J4XWDYjxeCCeN289KXKlRB8rpio8rn6ToeVpRcR9v eKK1reoG5e0ix8tkDvgVZX24OUemc+uSSBJ/OiNUSprHdM3VEWRd3U+xuZrwJ5VADHIAY9 /FxP6ga+G78AFKbGgBCAsUfOhxuZjOA= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=GifZoq+S; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf16.hostedemail.com: domain of djwong@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=djwong@kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1695935078; a=rsa-sha256; cv=none; b=L9q4/KmEgXvs801LPvpJ7ZZJO/NnDYlUf3yluNP9a7tIVreDuz0pBaeeiUgABuOIDPdUp7 NKCpqlcMzmfHgOZG9RrAqtySvMTQ50ofFn2jjJCnjnVL6NFZvBb4QB34brPl2xYDRvSuPs x9bUykKBItu1Jd+lVUGAHRZBCZjZSqA= Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id 627C061C31; Thu, 28 Sep 2023 21:04:37 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id EFA80C433C8; Thu, 28 Sep 2023 21:04:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1695935077; bh=M/5ZU8g6yA4315it+eEcyz6qGTDOsNSnU7hBp/ukGEI=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=GifZoq+S91dlVSa8ba0LZTxOnoK6itojp7KAx0aHws8HEpKAazauz/F7V6ey4j4aZ ReTk2p58whVPBqxUi1p8nECtD32BcYK73OW0ypmbynywHszkqFvjSs9Oanu2jEsaJf G2yVpCNxUtxC8A1bekl8OOXosCFSPGeKVbuU8aa2DjyYdH093YRAdDshO9jwS33at2 oxSPfIWR3UiZOcRFOkd9hnM9jS7+/E7fgt2pcbQ0kh3te4uvyOrY27twx+VsJBFGTG /48cAHA4N/1KsXPkSE+TxA7p91cxibX7Ttw+GQgRBcX8caNe3WmJLWqoObJ0VsYqM2 hLjyiaOnF9dVg== Date: Thu, 28 Sep 2023 14:04:36 -0700 From: "Darrick J. Wong" To: Zach O'Keefe Cc: Ryan Roberts , Bagas Sanjaya , Hugh Dickins , David Hildenbrand , Matthew Wilcox , Chandan Babu R , Linux Memory Management List , Linux XFS , Linux Kernel Mailing List , Yu Zhao Subject: Re: BUG: MADV_COLLAPSE doesn't work for XFS files] Message-ID: <20230928210436.GG11456@frogsfrogsfrogs> References: <4d6c9b19-cdbb-4a00-9a40-5ed5c36332e5@arm.com> <54e5accf-1a56-495a-a4f5-d57504bc2fc8@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 7E1B818001B X-Stat-Signature: x1cyr8kqmb7hrxgcizrybbtnz3egyc3i X-Rspam-User: X-HE-Tag: 1695935078-131309 X-HE-Meta: U2FsdGVkX1/tPARGDkHgBM7VgAwzA6+GtnQclb7Uc9hKV3xryzvAWhUYJaQPj8oqJ3NsNB0jUqQlSIn8tr6ScZJxQrbseIYDNxDRGXPSGjbc8vFWa1Mp3w40qWS91qLn9miQTCfS/XPeZfGUoIr9V5Rqtv9JPwPGTEdGy41S4wQDMbvQtlyBNAKqZLWQbnv8X244qo7HrIojy43n6D8YmE/SrZ8xwW/yIej6I1hybkIMUb/Ezww9/xyNElRSV7dkzGZcTMrMLwLpSYIN8S5pUh9qPUcpRf0s7ptMTF+Qd3CEjAd2NUBW7B7GXn/yfVcgRukeRul/J3MMzQ1nm5XA5pMBrHu/KetzYnhyZiQgUnLzAOqjqJqQDsEL6T8CPmXwHO5iqoE4h5hSlAWUq44lUYS8bM5S579Oi9xxrS1mvlaa1h+Me/0Ie8yVx74GI38q1f5OoICd9jc8RBBAOAWfp9rk3sdlFA6PQUgqTKEYNklmnVZdK2wibrmUmDMawFrvfDvqApqLvo5oYneK13nfBeoe01zsI5M4MgLLNZA2lNjnyeBGr8wFuF4nutBWO3xPNLr2Dhn0FOPwCGdDjjyLXh6cy3qjNUQeduqgfmsRpJ0MMpSap1HVOcvor23+7XSZnKkHt41jRlpRFlF2TvFzFUErlG0z+OPNmlsrAeCW4le6xcKXzj5VXXjvJhu29SZ833xMCKqRRbOxQeBAVNJPeAJGUnTlBLIu2ApI4lS8+MMgH4jaEiicGYQB4tGsGrNNPn0nx2xJFaGJU7ScATBObui7agq0aG5qEFy/tcl17Zq9zKXdCEVvqsPAWzYdwizj03hNk7BGYVCn5zC6PIQj6UOemOHRGEG30PL4IziIxvKOO2ego+DusUgDdXptv69RrmX8IdQcb8ujSv48F8JyWw/RhBJdBLpkTCX1jx+B06ZySsqbMjnGNxjHisK8zMMoA/ogc1hkDDeXuAIkqkv w+weLcgg u/MWIpSj/NQLyLa5i7/h5d93rTcEDSmYHh4Ye4V4un+oVkMx5HhGE0V4RfV2UcQOQT3u4XBEvxckcFqCpashJGtNCdARbDse4l3wMKnQcPum79XE2AoMxXcoVWsIcyujiGMw+UFnuHG2kbi40+P8g/jibDQ99ZJCegtsB0jj1LjZL/L1lRhp1gys+rrmHOQkLo5jZ6Gwb5pxHgShUpSZcaMQz1CnbAGTaZrGgos7+9oVPgsYk84++hOI+X0UG6Bdfr+XXIr/Ic9ewZ0Jte6Boswnsa67KmTtPu8RYq5v2NOgx1og= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Sep 28, 2023 at 12:43:57PM -0700, Zach O'Keefe wrote: > Hey Ryan, > > Thanks for bringing this up. > > On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts wrote: > > > > On 28/09/2023 11:54, Bagas Sanjaya wrote: > > > On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote: > > >> Hi all, > > >> > > >> I've just noticed that when applied to a file mapping for a file on xfs, MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is on ext4. > > >> > > >> I think the root cause is that the implementation bails out if it finds a (non-PMD-sized) large folio in the page cache for any part of the file covered by the region. XFS does readahead into large folios so we hit this issue. See khugepaged.h:collapse_file(): > > >> > > >> if (PageTransCompound(page)) { > > >> struct page *head = compound_head(page); > > >> > > >> result = compound_order(head) == HPAGE_PMD_ORDER && > > >> head->index == start > > >> /* Maybe PMD-mapped */ > > >> ? SCAN_PTE_MAPPED_HUGEPAGE > > >> : SCAN_PAGE_COMPOUND; > > >> goto out_unlock; > > >> } > > > > > Ya, non-PMD-sized THPs were just barely visible in my peripherals when > writing this, and I'm still woefully behind on your work on them now > (sorry!). > > I'd like to eventually make collapse (not just MADV_COLLAPSE, but > khugepaged too) support arbitrary-sized large folios in general, but > I'm very pressed for time right now. I think M. Wilcox is also > interested in this, given he left the TODO to support it :P Is the point of MADV_COLLAPSE to replace base pages with PMD-sized pages in the pagecache for faster lookups? Or merely to replace them with something larger, even if it's not PMD sized? As of 6.6, XFS asks for folios of size min(read/readahead/write_len, ondisk_mapping_length), so in theory the folio size should roughly follow the access patterns. If the goal is merely larger folios, then we are done here and can move on to some other part of the collapse. OTOH if the goal is TLB savings, then I suppose you'd actually /want/ to select a large (but not PMD) folio for collapsing into a PMD sized folio, right? e.g. if (PageTransCompound(page)) { struct page *head = compound_head(page); if (head->index != start) { /* not sure what _COMPOUND means here... */ result = SCAN_PAGE_COMPOUND; goto out_unlock; } if (compound_order(head) == HPAGE_PMD_ORDER) { result = SCAN_PTE_MAPPED_HUGEPAGE; goto out_unlock; } /* result is still SCAN_SUCCEED, keep going */ } I /think/ that would work? If the largefolio is dirty or not fully uptodate then collapse won't touch it; and I think fs/iomap handles this in a compatible way because it won't mark the folio uptodate until all the blocks have been read, and it marks the folio dirty if any of the blocks are dirty. (says me, who doesn't really understand this part of the code.) --D > Thank you for the reproducer though! I haven't run it, but I'll > probably come back here to steal it when the time comes. > > > > I don't see any hint to -EINVAL above. Am I missing something? > > > > The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it > > eventually gets converted to -EINVAL by madvise_collapse_errno(). > > > > > > > >> > > >> I'm not sure if this is already a known issue? I don't have time to work on a fix for this right now, so thought I would highlight it at least. I might get around to it at some point in the future if nobody else tackles it. > > My guess is Q1 2024 is when I'd be able to look into this, at the > current level of urgency. It doesn't sound like it's blocking anything > for your work right now -- lmk if that changes though! > > Thanks, > Zach > > > > > >> > > >> Thanks, > > >> Ryan > > >> > > >> > > >> Test case I've been using: > > >> > > >> -->8-- > > >> > > >> #include > > >> #include > > >> #include > > >> #include > > >> #include > > >> #include > > >> #include > > >> > > >> #ifndef MADV_COLLAPSE > > >> #define MADV_COLLAPSE 25 > > >> #endif > > >> > > >> #define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0) > > >> > > >> #define SZ_1K 1024 > > >> #define SZ_1M (SZ_1K * SZ_1K) > > >> #define ALIGN(val, align) (((val) + ((align) - 1)) & ~((align) - 1)) > > >> > > >> #if 1 > > >> // ext4 > > >> #define DATA_FILE "/home/ubuntu/data.txt" > > >> #else > > >> // xfs > > >> #define DATA_FILE "/boot/data.txt" > > >> #endif > > >> > > >> int main(void) > > >> { > > >> int fd; > > >> char *mem; > > >> int ret; > > >> > > >> fd = open(DATA_FILE, O_RDONLY); > > >> if (fd == -1) > > >> handle_error("open"); > > >> > > >> mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0); > > >> close(fd); > > >> if (mem == MAP_FAILED) > > >> handle_error("mmap"); > > >> > > >> printf("1: pid=%d, mem=%p\n", getpid(), mem); > > >> getchar(); > > >> > > >> mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2); > > >> ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE); > > >> if (ret) > > >> handle_error("madvise"); > > >> > > >> printf("2: pid=%d, mem=%p\n", getpid(), mem); > > >> getchar(); > > >> > > >> return 0; > > >> } > > >> > > >> -->8-- > > >> > > > > > > Confused... > > > > This is a user space test case that shows the problem; data.txt needs to be at > > least 4MB and on a mounted ext4 and xfs filesystem. By toggling the '#if 1' to > > 0, you can see the different behaviours for ext4 and xfs - > > handle_error("madvise") fires with EINVAL in the xfs case. The getchar()s are > > leftovers from me looking at the smaps file. > >