From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 3 Jan 2025 11:26:04 -0500
From: Peter Xu <peterx@redhat.com>
To: Ackerley Tng
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, riel@surriel.com,
 leitao@debian.org, akpm@linux-foundation.org, muchun.song@linux.dev,
 osalvador@suse.de, roman.gushchin@linux.dev, nao.horiguchi@gmail.com,
 stable@vger.kernel.org
Subject: Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Fri, Dec 27, 2024 at 11:15:44PM +0000, Ackerley Tng wrote:
> Ackerley Tng writes:
>
> >
> >
> > I'll go over the rest of your patches and dig into the meaning of
> > `avoid_reserve`.
>
> Yes, after looking into this more deeply, I agree that avoid_reserve
> means avoiding the reservations in the resv_map rather than reservations
> in the subpool or hstate.
>
> Here's more detail of what's going on in the reproducer that I wrote as I
> reviewed Peter's patch:
>
> 1. On fallocate(), allocate page A
> 2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
> 3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap
>    request was MAP_PRIVATE
> 4. On fork, prep for COW by marking page as read only. Both parent and child
>    share B.
> 5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
>    + B belongs to the child, C belongs to the parent
>    + C is owned by the parent
> 6. Child exits, B is freed
> 7. On munmap(), C is freed
> 8. On unlink(), A is freed
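
[Side note: the eight steps above correspond roughly to the userspace sequence
below. This is only a sketch -- the hugetlbfs mount point and the 2MB huge
page size are assumptions, and it is not the exact reproducer attached to the
patch.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SZ (2UL << 20)	/* assumed 2MB huge page size */

int main(void)
{
	const char *path = "/mnt/hugetlbfs/repro";	/* assumed mount point */
	int fd = open(path, O_CREAT | O_RDWR, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* 1. fallocate() allocates page A (and takes a subpool reservation). */
	if (fallocate(fd, 0, 0, HPAGE_SZ)) { perror("fallocate"); return 1; }

	/* 2. MAP_PRIVATE mapping, so the vma has no VM_MAYSHARE. */
	char *buf = mmap(NULL, HPAGE_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, fd, 0);
	if (buf == MAP_FAILED) { perror("mmap"); return 1; }

	/* 3. First write fault: allocate private page B, copy A into it. */
	*buf = 1;

	/* 4. fork(): B is marked read-only and shared by parent and child. */
	pid_t pid = fork();
	if (pid == 0) {
		pause();	/* keep B shared until the parent has written */
		_exit(0);
	}

	/* 5. Parent write fault: COW allocates page C, copies B into it. */
	*buf = 2;

	/* 6. Let the child exit; B is freed along with it. */
	kill(pid, SIGTERM);
	waitpid(pid, NULL, 0);

	/* 7. munmap() frees C. */
	munmap(buf, HPAGE_SZ);

	/* 8. unlink() drops the file, freeing A. */
	close(fd);
	unlink(path);
	return 0;
}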
>
> When C was allocated in the parent (owns MAP_PRIVATE page, doing a copy on
> write), spool->rsv_hpages was decreased but h->resv_huge_pages was not. This
> is the root of the bug.
>
> We should decrement h->resv_huge_pages if a reserved page from the subpool
> was used, instead of whether avoid_reserve or vma_has_reserves() is set. If
> avoid_reserve is set, the subpool shouldn't be checked for a reservation, so
> we won't be decrementing h->resv_huge_pages anyway.
>
> I agree with Peter's fix as a whole (the entire patch series).
>
> Reviewed-by: Ackerley Tng
> Tested-by: Ackerley Tng
>
> ---
>
> Some definitions which might be helpful:
>
> + h->resv_huge_pages indicates the number of reserved pages globally.
>   + This number increases when pages are reserved.
>   + This number decreases when reserved pages are allocated, or when pages
>     are unreserved.
> + spool->rsv_hpages indicates the number of reserved pages in this subpool.
>   + This number increases when pages are reserved.
>   + This number decreases when reserved pages are allocated, or when pages
>     are unreserved.
> + h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.

I think you're correct.  One add-on comment: I think when taking vma
reservations into account, the global reservation should be the sum of all
spools' and all vmas' reservations.

>
> More details on the flow in alloc_hugetlb_folio() which might be helpful:
>
> hugepage_subpool_get_pages() returns "the number of pages by which the global
> pools must be adjusted (upward)". This return value is never negative other
> than errors (hugepage_subpool_get_pages() always gets called with a positive
> delta).
>
> Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1
> (other than errors).
>
> If the return value is 0, the subpool had enough reservations and so we
> should decrement h->resv_huge_pages.
>
> If the return value is 1, it means that this subpool did not have any more
> reserved hugepages, and we need to get a page from the global hstate.
> dequeue_hugetlb_folio_vma() will get us a page that was already allocated.
>
> In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for
> 1 page, and there are no available_huge_pages() left, we quit dequeueing
> since we will need to allocate a new page. If we want to avoid_reserve, that
> means we don't want to use the vma's reserves in resv_map, so we also check
> available_huge_pages(). If there are available_huge_pages(), we go on to
> dequeue a page.
>
> Then, we determine whether to decrement h->resv_huge_pages. We should
> decrement if a reserved page from the subpool was used, instead of whether
> avoid_reserve or vma_has_reserves() is set.
>
> In the case where a surplus page needs to be allocated, the surplus page
> isn't and doesn't need to be associated with a subpool, so no subpool
> hugepage number tracking updates are required. h->resv_huge_pages still has
> to be updated... is this where h->resv_huge_pages can go negative?
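
[Side note: the decrement decision described above can be modelled with a tiny
standalone program. This is only a sketch of that reading -- the structures
below are stripped-down stand-ins, not the kernel's, and this is not the
actual alloc_hugetlb_folio() code.]

#include <assert.h>
#include <stdio.h>

struct subpool { long rsv_hpages; };		/* stands in for spool->rsv_hpages */
struct hstate { long resv_huge_pages; };	/* stands in for h->resv_huge_pages */

/*
 * Models the contract quoted above: return how much the global pool must be
 * adjusted -- 0 if the subpool held a reservation for this page, 1 if the
 * global hstate has to provide it.
 */
static long subpool_get_page(struct subpool *spool)
{
	if (spool->rsv_hpages > 0) {
		spool->rsv_hpages--;
		return 0;
	}
	return 1;
}

static void alloc_one_page(struct hstate *h, struct subpool *spool)
{
	long gbl_chg = subpool_get_page(spool);

	/*
	 * The point of the fix: decrement the global reserve count when the
	 * subpool consumed a reservation (gbl_chg == 0), not based on
	 * avoid_reserve or vma_has_reserves().
	 */
	if (gbl_chg == 0)
		h->resv_huge_pages--;
}

int main(void)
{
	struct hstate h = { .resv_huge_pages = 1 };
	struct subpool spool = { .rsv_hpages = 1 };

	alloc_one_page(&h, &spool);

	/* Both counters move together, so the global count does not leak. */
	assert(h.resv_huge_pages == 0 && spool.rsv_hpages == 0);
	printf("resv_huge_pages=%ld rsv_hpages=%ld\n",
	       h.resv_huge_pages, spool.rsv_hpages);
	return 0;
}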
This question doesn't sound relevant to the specific scenario that this patch
(or the reproducer attached to the patch) is about.  In the reproducer for
this patch, no surplus pages need to be involved.

Going back to the question you're asking - I don't think resv_huge_pages will
go negative for the surplus case?  IIUC, updating resv_huge_pages is the
correct behavior even for surplus pages, as long as gbl_chg==0.  The initial
change was done by Naoya in commit a88c76954804 ("mm: hugetlb: fix hugepage
memory leak caused by wrong reserve count"); there is more information in
that commit log.

In general, gbl_chg==0 means we consumed a global reservation, either in the
vma or in the spool, so it must be accounted globally after the folio is
successfully allocated.  Here "being accounted" means the global resv count
will be properly decremented.

Thanks for taking a look, Ackerley!

-- 
Peter Xu