Date: Fri, 27 Dec 2024 23:15:44 +0000
In-Reply-To: (message from Ackerley Tng on Wed, 18 Dec 2024 14:33:43 +0000)
Subject: Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
From: Ackerley Tng
To: Ackerley Tng
Cc: peterx@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    riel@surriel.com, leitao@debian.org, akpm@linux-foundation.org,
    muchun.song@linux.dev, osalvador@suse.de, roman.gushchin@linux.dev,
    nao.horiguchi@gmail.com, stable@vger.kernel.org

Ackerley Tng writes:

> I'll go over the rest of your patches and dig
> into the meaning of `avoid_reserve`.

Yes, after looking into this more deeply, I agree that avoid_reserve
means avoiding the reservations in the resv_map rather than reservations
in the subpool or hstate.

Here's more detail of what's going on in the reproducer that I wrote as
I reviewed Peter's patch:

1. On fallocate(), allocate page A
2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was
   requested
3. On faulting *buf = 1, allocate a new page B, copy A to B because the
   mmap request was MAP_PRIVATE
4. On fork, prep for COW by marking page as read only. Both parent and
   child share B.
5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
    + B belongs to the child, C belongs to the parent
    + C is owned by the parent
6. Child exits, B is freed
7. On munmap(), C is freed
8. On unlink(), A is freed

When C was allocated in the parent (owns MAP_PRIVATE page, doing a copy
on write), spool->rsv_hpages was decreased but h->resv_huge_pages was
not. This is the root of the bug.

We should decrement h->resv_huge_pages if a reserved page from the
subpool was used, instead of whether avoid_reserve or
vma_has_reserves() is set. If avoid_reserve is set, the subpool
shouldn't be checked for a reservation, so we won't be decrementing
h->resv_huge_pages anyway.

I agree with Peter's fix as a whole (the entire patch series).

Reviewed-by: Ackerley Tng
Tested-by: Ackerley Tng

---

Some definitions which might be helpful:

+ h->resv_huge_pages indicates number of reserved pages globally.
    + This number increases when pages are reserved
    + This number decreases when reserved pages are allocated, or when
      pages are unreserved
+ spool->rsv_hpages indicates number of reserved pages in this subpool.
    + This number increases when pages are reserved
    + This number decreases when reserved pages are allocated, or when
      pages are unreserved
+ h->resv_huge_pages should be the sum of all subpools'
  spool->rsv_hpages.
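
To make the definitions above concrete, here is a minimal userspace
sketch of the two counters and the invariant between them. It is not
kernel code: struct toy_hstate, struct toy_subpool and all toy_*()
helpers are hypothetical names invented for illustration, and only the
two reservation counters are modelled. The also_update_global flag
exists only to reproduce the drift described earlier, where
spool->rsv_hpages was decreased but h->resv_huge_pages was not.

```c
#include <assert.h>

/* Toy stand-in for struct hstate: only the global reservation count. */
struct toy_hstate {
	long resv_huge_pages;
};

/* Toy stand-in for struct hugepage_subpool: only its reservation count. */
struct toy_subpool {
	struct toy_hstate *h;
	long rsv_hpages;
};

/* Reserving pages raises the subpool counter and the global counter. */
static void toy_reserve(struct toy_subpool *spool, long pages)
{
	spool->rsv_hpages += pages;
	spool->h->resv_huge_pages += pages;
}

/*
 * Allocating one reserved page lowers the subpool counter. Passing
 * also_update_global == 0 models the buggy path in which the global
 * counter is left untouched.
 */
static void toy_alloc_reserved(struct toy_subpool *spool,
			       int also_update_global)
{
	assert(spool->rsv_hpages > 0);	/* caller must hold a reservation */
	spool->rsv_hpages--;
	if (also_update_global)
		spool->h->resv_huge_pages--;
}

/*
 * Global count minus the sum over all subpools. This is 0 while the
 * invariant "h->resv_huge_pages == sum of spool->rsv_hpages" holds.
 */
static long toy_drift(const struct toy_hstate *h,
		      const struct toy_subpool *spools, int n)
{
	long sum = 0;

	for (int i = 0; i < n; i++)
		sum += spools[i].rsv_hpages;
	return h->resv_huge_pages - sum;
}
```

Calling toy_alloc_reserved() with also_update_global set keeps
toy_drift() at 0; calling it with the flag cleared leaves the global
count one higher than the sum of the subpools, which is the kind of
leaked reservation described for page C.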

More details on the flow in alloc_hugetlb_folio() which might be
helpful:

hugepage_subpool_get_pages() returns "the number of pages by which the
global pools must be adjusted (upward)". This return value is never
negative other than errors. (hugepage_subpool_get_pages() always gets
called with a positive delta).

Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1
(other than errors).

If the return value is 0, the subpool had enough reservations and so we
should decrement h->resv_huge_pages.

If the return value is 1, it means that this subpool did not have any
more reserved hugepages, and we need to get a page from the global
hstate. dequeue_hugetlb_folio_vma() will get us a page that was already
allocated.

In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves
for 1 page, and there are no available_huge_pages() left, we quit
dequeueing since we will need to allocate a new page. If we want to
avoid_reserve, that means we don't want to use the vma's reserves in
resv_map, we also check available_huge_pages(). If there are
available_huge_pages(), we go on to dequeue a page.

Then, we determine whether to decrement h->resv_huge_pages. We should
decrement if a reserved page from the subpool was used, instead of
whether avoid_reserve or vma_has_reserves() is set.

In the case where a surplus page needs to be allocated, the surplus page
isn't and doesn't need to be associated with a subpool, so no subpool
hugepage number tracking updates are required. h->resv_huge_pages still
has to be updated... is this where h->resv_huge_pages can go negative?
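
The decision described above for alloc_hugetlb_folio() can be sketched
in userspace C roughly as follows. This is a simplified model under
stated assumptions, not the kernel implementation: the sketch_* structs
and functions are hypothetical, only the "delta of 1" case is modelled,
errors are ignored, and the handling of avoid_reserve follows the
description in this email (the subpool is not consulted for a
reservation) rather than any particular kernel version.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins holding only the counters discussed here. */
struct sketch_hstate {
	long resv_huge_pages;
};

struct sketch_subpool {
	long rsv_hpages;
};

/*
 * Models the described contract of hugepage_subpool_get_pages() for a
 * single page: returns 0 if the subpool covered the page from its own
 * reservation, 1 if the global pool has to supply it.
 */
static long sketch_subpool_get_page(struct sketch_subpool *spool)
{
	if (spool->rsv_hpages > 0) {
		spool->rsv_hpages--;
		return 0;
	}
	return 1;
}

/*
 * Proposed rule: decrement the global reservation count exactly when a
 * reserved page from the subpool was used, i.e. when the subpool
 * returned 0. With avoid_reserve set, the subpool reservation is not
 * consulted in this sketch, so the global count is left alone.
 */
static long sketch_alloc(struct sketch_hstate *h,
			 struct sketch_subpool *spool, bool avoid_reserve)
{
	long gbl_chg = avoid_reserve ? 1 : sketch_subpool_get_page(spool);

	if (gbl_chg == 0) {
		/* Invariant: a subpool reservation implies a global one. */
		assert(h->resv_huge_pages > 0);
		h->resv_huge_pages--;
	}
	return gbl_chg;
}
```

In this model the two counters always move together, so the sum over
subpools stays equal to the global count. The surplus-page question
raised above is outside what this sketch covers.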