From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C5B25C00144 for ; Sat, 30 Jul 2022 02:12:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id EF8F88E0001; Fri, 29 Jul 2022 22:12:18 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id ECEB66B0072; Fri, 29 Jul 2022 22:12:18 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D96B88E0001; Fri, 29 Jul 2022 22:12:18 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id C946D6B0071 for ; Fri, 29 Jul 2022 22:12:18 -0400 (EDT) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 85BB3AB7FB for ; Sat, 30 Jul 2022 02:12:18 +0000 (UTC) X-FDA: 79742141556.07.F7A4DD7 Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188]) by imf17.hostedemail.com (Postfix) with ESMTP id F1FB740036 for ; Sat, 30 Jul 2022 02:12:16 +0000 (UTC) Received: from canpemm500002.china.huawei.com (unknown [172.30.72.55]) by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4LvnrZ0c8VzlW5K; Sat, 30 Jul 2022 10:09:34 +0800 (CST) Received: from [10.174.177.76] (10.174.177.76) by canpemm500002.china.huawei.com (7.192.104.244) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.24; Sat, 30 Jul 2022 10:12:10 +0800 Subject: Re: [RFC PATCH v4 6/8] hugetlb: add vma based lock for pmd sharing synchronization To: Mike Kravetz CC: , , Muchun Song , Michal Hocko , Peter Xu , Naoya Horiguchi , David Hildenbrand , "Aneesh Kumar K . V" , Andrea Arcangeli , "Kirill A . Shutemov" , Davidlohr Bueso , Prakash Sangappa , James Houghton , Mina Almasry , Pasha Tatashin , Axel Rasmussen , Ray Fucillo , Andrew Morton References: <20220706202347.95150-1-mike.kravetz@oracle.com> <20220706202347.95150-7-mike.kravetz@oracle.com> <5b8c6b49-e17a-2c0b-4440-ccf3c5493cb2@huawei.com> From: Miaohe Lin Message-ID: Date: Sat, 30 Jul 2022 10:12:09 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8" Content-Language: en-US Content-Transfer-Encoding: 7bit X-Originating-IP: [10.174.177.76] X-ClientProxiedBy: dggems703-chm.china.huawei.com (10.3.19.180) To canpemm500002.china.huawei.com (7.192.104.244) X-CFilter-Loop: Reflected ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=none; spf=pass (imf17.hostedemail.com: domain of linmiaohe@huawei.com designates 45.249.212.188 as permitted sender) smtp.mailfrom=linmiaohe@huawei.com; dmarc=pass (policy=quarantine) header.from=huawei.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1659147138; a=rsa-sha256; cv=none; b=H04wvP7PSVbSID0aH2eR3IEyR0PrZYAdjk0qUdRZ8iPMD0yRaXJMqlkKiPKNIpG19PoiIy cQ6rAO8/4BdDH8X1aP2cYJoPdR0W6pcJIgVoEo9XD7SJWPPyCLJAeA/GoEQ86d3Is8UrXu aFfOj5oBK3IgiiYjyDpEUGGC2v0WQ9o= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1659147138; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=BcGPOv54xqjDqF8U+YfxDPMra2CclItsFmTQZyNHMfo=; b=Fv7tP9yxkLpbm5oAVG1MtNbK40S+EivlZ5rQrs5FHBxRJsBIAGhHEqRF10BkUAD3vSvpQY luOy7496CXET6vdD9BHK6kMIZYH7bTm+GcLhEXmAgdMEr2kO4Xtshugx8vPXe67Hz1B/nB 8HrZLUMyMObMIOUI1sRdkidGlnsHvg8= Authentication-Results: imf17.hostedemail.com; dkim=none; spf=pass (imf17.hostedemail.com: domain of linmiaohe@huawei.com designates 45.249.212.188 as permitted sender) smtp.mailfrom=linmiaohe@huawei.com; dmarc=pass (policy=quarantine) header.from=huawei.com X-Rspamd-Server: rspam04 X-Rspam-User: X-Stat-Signature: eb4mzi9os1iiaut5668xe7e4wwgux6qc X-Rspamd-Queue-Id: F1FB740036 X-HE-Tag: 1659147136-727702 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2022/7/30 2:00, Mike Kravetz wrote: > On 07/29/22 10:55, Miaohe Lin wrote: >> On 2022/7/7 4:23, Mike Kravetz wrote: >>> Allocate a rw semaphore and hang off vm_private_data for >>> synchronization use by vmas that could be involved in pmd sharing. Only >>> add infrastructure for the new lock here. Actual use will be added in >>> subsequent patch. >>> >>> Signed-off-by: Mike Kravetz >>> --- >>> include/linux/hugetlb.h | 36 +++++++++- >>> kernel/fork.c | 6 +- >>> mm/hugetlb.c | 150 ++++++++++++++++++++++++++++++++++++---- >>> mm/rmap.c | 8 ++- >>> 4 files changed, 178 insertions(+), 22 deletions(-) >>> >> >> >> >>> >>> /* Forward declaration */ >>> static int hugetlb_acct_memory(struct hstate *h, long delta); >>> +static bool vma_pmd_shareable(struct vm_area_struct *vma); >>> >>> static inline bool subpool_is_free(struct hugepage_subpool *spool) >>> { >>> @@ -904,6 +905,89 @@ resv_map_set_hugetlb_cgroup_uncharge_info(struct resv_map *resv_map, >>> #endif >>> } >>> >>> +static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma) >>> +{ >>> + return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) && >> >> Should me make __vma_aligned_range_pmd_shareable check (VM_MAYSHARE | VM_SHARED) like above >> instead of VM_MAYSHARE to make code more consistent? >> > > I 'think' we want them to be different. Note this subtle code and > explanation in __unmap_hugepage_range_final(). > > /* > * Clear this flag so that x86's huge_pmd_share page_table_shareable > * test will fail on a vma being torn down, and not grab a page table > * on its way out. We're lucky that the flag has such an appropriate > * name, and can in fact be safely cleared here. We could clear it > * before the __unmap_hugepage_range above, but all that's necessary > * is to clear it before releasing the i_mmap_rwsem. This works > * because in the context this is called, the VMA is about to be > * destroyed and the i_mmap_rwsem is held. > */ > vma->vm_flags &= ~VM_MAYSHARE; > > So, when making a decision to share or not we need to only check VM_MAYSHARE. > When making decisions about about the vma_lock, we need to check both. In most > cases, just VM_MAYSHARE would be sufficient but we need to handle this case > where VM_SHARED and !VM_MAYSHARE. Mostly in the unmap/cleanup cases. Many thanks for your explanation. :) > >>> + vma->vm_private_data; >>> +} >>> + >>> +void hugetlb_vma_lock_read(struct vm_area_struct *vma) >>> +{ >>> + if (__vma_shareable_flags_pmd(vma)) >>> + down_read((struct rw_semaphore *)vma->vm_private_data); >>> +} >>> + >>> +void hugetlb_vma_unlock_read(struct vm_area_struct *vma) >>> +{ >>> + if (__vma_shareable_flags_pmd(vma)) >>> + up_read((struct rw_semaphore *)vma->vm_private_data); >>> +} >>> + >>> +void hugetlb_vma_lock_write(struct vm_area_struct *vma) >>> +{ >>> + if (__vma_shareable_flags_pmd(vma)) >>> + down_write((struct rw_semaphore *)vma->vm_private_data); >>> +} >>> + >>> +void hugetlb_vma_unlock_write(struct vm_area_struct *vma) >>> +{ >>> + if (__vma_shareable_flags_pmd(vma)) >>> + up_write((struct rw_semaphore *)vma->vm_private_data); >>> +} >>> + >>> +int hugetlb_vma_trylock_write(struct vm_area_struct *vma) >>> +{ >>> + if (!__vma_shareable_flags_pmd(vma)) >>> + return 1; >>> + >>> + return down_write_trylock((struct rw_semaphore *)vma->vm_private_data); >>> +} >>> + >>> +void hugetlb_vma_assert_locked(struct vm_area_struct *vma) >>> +{ >>> + if (__vma_shareable_flags_pmd(vma)) >>> + lockdep_assert_held((struct rw_semaphore *) >>> + vma->vm_private_data); >>> +} >>> + >>> +static void hugetlb_free_vma_lock(struct vm_area_struct *vma) >>> +{ >>> + /* Only present in sharable vmas */ >>> + if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED))) Does above check need a comment? VM_SHARED is check here but not in below hugetlb_alloc_vma_lock. Thanks. >>> + return; >>> + >>> + if (vma->vm_private_data) { >>> + kfree(vma->vm_private_data); >>> + vma->vm_private_data = NULL; >>> + } >>> +} >>> + >>> +static void hugetlb_alloc_vma_lock(struct vm_area_struct *vma) >>> +{ >>> + struct rw_semaphore *vma_sema; >>> + >>> + /* Only establish in (flags) sharable vmas */ >>> + if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED))) > > Based on my explanation above, I think this should only check VM_MAYSHARE. > >>> + return; >>> +> + if (!vma_pmd_shareable(vma)) { >>> + vma->vm_private_data = NULL; >>> + return; >>> + } >>> + >>> + vma_sema = kmalloc(sizeof(*vma_sema), GFP_KERNEL); >>> + if (!vma_sema) { >>> + /* >>> + * If we can not allocate semaphore, then vma can not >>> + * participate in pmd sharing. >>> + */ >>> + vma->vm_private_data = NULL; >>> + } else { >>> + init_rwsem(vma_sema); >>> + vma->vm_private_data = vma_sema; >>> + } >> >> This code is really subtle. If it's called from hugetlb_vm_op_open during fork after >> hugetlb_dup_vma_private is done, there should already be a kmalloc-ed vma_sema for this >> vma (because hugetlb_alloc_vma_lock is also called by hugetlb_dup_vma_private). So we >> can't simply change the value of vm_private_data here or vma_sema will be leaked ? > > Yes, I believe your analysis is correct. > >> But >> when hugetlb_alloc_vma_lock is called from hugetlb_reserve_pages, it should work fine. >> Or am I miss something? > > You are right. This is an issue in the current code. I will address in > the next version. > > Thanks for all your comments on this series! >