From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8AEEEB64D9 for ; Fri, 7 Jul 2023 04:41:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 527CC8D0002; Fri, 7 Jul 2023 00:41:11 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4D8048D0001; Fri, 7 Jul 2023 00:41:11 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3C7A58D0002; Fri, 7 Jul 2023 00:41:11 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 2C8E48D0001 for ; Fri, 7 Jul 2023 00:41:11 -0400 (EDT) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id F42221A0412 for ; Fri, 7 Jul 2023 04:41:10 +0000 (UTC) X-FDA: 80983566300.19.804D690 Received: from mail-yb1-f169.google.com (mail-yb1-f169.google.com [209.85.219.169]) by imf17.hostedemail.com (Postfix) with ESMTP id 3261D40002 for ; Fri, 7 Jul 2023 04:41:08 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=wsUTUzSv; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf17.hostedemail.com: domain of surenb@google.com designates 209.85.219.169 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1688704869; a=rsa-sha256; cv=none; b=62hS2GqeRzjJeOrhB52aFoIIEVDT9ffPKjue6xMHvX4QULNCuiQaUEOjYHkwfHpNreGY5E MYq5jqpaoWW36do8D+rCSpn6oxrExqyBGDfMJID4WlGlwEoPGz4cHvKTtEiObZWBe2UGmw r3XAazLyHoPDk+XRTIRvnOvWP0pC1kI= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=wsUTUzSv; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf17.hostedemail.com: domain of surenb@google.com designates 209.85.219.169 as permitted sender) smtp.mailfrom=surenb@google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1688704869; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=ckOL5yFaE2flgZND4kc95RvBBFmf8gW77xFQV6lXywU=; b=gMdXMhkJyC/1bpMtHDGNMLkvcGH1sHT3v0K06aTKHhcxWFMFo+oWrBNgrnn/MHHJGYR5Ti EkaIjr8A99HHIhRQJbF2heKPiBH43LhHp1dN5DZO7KskSOjAVah1+DpVd4UN+P6+UW3fb8 +Lr7IAFa93danJ7wCwPdYnPL0MaZqBE= Received: by mail-yb1-f169.google.com with SMTP id 3f1490d57ef6-bad0c4f6f50so2160339276.1 for ; Thu, 06 Jul 2023 21:41:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1688704868; x=1691296868; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=ckOL5yFaE2flgZND4kc95RvBBFmf8gW77xFQV6lXywU=; b=wsUTUzSvSu209HVqegIZt7JwhstZ2QHOs8ZxtV29iwAJ9L5PrSdy//fIcId6deP46F l0BdDKaldHTdo3UyF/Y/Oipztc9NiXwaV1yFLsNIYqFC3VKBjxMn4y5m1sI6FDNq0fS7 RPfjpNaS1FEQiiYT33fl8k2zryDpsK40pLGSl8qOlOep1j9kzjO1dO4ki77WPpBRGBzh ySheUrWUnCsfmQBYUzjfU/hezEves3eMO78aZV2BBU9MTSmC6yB5DpBtvVr5hx0S8lBQ 1pHYdkeb5OOfdnpQMgm0CkVyFIXMUjHrx9I33Y16iSglvWsGa7IuiadJJwbF/vxgsFfu W6Ag== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1688704868; x=1691296868; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=ckOL5yFaE2flgZND4kc95RvBBFmf8gW77xFQV6lXywU=; b=f45OFLuis8OWi9XjizBQPitDv2RNW1ANMHcIIObIZyN1HqYiycjEckb648PwaL2itT 3fHAtZ/RbyNVpBYBeAtJz6fy+VpwuNxrm4YL8RV19YtsAaWtJtK8wxEnPtVBhGmzxn2N TrkdTHWjgVDc7ckKeoca7i6cxgBVTz7AXz14awM47YzoCfe964C7C9ls0FBfpnug2pg2 90Kf3a2IU3RVoN+fQaLMBnuFt56yy1nwbg8rjeNiIUDa6k3G0RYo0x4qimuAQEAOFawD fozfq/WCbLiMrnA2XORPv6jBVapmGEC8o+lUHy3cSjpif7Y1/jZAhIgxmTMCSeGWgZjj 5Q9Q== X-Gm-Message-State: ABy/qLbo8jydncAExzpw1lBMRZB+WeUIGxmjM38oRdAt8n/w2afunQGO XpwsqWnDZ0eHxFYMS8uhd9da+ve7OU5gsNRnHAt09w== X-Google-Smtp-Source: APBJJlG4Mp7SGvFnM5AUda+Jrz6qEig89UbtxySFOP/2+BOpyys6uA2AiKCRcDvyH4yy2lBCB7/UqN+vcuqQj5t4bAI= X-Received: by 2002:a25:f826:0:b0:ba8:2a74:155 with SMTP id u38-20020a25f826000000b00ba82a740155mr4920668ybd.32.1688704868051; Thu, 06 Jul 2023 21:41:08 -0700 (PDT) MIME-Version: 1.0 References: <20230606124939.93561-1-yu.ma@intel.com> <20230606192013.viiifjcgb6enyilx@revolver> <20230705165411.tfqqipcla7exkb7k@box.shutemov.name> <20230705173348.rxgzxge6ipb4hapy@revolver> <20230705190601.4atlxzh2wxc7zlu6@box.shutemov.name> In-Reply-To: From: Suren Baghdasaryan Date: Thu, 6 Jul 2023 21:40:56 -0700 Message-ID: Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock To: "Kirill A. Shutemov" Cc: "Liam R. Howlett" , Yu Ma , akpm@linux-foundation.org, tim.c.chen@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dave.hansen@intel.com, dan.j.williams@intel.com, shakeelb@google.com, pan.deng@intel.com, tianyou.li@intel.com, lipeng.zhu@intel.com, tim.c.chen@linux.intel.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 3261D40002 X-Stat-Signature: tmpqchotwbe75ikhahi89wrri1j9m7qx X-HE-Tag: 1688704868-732012 X-HE-Meta: U2FsdGVkX1+9uLTB0XsiyTIgrt0HFlI5++NxBnM/Ll7Rgj3Xw3YN15Cx1bDEhdh9QKrtQBxcWmdlNRfLBQWyUQb7nY1iKOvzAU302plixQNG6uX6R7CvcvHqPoYLau+4bNtI+l6czqM9LxraLlcHqum52rXuFmJhN2usmVShq5Jm0ACyloPpHW5bcXszmGunYzmnMUo1FJdyhiZ/YEKoqUA4dCHSedvBLcdfjs23JyX6tDGd7g+I4ANy7Nk0bhCchojD22QM4N9cIhqPRGm46tPLXBax8yNhz7sEdjBR8UnjPtvQY+8fUEDzHsUmYE/8+tnzLeDLOL9AL/0oHTNih/hZ+4TrBsEcmuiTWJepjfGlGzGdBPJKL8MqBDVmpWdOnNCTSbV3OtnrA+l921svR+zaB9UZyAJpJKRK+NoryNpAIkdCykmbzcQHF3b4n6/6Jk2DeeUhTrKwLMDFWQBifyIneTOcv/vrhR2q350cAKQWFM4Udn6/hhEXDh3oV4KtleUml+bZXEgUGjR4p9soAErvtp5H6iKBN2JzbbPR6r+bYsQrdEUBFaJGmWA3ewNmdyi+LEZWsGNmIyGUOQ2noQxK9nEfFAxH/V+H4Wre5Re/hVxFr5/8bIc+i7x39GolC7vm7pWauGzqG+NUK74xwUxK1HFlAIJBEbOCEip4qXWG9vl8quioNMW1oVmBL/R+dCPCbpcHm0JIipIGqZ4UNL1dZVRZZiHMha/E3XugtLzfI/njjAAGHg5RFtM1XiR5FlYnRuz+tBwTpw09uYToOKZ3OixVKRWooeZNhOb4DQk9CndQYO/ozUWcLR6ig19apyw26ehlD+knm93S+Ix5uRBeZxdmWzRIglRj4d9PRa/mwHjA0UpVUqE4S3eJ2eJKb96F3wrOJodIf0NYZeeul/NTVs07u9Gh9lZdLnK4v3lAuHI7XpQ8qiBZ7YEf1D2mm58eB+U0CBLsig16+Ct vZcjx8ui y8NMXJnjqkvqSCEqEmH8x/Nj+BoVhPDm6ELmLxpDHdM8ZnrDDR5Of6XoZCA7lScF3Wd2i7XIrmd16q/F/XUBE1niJKsRDvkP+n8IJQ7hczkoBjKDNV3TQEeqsn4EWApMhw2gghQFlQX+5U+uIOIdymIe+dZq9CUzYqtaYKeacJVX+DI4bwhpBGo4rRcihICvSCrsu/3k9hsNeJuumNg+z1dVuWWLq4pYwDgybYc6W5OOHByVtKXzOpjy1iJMTmt78/QNbou8OgUkWykFHRK/BiyJ1ZXgNjJUvwyLjZBykzSFrpVYiF3pAoQN53g+wicJZRrvmGNqBrrx7R8WHLIxKn5yY9qmCaKZ/oJYwnRXdUYQPistNIhbIunGPkFWIPlA0ap77 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Jul 6, 2023 at 3:04=E2=80=AFPM Suren Baghdasaryan wrote: > > On Wed, Jul 5, 2023 at 12:06=E2=80=AFPM Kirill A. Shutemov wrote: > > > > On Wed, Jul 05, 2023 at 01:33:48PM -0400, Liam R. Howlett wrote: > > > * Kirill A. Shutemov [230705 12:54]: > > > > On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote: > > > > > * Yu Ma [230606 08:23]: > > > > > > UnixBench/Execl represents a class of workload where bash scrip= ts are > > > > > > spawned frequently to do some short jobs. When running multiple= parallel > > > > > > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Bot= h of them > > > > > > come from load_elf_binary through the call chain > > > > > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In d= o_mmap,it will > > > > > > call mmap_region to create vma node, initialize it and insert i= t to vma > > > > > > maintain structure in mm_struct and i_mmap tree of the mapping = file, then > > > > > > increase map_count to record the number of vma nodes used. The = hot osq_lock > > > > > > is to protect operations on file=E2=80=99s i_mmap tree. For the= mm_struct member > > > > > > change like vma insertion and map_count update, they do not aff= ect i_mmap > > > > > > tree. Move those operations out of the lock's critical section,= to reduce > > > > > > hold time on the lock. > > > > > > > > > > > > With this change, on Intel Sapphire Rapids 112C/224T platform, = based on > > > > > > v6.0-rc6, the 160 parallel score improves by 12%. The patch has= no > > > > > > obvious performance gain on v6.4-rc4 due to regression of this = benchmark > > > > > > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: = convert > > > > > > mm's rss stats into percpu_counter). > > > > > > > > > > I didn't think it was safe to insert a VMA into the VMA tree with= out > > > > > holding this write lock? We now have a window of time where a fi= le > > > > > mapping doesn't exist for a vma that's in the tree? Is this alwa= ys > > > > > safe? Does the locking order in mm/rmap.c need to change? > > > > > > > > We hold mmap lock on write here, right? > > > > > > Yes. > > > > > > >Who can observe the VMA until the > > > > lock is released? > > > > > > With CONFIG_PER_VMA_LOCK we can have the VMA read under the rcu > > > read lock for page faults from the tree. I am not sure if the vma is > > > initialized to avoid page fault issues - vma_start_write() should eit= her > > > be taken or initialise the vma as this is the case. > > > > Right, with CONFIG_PER_VMA_LOCK the vma has to be unusable until it is > > fully initialized, effectively providing the same guarantees as mmap wr= ite > > lock. If it is not the case, it is CONFIG_PER_VMA_LOCK bug. > > Jumping into the conversation. > > If we are adding a VMA into the tree before it's fully usable then we > should write-lock it before it becomes visible to page faults. Kirill > is right that there is a problem and we should not rely on > vma->vm_file->f_mapping lock here. Instead we should write-lock the > VMA before adding it into the tree even without this change. > IIUC, the rule with mmap_lock is that VMA can be safely modified if it > is either isolated or if we write-locked the mmap_lock. With > CONFIG_PER_VMA_LOCK the same rule should apply to per-VMA locks - the > VMA should be either isolated or we should write-lock it. Here we are > adding the unlocked VMA into the tree and then we keep modifying it. > This did not bite us because modifications are only done to > file-backed VMAs and we do not handle file-backed page faults under > per-VMA locks yet, however this will become a problem once we start > supporting that. > If we all agree to the above I can post a change to lock the VMA > before adding it into the tree. This patch is addressing the issue with per-VMA locks here: https://lore.kernel.org/all/20230707043211.3682710-2-surenb@google.com/ > > > > > > There is also a possibility of a driver mapping a VMA and having entr= y > > > points from other locations. It isn't accessed through the tree thou= gh > > > so I don't think this change will introduce new races? > > > > Right. > > > > > > It cannot be retrieved from the VMA tree as it requires at least re= ad mmap > > > > lock. And the VMA doesn't exist anywhere else. > > > > > > > > I believe the change is safe. > > > > > > I guess insert_vm_struct(), and vma_link() callers should be checked = and > > > updated accordingly? > > > > Yep. > > > > -- > > Kiryl Shutsemau / Kirill A. Shutemov > >