From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6B95EEB64D9 for ; Thu, 6 Jul 2023 22:05:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6A93A6B0072; Thu, 6 Jul 2023 18:05:10 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 65A1B8D0001; Thu, 6 Jul 2023 18:05:10 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 548DC6B0075; Thu, 6 Jul 2023 18:05:10 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 4347C6B0072 for ; Thu, 6 Jul 2023 18:05:10 -0400 (EDT) Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id E74611C9046 for ; Thu, 6 Jul 2023 22:05:09 +0000 (UTC) X-FDA: 80982568338.10.B021922 Received: from mail-wr1-f50.google.com (mail-wr1-f50.google.com [209.85.221.50]) by imf23.hostedemail.com (Postfix) with ESMTP id 84A1514000C for ; Thu, 6 Jul 2023 22:05:05 +0000 (UTC) Authentication-Results: imf23.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=WXmiyVCY; spf=pass (imf23.hostedemail.com: domain of surenb@google.com designates 209.85.221.50 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1688681105; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=095hVI2EWirxAlkmhEDyiv1L1+keL+0mbtKAzVVggwk=; b=u/nkVbUH+r1KJS/oRUSo9DLbffRdXPjAi/7Tpcl1LQ8Mum7cKUMuUK6mGtuKhg4sO6tS00 b3QGk3uIB/3uGPokVoj5gly7rnggfkMmVzslEJADm6yK1SSdoIFkOg15pUsGP8ONZVd0Up kDtOw2wjZW8aICfc8BwY4fuSF+yt+bg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1688681105; a=rsa-sha256; cv=none; b=1QgZ7OYMC6you+ykkRqVhHwWZeyEehZ7M2/YU2QEWXoRqcGsDda4BLxfKKkMh9aEoj7ney y5/Y68+GTVVtHwSn8NrGZVSONc7KHOq9aDsLNjgMCreiU2DmgHm6+eiamkBYq9pfcsN5eO 8PYZoR1ohlESp+bn2hFs6kXlltH/gzY= ARC-Authentication-Results: i=1; imf23.hostedemail.com; dkim=pass header.d=google.com header.s=20221208 header.b=WXmiyVCY; spf=pass (imf23.hostedemail.com: domain of surenb@google.com designates 209.85.221.50 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-wr1-f50.google.com with SMTP id ffacd0b85a97d-314313f127fso1209740f8f.1 for ; Thu, 06 Jul 2023 15:05:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1688681104; x=1691273104; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=095hVI2EWirxAlkmhEDyiv1L1+keL+0mbtKAzVVggwk=; b=WXmiyVCY+NLlIK9v9BEY8CAwJuUEHu7nxwfCTtrrUYJQPCh9MowVSJDhRqGxSwhLNV vHw0OzGaFZaZX/qNdLD0QRzodzYCKlaUa0utyPpMbv5rPtyfEYGXmMqWJa3Nkp+TqSay G8WoN+aFMELxcjhw/xb6rJdHUOFYTq1jeV9J009LNUuzyCau4Q2ZD+UTCVkdBGP/ihAR f/lnhjKeToOeqSakJ3SCIURANruWkY8IGOGbBXNhUwdmpshSAhvDeiCB95mCW2p15SG7 j8c3+ymI+YXyUQ1H3sP9oqyboocoC0wyh8sJtysqOYl7ERavUso7qY48UK1RQjAOs1Ml yjdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1688681104; x=1691273104; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=095hVI2EWirxAlkmhEDyiv1L1+keL+0mbtKAzVVggwk=; b=Bf1uRoSvdgJSwX2FbHdcRnNZSG6vdNhCC9yNfaqVdVvMlutBqtM7yjxH/pV8mnw3LR 260IppHZb6rKcn5YjOJR7Cax/3vb5CnEF1zvvi1JsmYYVcCZM5gaZF3NOkGkc4/nUtwG KpJTQm8V60/ecesVSSPbOFLrzMQEBBgLJ3izuj+oDQYdNU1mGTZ63y50T8BzJqYYOii2 zrGlD4cxHuBj42jZStuXz41mQk5WpU4rDGjhfotu5jye9md58e2zmTDhxYi3lghlpN0G UQ7Xf7rWk61eqz+M+bOjXGMWyuSVVtadO7prfS4aYABiiQy2z9hDzxxrLqkk+5lqUqFF uYew== X-Gm-Message-State: ABy/qLa+WeUnU2uY9kh6NKc559e8Oy6bokfJC+Z3Kmw675rAXu3s9GIF siUGIcpXcNQbH6B5Bz5/PUyspSUz1jiTcbUlYyKqkw== X-Google-Smtp-Source: APBJJlF+1RaJl/q4Src0LsWuzMtRR0Y38MJITaCoy8SGhE5EiEEerwpRMwFAhnSBiK0C/DDDzupmZ4QeKgA345lyk/c= X-Received: by 2002:adf:fed1:0:b0:30e:4515:1529 with SMTP id q17-20020adffed1000000b0030e45151529mr2084387wrs.37.1688681103470; Thu, 06 Jul 2023 15:05:03 -0700 (PDT) MIME-Version: 1.0 References: <20230606124939.93561-1-yu.ma@intel.com> <20230606192013.viiifjcgb6enyilx@revolver> <20230705165411.tfqqipcla7exkb7k@box.shutemov.name> <20230705173348.rxgzxge6ipb4hapy@revolver> <20230705190601.4atlxzh2wxc7zlu6@box.shutemov.name> In-Reply-To: <20230705190601.4atlxzh2wxc7zlu6@box.shutemov.name> From: Suren Baghdasaryan Date: Thu, 6 Jul 2023 15:04:51 -0700 Message-ID: Subject: Re: [PATCH] mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock To: "Kirill A. Shutemov" Cc: "Liam R. Howlett" , Yu Ma , akpm@linux-foundation.org, tim.c.chen@intel.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, dave.hansen@intel.com, dan.j.williams@intel.com, shakeelb@google.com, pan.deng@intel.com, tianyou.li@intel.com, lipeng.zhu@intel.com, tim.c.chen@linux.intel.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 84A1514000C X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: 5mfr4d9neh1ed9zbo1aefpijen1jsdgs X-HE-Tag: 1688681105-749625 X-HE-Meta: U2FsdGVkX19DDHDYPwWSACVfLhH3fzp/mFqK3YAva89DPrWyHPG1R9OCCsH+MMHu4i+L3J8sZeg2LWlp8wRGFDfDOknWmr1Q7PY5NTHRxkzouFP74d8b0QwWnWKLt353hnllXeuUzcsA+B8jnOV7BqhIc9xQFUXHhF6dC5IEDKp+d0JITV2SfzEeEBgVnUwroIfwfgJTwCJgJzNeacjIKGcgjK6LYNxwg4IIrk6N95N6ovMtOYG3e4ErUw+vj/PR19oxkotlcRGiwCyJ9bzAee62mtqdhbo7DdN7Si5GIUuKIwOWfIBZgUWSG67bweMUYdNO3sJhYaGYfVVOUc8Ph3KqXM5k7tHLjVgUHf9SLL3DHChxtL/5dzz4rqfnds1OwKM7y06esdfHi2ay4H4603l2avtnIR4HiIoKTjFRNLvou7q/34lCfQOmpVOYlEpEtaPL9E6RowBOcdD3ejGTZ8kgvm+f5NPwc+GYKaIEjcm3F7KE8wjOaGo19F270r6VGF+EgjW28ztUqUXVOM2Tq302COUMnI7YBd1baKxxsUFNGeCvYTGxFsAvDANXyopKYaFSo0zrtoET3ludZ67IRQslrsFPiu/Sz8LWoeqw/MHZGtTG42YkrIDoJHyioe1EvzUkknssv0CTFpN0xyLLPgMLcCxFggn1cTpxCsmNWSx1fUK5HBJ5K3LrDNshexMpjmZMc9CQr5xmIVUyrQ9XWPS/fg2sI6t8ZlYeH4BugUIKNB6YjyfbqSmuU47oagJbVIBJNd4ISMLCFDyFwxAU0JW9oH4nfawhqeSDdbwg4xODhiPwY82ZieJcCTDN91yiFoIvOvv1Eq/ulVM/jIs+3XWwhqGwUEtOJtb6vO32OXhLlHTB/WFbPt7FdRyu32uqsbc+p07V90srcp5rVdpdCRrHjBpHvpMLZRk+hC65KNoMb/ddXpYwqllGyW7ZHqMw1iniEw+b9nlLvqsmvPg pDFETogd hCh76lOKj2+gX1yLhqkblsM2Q6EufN07TFgEO9OJuchft+tWlQsgePZAgJ4lwbJOWeyiJti4d72QR2f1tcnZTLhCeNY0HDZ/NWzcN/CC1y9qHjbdguPY3ylTOw/ZkLx9mJFEh+M14EW3scCO/KPU6v3T2LtVXW28QGavffJfKKik1RyYuYZM/pu4/RYvsF8/2PGtpKL2SkUdt7d/DS1esNtuAApFpNtD96yQkDL0uWovd+C5dFqE2NJlo9hc4Y9rpMYyqr8qNB1cUKP8CP/89uBjmHTKsJ3Lrd5j+n43VXLYShgI= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Jul 5, 2023 at 12:06=E2=80=AFPM Kirill A. Shutemov wrote: > > On Wed, Jul 05, 2023 at 01:33:48PM -0400, Liam R. Howlett wrote: > > * Kirill A. Shutemov [230705 12:54]: > > > On Tue, Jun 06, 2023 at 03:20:13PM -0400, Liam R. Howlett wrote: > > > > * Yu Ma [230606 08:23]: > > > > > UnixBench/Execl represents a class of workload where bash scripts= are > > > > > spawned frequently to do some short jobs. When running multiple p= arallel > > > > > tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both = of them > > > > > come from load_elf_binary through the call chain > > > > > "execl->do_execveat_common->bprm_execve->load_elf_binary". In do_= mmap,it will > > > > > call mmap_region to create vma node, initialize it and insert it = to vma > > > > > maintain structure in mm_struct and i_mmap tree of the mapping fi= le, then > > > > > increase map_count to record the number of vma nodes used. The ho= t osq_lock > > > > > is to protect operations on file=E2=80=99s i_mmap tree. For the m= m_struct member > > > > > change like vma insertion and map_count update, they do not affec= t i_mmap > > > > > tree. Move those operations out of the lock's critical section, t= o reduce > > > > > hold time on the lock. > > > > > > > > > > With this change, on Intel Sapphire Rapids 112C/224T platform, ba= sed on > > > > > v6.0-rc6, the 160 parallel score improves by 12%. The patch has n= o > > > > > obvious performance gain on v6.4-rc4 due to regression of this be= nchmark > > > > > from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: co= nvert > > > > > mm's rss stats into percpu_counter). > > > > > > > > I didn't think it was safe to insert a VMA into the VMA tree withou= t > > > > holding this write lock? We now have a window of time where a file > > > > mapping doesn't exist for a vma that's in the tree? Is this always > > > > safe? Does the locking order in mm/rmap.c need to change? > > > > > > We hold mmap lock on write here, right? > > > > Yes. > > > > >Who can observe the VMA until the > > > lock is released? > > > > With CONFIG_PER_VMA_LOCK we can have the VMA read under the rcu > > read lock for page faults from the tree. I am not sure if the vma is > > initialized to avoid page fault issues - vma_start_write() should eithe= r > > be taken or initialise the vma as this is the case. > > Right, with CONFIG_PER_VMA_LOCK the vma has to be unusable until it is > fully initialized, effectively providing the same guarantees as mmap writ= e > lock. If it is not the case, it is CONFIG_PER_VMA_LOCK bug. Jumping into the conversation. If we are adding a VMA into the tree before it's fully usable then we should write-lock it before it becomes visible to page faults. Kirill is right that there is a problem and we should not rely on vma->vm_file->f_mapping lock here. Instead we should write-lock the VMA before adding it into the tree even without this change. IIUC, the rule with mmap_lock is that VMA can be safely modified if it is either isolated or if we write-locked the mmap_lock. With CONFIG_PER_VMA_LOCK the same rule should apply to per-VMA locks - the VMA should be either isolated or we should write-lock it. Here we are adding the unlocked VMA into the tree and then we keep modifying it. This did not bite us because modifications are only done to file-backed VMAs and we do not handle file-backed page faults under per-VMA locks yet, however this will become a problem once we start supporting that. If we all agree to the above I can post a change to lock the VMA before adding it into the tree. > > > There is also a possibility of a driver mapping a VMA and having entry > > points from other locations. It isn't accessed through the tree though > > so I don't think this change will introduce new races? > > Right. > > > > It cannot be retrieved from the VMA tree as it requires at least read= mmap > > > lock. And the VMA doesn't exist anywhere else. > > > > > > I believe the change is safe. > > > > I guess insert_vm_struct(), and vma_link() callers should be checked an= d > > updated accordingly? > > Yep. > > -- > Kiryl Shutsemau / Kirill A. Shutemov >