From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 15 Aug 2024 15:44:05 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Suren Baghdasaryan
Cc: Andrii Nakryiko, linux-trace-kernel@vger.kernel.org, peterz@infradead.org,
 oleg@redhat.com, rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org,
 linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org,
 willy@infradead.org, akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution
References: <20240813042917.506057-1-andrii@kernel.org>
 <20240813042917.506057-14-andrii@kernel.org>
 <7byqni7pmnufzjj73eqee2hvpk47tzgwot32gez3lb2u5lucs2@5m7dvjrvtmv2>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik wrote:
> >
> > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > attempting uprobe lookup speculatively.
> > >
> > > We rely on the newly added mmap_lock_speculation_{start,end}() helpers
> > > to validate that the mm_struct stays intact for the entire duration of
> > > this speculation. If not, we fall back to the mmap_lock-protected
> > > lookup.
> > >
> > > This allows us to avoid contention on mmap_lock in the absolute
> > > majority of cases, nicely improving uprobe/uretprobe scalability.
> > >
> >
> > Here I have to admit to being mostly ignorant about the mm, so bear with
> > me. :>
> >
> > I note the result of find_active_uprobe_speculative is immediately stale
> > in the face of modifications.
> >
> > The thing I'm after is that the mmap_lock_speculation business adds
> > overhead on archs where a release fence is not a de facto nop, and I
> > don't believe the commit message justifies it. Definitely a bummer to
> > add it merely for uprobes. If there are bigger plans concerning it,
> > that's a different story of course.
> >
> > With this in mind I have to ask if instead you could perhaps get away
> > with the already present per-vma sequence counter?
>
> The per-vma sequence counter does not implement acquire/release logic;
> it relies on vma->vm_lock for synchronization. So if we want to use it,
> we would have to add additional memory barriers here. This is likely
> possible, but as I mentioned before, we would need to ensure the
> pagefault path does not regress. OTOH mm->mm_lock_seq is already halfway
> there (it implements acquire/release logic); we just had to ensure
> mmap_write_lock() increments mm->mm_lock_seq.
>
> So, from the release fence overhead POV, I think whether we use
> mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> here.
>

Per my previous e-mail I'm not particularly familiar with mm internals,
so I'm going to handwave a little bit with my $0.03 concerning multicore
in general, and if you disagree with it that's your business.
For the time being I have no interest in digging into any of this.

Before I do, to prevent this thread from being a total waste, here are
some remarks concerning the patch, with the assumption that the core
idea lands.

From the commit message:
> Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> attempting uprobe lookup speculatively.

Just in case I'll note a nit: this paragraph will need to be removed
since the patch adding the flag is getting dropped.

A non-nit which may or may not end up mattering is that the flag (which
*is* set on the filep slab cache) makes things more difficult to
validate. Normal RCU usage guarantees that the object itself won't be
freed as long as you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
flag weakens this significantly -- the thing at hand will always be a
'struct file', but it may get reallocated to *another* file from under
you. Whether this aspect plays a role here I don't know.

> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> +	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> +	struct mm_struct *mm = current->mm;
> +	struct uprobe *uprobe;
> +	struct vm_area_struct *vma;
> +	struct file *vm_file;
> +	struct inode *vm_inode;
> +	unsigned long vm_pgoff, vm_start;
> +	int seq;
> +	loff_t offset;
> +
> +	if (!mmap_lock_speculation_start(mm, &seq))
> +		return NULL;
> +
> +	rcu_read_lock();
> +

I don't think there is a correctness problem here, but entering rcu
*after* deciding to speculatively do the lookup feels backwards.

> +	vma = vma_lookup(mm, bp_vaddr);
> +	if (!vma)
> +		goto bail;
> +
> +	vm_file = data_race(vma->vm_file);
> +	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> +		goto bail;
> +

If vma teardown is allowed to progress and the file got fput'ed...

> +	vm_inode = data_race(vm_file->f_inode);

... the inode can be NULL; I don't know if that's handled.
More importantly though, per my previous description of
SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
the inode you did find is completely unrelated.

I understand the intent is to backpedal from everything should the mm
sequence count change, but the above may happen to matter.

> +	vm_pgoff = data_race(vma->vm_pgoff);
> +	vm_start = data_race(vma->vm_start);
> +
> +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> +	uprobe = find_uprobe_rcu(vm_inode, offset);
> +	if (!uprobe)
> +		goto bail;
> +
> +	/* now double check that nothing about MM changed */
> +	if (!mmap_lock_speculation_end(mm, seq))
> +		goto bail;

This leaks the reference obtained by find_uprobe_rcu().

> +
> +	rcu_read_unlock();
> +
> +	/* happy case, we speculated successfully */
> +	return uprobe;
> +bail:
> +	rcu_read_unlock();
> +	return NULL;
> +}

Now to some handwaving, here it is:

The core of my concern is that adding more work to down_write on the
mmap semaphore comes with certain side-effects, and plausibly more than
a sufficient speed-up can be achieved without doing it.

An mm-wide mechanism is just incredibly coarse-grained, and it may
happen to perform poorly when faced with a program which likes to mess
with its address space -- the fast path is going to keep failing and
only induce *more* overhead as the code decides to down_read the mmap
semaphore.

Furthermore, there may be work currently synchronized with down_write
which perhaps can transition to "merely" down_read, but by the time that
happens, this and possibly other consumers will expect a change in the
sequence counter, messing with it.

To my understanding the kernel supports parallel faults with per-vma
locking. I would find it surprising if the same machinery could not be
used to sort out the uprobe handling above. I presume a down_read on the
vma around all the work would also sort out any issues concerning
stability of the file or inode objects.
Of course single-threaded performance would take a hit due to the
atomics stemming from down/up_read, and parallel uprobe lookups on the
same vma would also get slower, but I don't know if that's a problem for
a real workload.

I would not have any comments if all the speed-ups were achieved without
modifying non-uprobe code.