From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 23 Mar 2023 18:36:17 +0800
From: Baoquan He <bhe@redhat.com>
To: Lorenzo Stoakes
Cc: David Hildenbrand, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Andrew Morton, Uladzislau Rezki,
 Matthew Wilcox, Liu Shixin, Jiri Olsa, Jens Axboe, Alexander Viro
Subject: Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
References: <941f88bc5ab928e6656e1e2593b91bf0f8c81e1b.1679511146.git.lstoakes@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On 03/23/23 at 06:44am, Lorenzo Stoakes wrote:
> On Thu, Mar 23, 2023 at 10:52:09AM +0800, Baoquan He wrote:
> > On 03/22/23 at 06:57pm, Lorenzo Stoakes wrote:
> > > Having previously laid the foundation for converting vread() to an
> > > iterator function, pull the trigger and do so.
> > >
> > > This patch attempts to provide minimal refactoring and to reflect
> > > the existing logic as best we can, for example we continue to zero
> > > portions of memory not read, as before.
> > >
> > > Overall, there should be no functional difference other than a
> > > performance improvement in /proc/kcore access to vmalloc regions.
> > >
> > > Now we have eliminated the need for a bounce buffer in
> > > read_kcore_iter(), we dispense with it, and try to write to user
> > > memory optimistically but with faults disabled via
> > > copy_page_to_iter_nofault(). We already have preemption disabled by
> > > holding a spin lock. We continue faulting in until the operation is
> > > complete.
> >
> > I don't understand the sentences here. In vread_iter(), the actual
> > content reading is done in aligned_vread_iter(), otherwise we
> > zero-fill the region. In aligned_vread_iter(), we will use
> > vmalloc_to_page() to get the mapped page and read it out, otherwise
> > zero-fill. While in this patch, fault_in_iov_iter_writeable() faults
> > in the iter's memory once and will bail out if that fails. I am
> > wondering why we continue faulting in until the operation is
> > complete, and how that is done.
>
> This is referring to what's happening in kcore.c, not vread_iter(),
> i.e. the looped read/fault-in.
>
> The reason we bail out if fault_in_iov_iter_writeable() fails is that
> it would indicate an error had occurred.
>
> The whole point is to _optimistically_ try to perform the operation
> assuming the pages are faulted in. Ultimately we attempt the copy via
> copy_to_user_nofault() which will either copy data or fail if the
> pages are not faulted in (will discuss this below a bit more in
> response to your other point).
>
> If this fails, then we fault in, and try again. We loop because there
> could be some extremely unfortunate timing with a race on e.g.
> swapping out or migrating pages between faulting in and trying to
> write out again.
>
> This is extremely unlikely, but to avoid any chance of breaking
> userland we repeat the operation until it completes. In nearly all
> real-world situations it'll either work immediately or loop once.

Thanks a lot for these helpful details with patience. I got it now. I
was mainly confused by the while (true) loop in the KCORE_VMALLOC case
of read_kcore_iter().

Now, is there any chance that the faulted-in memory is swapped out or
migrated again before vread_iter()? Will fault_in_iov_iter_writeable()
pin the memory? I didn't find that in the code or documentation; it
seems it only faults in memory. If so, there's a window between
faulting in and copy_to_user_nofault().

> > If we look into the failing point in vread_iter(), it's mainly
> > coming from copy_page_to_iter_nofault(), e.g. page_copy_sane()
> > checking failed, or i->data_source checking failed. If these
> > conditional checks fail, should we continue reading again and again?
> > And this is not related to memory faulting in. I saw your discussion
> > with David, but I am still a little lost. Hope I can learn it,
> > thanks in advance.
>
> Actually neither of these are going to happen.
> page_copy_sane() checks the sanity of the _source_ pages, and the
> 'sanity' is defined by whether your offset and length sit within the
> (possibly compound) folio. Since we control this, we can arrange for
> it never to happen.
>
> i->data_source is checking that it's an output iterator, however we
> would already have checked this when writing ELF headers at the bare
> minimum, so we cannot reach this point with an invalid iterator.
>
> Therefore it is not possible for either to cause a failure. What could
> cause a failure, and what we are checking for, is specified in
> copyout_nofault() (in iov_iter.c) which we pass to the
> iterate_and_advance() macro. Now we have a fault-injection
> should_fail_usercopy() which would just trigger a redo, or
> copy_to_user_nofault() returning < 0 (e.g. -EFAULT).
>
> This code is confusing as this function returns the number of bytes
> _not copied_ rather than copied. I have tested this to be sure by the
> way :)
>
> Therefore the only way for a failure to occur is for memory to not be
> faulted in, and thus the loop only triggers in this situation. If we
> fail to fault in pages for any reason, the whole operation aborts so
> this should cover all angles.
>
> > ......
> > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > index 08b795fd80b4..25b44b303b35 100644
> > > --- a/fs/proc/kcore.c
> > > +++ b/fs/proc/kcore.c
> > ......
> > > @@ -507,13 +503,30 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >
> > >  		switch (m->type) {
> > >  		case KCORE_VMALLOC:
> > > -			vread(buf, (char *)start, tsz);
> > > -			/* we have to zero-fill user buffer even if no read */
> > > -			if (copy_to_iter(buf, tsz, iter) != tsz) {
> > > -				ret = -EFAULT;
> > > -				goto out;
> > > +		{
> > > +			const char *src = (char *)start;
> > > +			size_t read = 0, left = tsz;
> > > +
> > > +			/*
> > > +			 * vmalloc uses spinlocks, so we optimistically try to
> > > +			 * read memory. If this fails, fault pages in and try
> > > +			 * again until we are done.
> > > +			 */
> > > +			while (true) {
> > > +				read += vread_iter(iter, src, left);
> > > +				if (read == tsz)
> > > +					break;
> > > +
> > > +				src += read;
> > > +				left -= read;
> > > +
> > > +				if (fault_in_iov_iter_writeable(iter, left)) {
> > > +					ret = -EFAULT;
> > > +					goto out;
> > > +				}
> > >  			}
> > >  			break;
> > > +		}
> > > 		case KCORE_USER:
> > > 			/* User page is handled prior to normal kernel page: */
> > > 			if (copy_to_iter((char *)start, tsz, iter) != tsz) {
> > >