Date: Sun, 26 Mar 2023 15:20:56 +0100
From: Lorenzo Stoakes
To: David Laight
Cc: 'Baoquan He', David Hildenbrand, "linux-mm@kvack.org",
 "linux-kernel@vger.kernel.org", "linux-fsdevel@vger.kernel.org",
 Andrew Morton, Uladzislau Rezki, Matthew Wilcox, Liu Shixin,
 Jiri Olsa, Jens Axboe, Alexander Viro
Subject: Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
Message-ID: <01d87b9f-3c8b-4f92-89c2-3e07420e9c67@lucifer.local>
References: <941f88bc5ab928e6656e1e2593b91bf0f8c81e1b.1679511146.git.lstoakes@gmail.com>
 <7aee68e9-6e31-925f-68bc-73557c032a42@redhat.com>
 <0cff573c3a344504b1b1b77486b4d853@AcuMS.aculab.com>
In-Reply-To: <0cff573c3a344504b1b1b77486b4d853@AcuMS.aculab.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Sun, Mar 26, 2023 at 01:26:57PM +0000, David Laight wrote:
> From: Baoquan He
> > Sent: 23 March 2023 13:32
> ...
> > > > > If this fails, then we fault in, and try again. We loop because there could
> > > > > be some extremely unfortunate timing with a race on e.g.
swapping out or
> > > > > migrating pages between faulting in and trying to write out again.
> > > > >
> > > > > This is extremely unlikely, but to avoid any chance of breaking userland we
> > > > > repeat the operation until it completes. In nearly all real-world
> > > > > situations it'll either work immediately or loop once.
> > > >
> > > > Thanks a lot for these helpful details with patience. I got it now. I was
> > > > mainly confused by the while(true) loop in the KCORE_VMALLOC case of
> > > > read_kcore_iter().
> > > >
> > > > Now is there any chance that the faulted-in memory is swapped out or
> > > > migrated again before vread_iter()? Will fault_in_iov_iter_writeable()
> > > > pin the memory? I didn't find that in the code or documentation. It seems
> > > > it only faults in memory. If so, there's a window between faulting in and
> > > > copy_to_user_nofault().
> > >
> > > See the documentation of fault_in_safe_writeable():
> > >
> > > "Note that we don't pin or otherwise hold the pages referenced that we fault
> > > in. There's no guarantee that they'll stay in memory for any duration of
> > > time."
> >
> > Thanks for the info. Then swapping out/migration could happen again, so
> > that's why the while(true) loop is meaningful.
>
> One of the problems is that if the system is under severe memory
> pressure and you try to fault in (say) 20 pages, the first page
> might get unmapped in order to map the last one in.
>
> So it is quite likely better to retry 'one page at a time'.

If you look at the kcore code, it is in fact only faulting in one page at a
time. tsz never exceeds PAGE_SIZE, so we never attempt to fault in or copy
more than one page at a time, e.g.:-

	if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
		tsz = buflen;

	...

	tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);

It might be a good idea to make this totally explicit in vread_iter()
(perhaps renaming it vread_page_iter() or such), but I think that is
better left for another patch series.
>
> There have also been cases where the instruction to copy data
> has faulted for reasons other than 'page fault'.
> ISTR an infinite loop being caused by misaligned accesses failing
> due to 'bad instruction choice' in the copy code.
> While this is really a bug, an infinite retry in a file read/write
> didn't make it easy to spot.

I am not sure it's reasonable to avoid writing code just in case an arch
implements buggy user copy code (do correct me if I'm misunderstanding you
about this!). By that token, wouldn't a lot more be broken in that
situation? I don't imagine all the other areas of the kernel would make it
explicitly clear to you that this was the problem.

> So maybe there are cases where dropping back to a 'bounce buffer'
> may be necessary.

One approach could be to reinstate the kernel bounce buffer, set up an
iterator that points to it, and pass that in after one failed attempt with
the userland address. But that feels like overkill: in the case of an
alignment issue, surely the fault would still occur and we'd just error out
anyway? Again, I'm not sure bending over backwards to account for possibly
buggy arch code is sensible.

Ideally the iterator code would explicitly pass back the EFAULT error,
which we could then handle explicitly, but that would probably require
quite significant rework there, which feels out of scope for this change.

We could implement some maximum number of attempts, which statistically
must reduce the odds of repeated faults in the tiny window between fault-in
and copy to effectively zero. But I'm not sure the other David would be
happy with that!
If we were to make a change to be extra careful, I'd opt for simply trying
X times then giving up. Given we're doing this a page at a time, I don't
think X need be that large before any swap-out/migration bad luck becomes
so unlikely that we're competing with heat-death-of-the-universe timescales
before it might happen (again, I may be missing some common scenario where
the same single page swaps out/migrates over and over, please correct me if
so).

However, I think there's a case to be made that it's fine as-is, unless
there is another scenario we are overly concerned about?

> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>