Message-ID: <967ccf33-0982-6042-e4ce-b0c859b4c3b1@redhat.com>
Date: Mon, 10 Jul 2023 11:57:50 +0200
From: David Hildenbrand
Organization: Red Hat
To: "Yin, Fengwei", Matthew Wilcox
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, yuzhao@google.com,
 ryan.roberts@arm.com, shy828301@gmail.com, akpm@linux-foundation.org
References: <20230707165221.4076590-1-fengwei.yin@intel.com>
 <4bb39d6e-a324-0d85-7d44-8e8a37a1cfec@redhat.com>
 <436cd29f-44a6-7636-5015-377051942137@intel.com>
 <676ce1b3-6c72-011e-3a4f-723945db3d31@intel.com>
 <04efd5eb-06c2-d449-8427-d7c30df962d1@redhat.com>
Subject: Re: [RFC PATCH 0/3] support large folio for mlock
Content-Type: text/plain; charset=UTF-8; format=flowed

On 10.07.23 11:43, Yin, Fengwei wrote:
> Hi David,
>
> On 7/10/2023 5:32 PM, David Hildenbrand wrote:
>> On 09.07.23 15:25, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
>>>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>>>> The old folio will go on the deferred_list and be split later, or its
>>>> valid parts will be written to swap and then it can be freed.
>>> If the large folio splitting failure is because of GUP pages, can we
>>> do copy here?
>>>
>>> Let's say, if the GUP page is target of DMA operation and DMA operation
>>> is ongoing. We allocated a new page and copy GUP page content to the
>>> new page, the data in the new page can be corrupted.
>>
>> No, we may only replace anon pages that are flagged as maybe shared
>> (!PageAnonExclusive). We must not replace pages that are exclusive
>> (PageAnonExclusive) unless we first try marking them maybe shared.
>> Clearing will fail if the page maybe pinned.
> Thanks a lot for clarification.
>
> So my understanding is that if large folio splitting fails, it's not always
> true that we can allocate new folios, copy original large folio content to
> new folios, remove original large folio from VMA and map the new folios to
> VMA (like it's only true if original large folio is marked as maybe shared).
>

While it might work in many cases, there are some corner cases where it
won't work.

So to summarize:

(1) THP are transparent and should not result in arbitrary syscall
    failures.

(2) Splitting a THP might fail at random points in time, either due to
    GUP pins or due to speculative page references (including
    speculative GUP pins).

(3) Replacing an exclusive anon page that may be pinned will result in
    memory corruption.

So we can try to split any THP that crosses VMA borders on VMA
modifications (split due to munmap, mremap, madvise, mprotect, mlock,
...), but it's not guaranteed to work due to (2). And we can try to
replace such pages, but that's not guaranteed to be allowed due to (3).
And as it's all transparent, we cannot fail, due to (1).

For the other cases that Willy and I discussed (split on VMA
modifications after fork()), we can at least always replace the anon
page.
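
To make the replacement rule concrete, here is a minimal userspace sketch
in C. It is only a model: struct model_page and both helper functions are
hypothetical names made up for illustration, not kernel APIs, and the
"maybe pinned" state is reduced to a single flag. It encodes just the
rule discussed above: maybe-shared anon pages may be replaced, exclusive
ones only after the exclusive marker was cleared successfully, and
clearing fails if the page may be pinned.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of an anon page; not a kernel structure. */
struct model_page {
    bool anon_exclusive; /* stands in for PageAnonExclusive */
    bool maybe_pinned;   /* stands in for "page may have GUP pins" */
};

/*
 * Try to turn an exclusive page into a maybe-shared one.
 * Fails, leaving the page untouched, if the page may be pinned.
 */
static bool model_try_mark_maybe_shared(struct model_page *page)
{
    assert(page->anon_exclusive);
    if (page->maybe_pinned)
        return false;
    page->anon_exclusive = false;
    return true;
}

/* Returns true if copying to a new page and remapping would be allowed. */
static bool model_can_replace(struct model_page *page)
{
    if (!page->anon_exclusive)
        return true; /* already maybe shared */
    return model_try_mark_maybe_shared(page);
}
```

In this model, a pinned exclusive page is never replaced and keeps its
exclusive marker, which is the case where copying could corrupt data
that is the target of ongoing DMA.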

What always works is putting the THP on the deferred split queue to see
if we can split it later. The deferred split queue is a bit suboptimal
right now, because it requires the (sub)page mapcounts to detect whether
the folio is partially mapped vs. fully mapped. If we want to get rid of
that, we have to come up with something reasonable.

I was wondering if we could have an optimized deferred split queue that
only conditionally splits: do an rmap walk and detect if

(a) each page of the folio is still mapped
(b) the folio does not cross a VMA

If both are met, one could skip the deferred split. That needs a bit of
thought, but we're already doing an rmap walk when splitting, so
scanning which parts are actually mapped does not sound too weird.
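
A rough sketch of that conditional check, again as a userspace model in
C rather than kernel code. struct model_subpage, vma_id and the function
name are hypothetical, and the model assumes the result of the rmap walk
has already been collected into a per-page array with a single mapping
per page:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page result of an rmap walk; not a kernel structure. */
struct model_subpage {
    bool mapped; /* page is still mapped somewhere */
    int vma_id;  /* made-up identifier of the VMA mapping this page */
};

/*
 * Returns true if the deferred split could be skipped:
 * (a) every page of the folio is still mapped, and
 * (b) all pages are mapped by the same VMA.
 */
static bool model_can_skip_deferred_split(const struct model_subpage *pages,
                                          size_t nr_pages)
{
    assert(pages != NULL);
    if (nr_pages == 0)
        return false;
    for (size_t i = 0; i < nr_pages; i++) {
        if (!pages[i].mapped)
            return false; /* partially mapped: (a) violated */
        if (pages[i].vma_id != pages[0].vma_id)
            return false; /* crosses a VMA: (b) violated */
    }
    return true;
}
```

A real implementation would also have to handle folios that are mapped
by several processes, which this model deliberately ignores.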

-- 
Cheers,

David / dhildenb