Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space
 for 'size' and 'pfmemalloc'
To: Alexander H Duyck
Cc: Andrew Morton
References: <20240625135216.47007-1-linyunsheng@huawei.com>
 <20240625135216.47007-7-linyunsheng@huawei.com>
 <12a8b9ddbcb2da8431f77c5ec952ccfb2a77b7ec.camel@gmail.com>
 <808be796-6333-c116-6ecb-95a39f7ad76e@huawei.com>
From: Yunsheng Lin
Date: Wed, 3 Jul 2024 20:33:27 +0800

On 2024/7/2 22:55, Alexander H Duyck wrote:

...

>
>>>
>>>> +#define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(7, 0)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(8)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT	8
>>>> +
>>>> +static inline struct encoded_va *encode_aligned_va(void *va,
>>>> +						   unsigned int order,
>>>> +						   bool pfmemalloc)
>>>> +{
>>>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> -	__u16 offset;
>>>> -	__u16 size;
>>>> +	return (struct encoded_va *)((unsigned long)va | order |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>> #else
>>>> -	__u32 offset;
>>>> +	return (struct encoded_va *)((unsigned long)va |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>> +#endif
>>>> +}
>>>> +
>
> This is missing any and all protection of the example you cited. If I
> want to pass order as a 32b value I can, and I can corrupt the virtual
> address. Same thing with pfmemalloc. Let's only hope you don't hit an
> architecture where a bool is a u8 in size, as otherwise that shift is
> going to wipe out the value; and if it is a u32, which is usually the
> case, let's hope they stick to the values of 0 and 1.

I explicitly checked that the protection is not really needed, for
performance reasons:

1. For the 'pfmemalloc' part, it does always stick to the values of 0
   and 1, see:
   https://elixir.bootlin.com/linux/v6.10-rc6/source/Documentation/process/coding-style.rst#L1053
2. For the 'order' part, its range can only be within 0~3.
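That said, the two assumed ranges could be enforced with debug-only
assertions at no cost to release builds. Below is an untested sketch on
top of the quoted encode_aligned_va(); the two VM_BUG_ON() lines and the
explicit 'unsigned long' cast of 'pfmemalloc' are additions for
illustration, not part of the patch:

static inline struct encoded_va *encode_aligned_va(void *va,
						   unsigned int order,
						   bool pfmemalloc)
{
	/* catch callers passing an unaligned va or an out-of-range order */
	VM_BUG_ON(!PAGE_ALIGNED(va));
	VM_BUG_ON(order & ~PAGE_FRAG_CACHE_ORDER_MASK);

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
	return (struct encoded_va *)((unsigned long)va | order |
		(unsigned long)pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
#else
	return (struct encoded_va *)((unsigned long)va |
		(unsigned long)pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
#endif
}

VM_BUG_ON() compiles away unless CONFIG_DEBUG_VM is set, so the fast
path would be unchanged in production builds.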
>
>>>> +static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
>>>> +{
>>>> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> +	return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
>>>> +#else
>>>> +	return 0;
>>>> +#endif
>>>> +}
>>>> +
>>>> +static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
>>>> +{
>>>> +	return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
>>>> +}
>>>> +
>>>
>>> My advice is that if you just make encoded_va an unsigned long this
>>> just becomes some FIELD_GET and bit operations.
>>
>> As above.
>>
>
> The code you mentioned had one extra block of bits that was in it and
> had strict protections on what went into and out of those bits. You
> don't have any of those protections.

As above, the protection masking/checking is explicitly avoided due to
the performance consideration and the reasons given above for
encoded_va. But I guess it doesn't hurt to have a VM_BUG_ON() check to
catch possible future mistakes.

>
> I suggest you just use a long and don't bother storing this as a
> pointer.
>

...

>>>> -
>>>> +	remaining = nc->remaining & align_mask;
>>>> +	remaining -= fragsz;
>>>> +	if (unlikely(remaining < 0)) {
>>>
>>> Now this is just getting confusing. You essentially just added an
>>> additional addition step and went back to the countdown approach I was
>>> using before, except for the fact that you are starting at 0 whereas I
>>> was actually moving down through the page.
>>
>> Does the 'additional addition step' mean the additional step to
>> calculate the offset using the new 'remaining' field? I guess that is
>> the disadvantage of changing 'offset' to 'remaining', but it has some
>> advantages too:
>>
>> 1. By replacing 'offset' with 'remaining', which is the remaining size
>>    of the cache in a 'page_frag_cache' instance, we are able to do a
>>    single 'fragsz > remaining' check for the case of the cache not
>>    being big enough, which should be the fast path if we ensure size
>>    is zero when 'va' == NULL by memset'ing 'struct page_frag_cache' in
>>    page_frag_cache_init() and page_frag_cache_drain().
>> 2. It seems more convenient to implement the commit/probe API too when
>>    using 'remaining' instead of 'offset', as those APIs need the
>>    remaining size of the page_frag_cache anyway.
>>
>> So it is really a trade-off between using 'offset' and 'remaining'; it
>> is like the similar argument about the trade-off between allocating
>> fragments the 'countdown' way and the 'countup' way.
>>
>> About the confusing part: nc->remaining does mean how much cache is
>> left in the 'nc', and it does go from PAGE_FRAG_CACHE_MAX_SIZE or
>> PAGE_SIZE down to zero naturally, if that is what you meant by
>> 'countdown'; but it is different from the 'offset countdown' before
>> this patchset, as I understand it.
>>
>>>
>>> What I would suggest doing since "remaining" is a negative offset
>>> anyway would be to look at just storing it as a signed negative number.
>>> At least with that you can keep to your original approach and would
>>> only have to change your check to be for "remaining + fragsz <= 0".
>>
>> Did you mean using s16/s32 for 'remaining', and setting nc->remaining
>> like below?
>> nc->remaining = -PAGE_SIZE or
>> nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE
>
> Yes. Basically if we used it as a signed value then you could just work
> your way up and do your aligned math still.

For the aligned math, I am not sure how 'align_mask' can be applied to a
signed value yet. It seems that after masking, nc->remaining leans
towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of towards zero for
a signed value, for example:

nc->remaining = -4094;
nc->remaining &= -64;

It seems we get -4096 for the above; is that what we are expecting?
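A quick userspace check (plain C, just to verify the two's-complement
arithmetic; not code from the patch) confirms that rounding:

#include <stdio.h>

int main(void)
{
	int remaining = -4094;	/* 0xfffff002 in 32-bit two's complement */

	remaining &= -64;	/* -64 is 0xffffffc0, clearing the low 6 bits */
	printf("%d\n", remaining);	/* prints -4096 */
	return 0;
}

So a plain '& align_mask' on a negative 'remaining' rounds away from
zero, which would seem to move the implied offset (size + remaining)
down towards the start of the page rather than up to the next aligned
boundary.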
>
>> struct page_frag_cache {
>> 	struct encoded_va *encoded_va;
>>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
>> 	u16 pagecnt_bias;
>> 	s16 remaining;
>> #else
>> 	u32 pagecnt_bias;
>> 	s32 remaining;
>> #endif
>> };
>>
>> If I understand the above correctly, it seems we really need a better
>> name than 'remaining' to reflect that.
>
> It would effectively work like a countdown. Instead of T - X, in this
> case it is size - X.
>
>>> With that you can still do your math but it becomes an addition instead
>>> of a subtraction.
>>
>> And I am not really sure what the gain is here by using an addition
>> instead of a subtraction.
>>
>
> The "remaining" as it currently stands is doing the same thing, so odds
> are it isn't too big a deal. It is just that the original code was
> already somewhat confusing and this is just making it that much more
> complex.
>
>>>> +	page = virt_to_page(encoded_va);
>>>> 	if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>>>> 		goto refill;
>>>>
>>>> -	if (unlikely(nc->pfmemalloc)) {
>>>> -		free_unref_page(page, compound_order(page));
>>>> +	if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
>>>> +		VM_BUG_ON(compound_order(page) !=
>>>> +			  encoded_page_order(encoded_va));
>>>> +		free_unref_page(page, encoded_page_order(encoded_va));
>>>> 		goto refill;
>>>> 	}
>>>>
>>>> 	/* OK, page count is 0, we can safely set it */
>>>> 	set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>>>>
>>>> -	/* reset page count bias and offset to start of new frag */
>>>> +	/* reset page count bias and remaining of new frag */
>>>> 	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>>>> -	offset = 0;
>>>> -	if (unlikely(fragsz > PAGE_SIZE)) {
>>>> +	nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
>>>> +	remaining -= fragsz;
>>>> +	if (unlikely(remaining < 0)) {
>>>> 		/*
>>>> 		 * The caller is trying to allocate a fragment
>>>> 		 * with fragsz > PAGE_SIZE but the cache isn't big
>>>
>>> I find it really amusing that you went to all the trouble of flipping
>>> the logic just to flip it back to being a countdown setup. If you were
>>> going to bother with all that then why not just make the remaining
>>> negative instead? You could save yourself a ton of trouble that way and
>>> all you would need to do is flip a few signs.
>>
>> I am not sure I understand the 'a ton of trouble' part here; if
>> 'flipping a few signs' does save a ton of trouble, I would like to
>> avoid it, but I do not really understand the gain here yet, as
>> mentioned above.
>
> It isn't about flipping the signs. It is more the fact that the logic
> has now become even more complex than it originally was. With my work
> backwards approach the advantage was that the offset could be updated
> and then we just recorded everything and reported it up. Now we have to

I am supposing the above is referring to the 'offset countdown' way
before this patchset, right?

> keep a temporary "remaining" value, generate our virtual address and
> store that to a temp variable, we can record the new "remaining" value,
> and then we can report the virtual address. If we get the ordering on

Yes, I noticed it when coding too; that is why the current virtual
address is generated in page_frag_cache_current_va() based on
nc->remaining instead of the local variable 'remaining', before the
local variable 'remaining' is assigned to nc->remaining.

But I am not sure I understand how using a signed negative number for
'remaining' would change the above steps. If it does not, I still fail
to see the gain of using a signed negative number for 'nc->remaining'.

> the steps 2 and 3 in this it will cause issues as we will be pointing
> to the wrong values. It is something I can see someone easily messing
> up.

Yes, agreed. It would be good to be more specific about how to avoid the
above problem using a signed negative number for 'remaining', as I am
not sure how it can be done yet.
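For reference, here is one untested sketch of how a signed countdown
might work end to end. Everything below is hypothetical illustration,
not code from the patch: the function name is invented, and
encoded_page_address() is a made-up helper for recovering the base
virtual address. The align-up is done by negating before masking, and
'va' is derived from the aligned local value before nc->remaining is
written back, so steps 2 and 3 cannot be reordered by mistake:

static void *page_frag_alloc_align_sketch(struct page_frag_cache *nc,
					  unsigned int fragsz,
					  unsigned int align)
{
	/* nc->remaining runs from -size up towards 0, so the implied
	 * offset into the page is size + remaining; align the offset
	 * up by negating, masking down, and negating back
	 */
	s32 align_mask = ~(s32)(align - 1);
	s32 aligned_remaining = -(-nc->remaining & align_mask);
	s32 remaining = aligned_remaining + (s32)fragsz;
	s32 offset;
	void *va;

	if (unlikely(remaining > 0))
		return NULL;	/* not enough space left, refill needed */

	/* derive va from the aligned local value first, then write the
	 * new countdown back, so the update order cannot go wrong
	 */
	offset = (s32)page_frag_cache_page_size(nc->encoded_va) +
		 aligned_remaining;
	va = encoded_page_address(nc->encoded_va) + offset;
	nc->remaining = remaining;
	nc->pagecnt_bias--;

	return va;
}

With nc->remaining = -4094 and align = 64 this gives aligned_remaining =
-4032, i.e. an offset of 64, instead of the -4096/offset-0 result of the
plain masking above; whether the extra negations are acceptable for the
fast path is a separate question.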