From: Hans Holmberg
Date: Tue, 28 Jan 2025 09:45:30 +0100
Subject: Re: [RFC PATCH] Introduce generalized data temperature estimation framework
To: Viacheslav Dubeyko
Cc: "Johannes.Thumshirn@wdc.com", "linux-mm@kvack.org", "slava@dubeyko.com", "linux-fsdevel@vger.kernel.org", "linux-block@vger.kernel.org", "javier.gonz@samsung.com"
References: <20250123202455.11338-1-slava@dubeyko.com> <063856b9c67289b1dd979a12c8cfe8d203786acc.camel@ibm.com>
In-Reply-To: <063856b9c67289b1dd979a12c8cfe8d203786acc.camel@ibm.com>

On Mon, Jan 27, 2025 at 9:59 PM Viacheslav Dubeyko wrote:
>
> On Mon, 2025-01-27 at 15:19 +0100, Hans Holmberg wrote:
> > On Fri, Jan 24, 2025 at 10:03 PM Viacheslav Dubeyko
> > wrote:
> > >
> > > On Fri, 2025-01-24 at 08:19 +0000, Johannes Thumshirn wrote:
> > > > On 23.01.25 21:30, Viacheslav Dubeyko wrote:
> > > > > [PROBLEM DECLARATION]
> > > > > Efficient data placement policy is a Holy Grail for data
> > > > > storage and file system engineers.
> > > > > Achieving this goal is
> > > > > equally important and really hard. Multiple data storage
> > > > > and file system technologies have been invented to manage
> > > > > the data placement policy (for example, COW, ZNS, FDP, etc).
> > > > > But these technologies still require the hints related to
> > > > > nature of data from application side.
> > > > >
> > > > > [DATA "TEMPERATURE" CONCEPT]
> > > > > One of the widely used and intuitively clear idea of data
> > > > > nature definition is data "temperature" (cold, warm,
> > > > > hot data). However, data "temperature" is as intuitively
> > > > > sound as illusive definition of data nature. Generally
> > > > > speaking, thermodynamics defines temperature as a way
> > > > > to estimate the average kinetic energy of vibrating
> > > > > atoms in a substance. But we cannot see a direct analogy
> > > > > between data "temperature" and temperature in physics
> > > > > because data is not something that has kinetic energy.
> > > > >
> > > > > [WHAT IS GENERALIZED DATA "TEMPERATURE" ESTIMATION]
> > > > > We usually imply that if some data is updated more
> > > > > frequently, then such data is more hot than other one.
> > > > > But, it is possible to see several problems here:
> > > > > (1) How can we estimate the data "hotness" in
> > > > > quantitative way? (2) We can state that data is "hot"
> > > > > after some number of updates. It means that this
> > > > > definition implies state of the data in the past.
> > > > > Will this data continue to be "hot" in the future?
> > > > > Generally speaking, the crucial problem is how to define
> > > > > the data nature or data "temperature" in the future.
> > > > > Because, this knowledge is the fundamental basis for
> > > > > elaboration an efficient data placement policy.
> > > > > Generalized data "temperature" estimation framework
> > > > > suggests the way to define a future state of the data
> > > > > and the basis for quantitative measurement of data
> > > > > "temperature".
> > > > >
> > > > > [ARCHITECTURE OF FRAMEWORK]
> > > > > Usually, file system has a page cache for every inode. And
> > > > > initially memory pages become dirty in page cache. Finally,
> > > > > dirty pages will be sent to storage device. Technically
> > > > > speaking, the number of dirty pages in a particular page
> > > > > cache is the quantitative measurement of current "hotness"
> > > > > of a file. But number of dirty pages is still not stable
> > > > > basis for quantitative measurement of data "temperature".
> > > > > It is possible to suggest of using the total number of
> > > > > logical blocks in a file as a unit of one degree of data
> > > > > "temperature". As a result, if the whole file was updated
> > > > > several times, then "temperature" of the file has been
> > > > > increased for several degrees. And if the file is under
> > > > > continuous updates, then the file "temperature" is growing.
> > > > >
> > > > > We need to keep not only current number of dirty pages,
> > > > > but also the number of updated pages in the near past
> > > > > for accumulating the total "temperature" of a file.
> > > > > Generally speaking, total number of updated pages in the
> > > > > nearest past defines the aggregated "temperature" of file.
> > > > > And number of dirty pages defines the delta of
> > > > > "temperature" growth for current update operation.
> > > > > This approach defines the mechanism of "temperature" growth.
> > > > >
> > > > > But if we have no more updates for the file, then
> > > > > "temperature" needs to decrease. Starting and ending
> > > > > timestamps of update operation can work as a basis for
> > > > > decreasing "temperature" of a file. If we know the number
> > > > > of updated logical blocks of the file, then we can divide
> > > > > the duration of update operation on number of updated
> > > > > logical blocks. As a result, this is the way to define
> > > > > a time duration per one logical block.
> > > > > By means of
> > > > > multiplying this value (time duration per one logical
> > > > > block) on total number of logical blocks in file, we
> > > > > can calculate the time duration of "temperature"
> > > > > decreasing for one degree. Finally, the operation of
> > > > > division the time range (between end of last update
> > > > > operation and begin of new update operation) on
> > > > > the time duration of "temperature" decreasing for
> > > > > one degree provides the way to define how many
> > > > > degrees should be subtracted from current "temperature"
> > > > > of the file.
> > > > >
> > > > > [HOW TO USE THE APPROACH]
> > > > > The lifetime of data "temperature" value for a file
> > > > > can be explained by steps: (1) iget() method sets
> > > > > the data "temperature" object; (2) folio_account_dirtied()
> > > > > method accounts the number of dirty memory pages and
> > > > > tries to estimate the current temperature of the file;
> > > > > (3) folio_clear_dirty_for_io() decrease number of dirty
> > > > > memory pages and increases number of updated pages;
> > > > > (4) folio_account_dirtied() also decreases file's
> > > > > "temperature" if updates hasn't happened some time;
> > > > > (5) file system can get file's temperature and
> > > > > to share the hint with block layer; (6) inode
> > > > > eviction method removes and free the data "temperature"
> > > > > object.
> > > >
> > > > I don't want to pour gasoline on old flame wars, but what is the
> > > > advantage of this auto-magic data temperature framework vs the existing
> > > > framework?
> > > >
> > >
> > > There is no magic in this framework. :) It's simple and compact framework.
> > >
> > > > 'enum rw_hint' has temperature in the range of none, short,
> > > > medium, long and extreme (what ever that means), can be set by an
> > > > application via an fcntl() and is plumbed down all the way to the bio
> > > > level by most FSes that care.
> > >
> > > I see your point.
> > > But the 'enum rw_hint' defines qualitative grades again:
> > >
> > > enum rw_hint {
> > >         WRITE_LIFE_NOT_SET      = RWH_WRITE_LIFE_NOT_SET,
> > >         WRITE_LIFE_NONE         = RWH_WRITE_LIFE_NONE,
> > >         WRITE_LIFE_SHORT        = RWH_WRITE_LIFE_SHORT,   <-- HOT data
> > >         WRITE_LIFE_MEDIUM       = RWH_WRITE_LIFE_MEDIUM,  <-- WARM data
> > >         WRITE_LIFE_LONG         = RWH_WRITE_LIFE_LONG,    <-- COLD data
> > >         WRITE_LIFE_EXTREME      = RWH_WRITE_LIFE_EXTREME,
> > > } __packed;
> > >
> > > First of all, again, it's hard to compare the hotness of different files
> > > on such qualitative basis. Secondly, who decides what is hotness of a particular
> > > data? People can only guess or assume the nature of data based on
> > > experience in the past. But workloads are changing and evolving
> > > continuously and in real-time manner. Technically speaking, application can
> > > try to estimate the hotness of data, but, again, file system can receive
> > > requests from multiple threads and multiple applications. So, application
> > > can guess about real nature of data too. Especially, nobody would like
> > > to implement dedicated logic in application for data hotness estimation.
> > >
> > > This framework is inode based and it tries to estimate file's
> > > "temperature" on quantitative basis. Advantages of this framework:
> > > (1) we don't need to guess about data hotness, temperature will be
> > > calculated quantitatively; (2) quantitative basis gives opportunity
> > > for fair comparison of different files' temperature; (3) file's temperature
> > > will change with workload(s) changing in real-time; (4) file's
> > > temperature will be correctly accounted under the load from multiple
> > > applications. I believe these are advantages of the suggested framework.
> > >
> >
> > While I think the general idea (using file-overwrite-rates as a
> > parameter when doing data placement) could be useful, it could not
> > replace the user space hinting we already have.
> >
> > Applications (e.g. RocksDB) doing sequential writes to files that are
> > immutable until deleted (no overwrites) would not benefit. We need user
> > space help to estimate data lifetime for those workloads and the
> > relative write lifetime hints are useful for that.
> >
>
> I don't see any competition or conflict here. Suggested approach and user-space
> hinting could be complementary techniques. If user-space logic would like to use
> a special data placement policy, then it can share hints in its own way. But,
> potentially, suggested approach of temperature calculation can be used to check
> the effectiveness of the user-space hinting, and, maybe, correcting it. So, I
> don't see any conflict here.

I don't see a conflict here either; my point is just that this
framework cannot replace the user hints.

>
> > So what I am asking myself is if this framework is added, who would
> > benefit? Without any benchmark results it's a bit hard to tell :)
> >
>
> Which benefits would you like to see? I assume we would like: (1) prolong device
> lifetime, (2) improve performance, (3) decrease GC burden. Do you mean these
> benefits?

Yep, decreased write amplification essentially.

>
> As far as I can see, different file systems can use temperature in different
> way. And this is slightly complicates the benchmarking. So, how can we define
> the effectiveness here and how can we measure it? Do you have a vision here? I
> am happy to make more benchmarking.
>
> My point is that the calculated file's temperature gives the quantitative way to
> distribute even user data among several temperature groups ("baskets"). And
> these baskets/segments/anything-else gives the way to properly group data. File
> systems can employ the temperature in various ways, but it can definitely helps
> to elaborate proper data placement policy. As a result, GC burden can be
> decreased, performance can be improved, and lifetime device can be prolong.
> So,
> how can we benchmark these points? And which approaches make sense to compare?
>

To start off, it would be nice to demonstrate that write amplification
decreases for some workload when the temperature is taken into
account. It would be great if the workload were an actual application
workload or a synthetic one mimicking some real-world-like use case.
Run the same workload twice, with and without the temperature taken
into account, measure write amplification and compare results.

What user workloads do you see benefiting from this framework? Which would not?

> > Also, is there a good reason for only supporting buffered io? Direct
> > IO could benefit in the same way, right?
> >
>
> I think that Direct IO could benefit too. The question here how to account dirty
> memory pages and updated memory pages. Currently, I am using
> folio_account_dirtied() and folio_clear_dirty_for_io() to implement the
> calculation the temperature. As far as I can see, Direct IO requires another
> methods of doing this. The rest logic can be the same.

It's probably a good idea to cover direct IO as well then, since this
is intended to be a generalized framework.
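
For reference, the growth and decay rules quoted above can be sketched
in plain user-space C. This is only one reading of the RFC text, not
code from the patch: struct file_temp, its fields and the temp_*()
helpers are hypothetical names, time is assumed to be in nanoseconds,
and the clamp on ns_per_degree is an added assumption to avoid a
division by zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-file state; names are illustrative, not from the patch. */
struct file_temp {
	uint64_t total_blocks;     /* logical blocks in the file = one degree */
	uint64_t updated_blocks;   /* blocks updated in the near past */
	uint64_t last_blocks;      /* blocks updated by the last operation */
	uint64_t last_duration_ns; /* duration of the last update operation */
	uint64_t last_end_ns;      /* when the last update operation ended */
};

/* Temperature in degrees: whole-file rewrites accumulated so far. */
static uint64_t temp_degrees(const struct file_temp *t)
{
	return t->total_blocks ? t->updated_blocks / t->total_blocks : 0;
}

/* Growth: account one finished update operation. */
static void temp_account_update(struct file_temp *t, uint64_t blocks,
				uint64_t start_ns, uint64_t end_ns)
{
	assert(end_ns >= start_ns);
	t->updated_blocks += blocks;
	t->last_blocks = blocks;
	t->last_duration_ns = end_ns - start_ns;
	t->last_end_ns = end_ns;
}

/* Decay: subtract the degrees lost while the file sat idle until now_ns. */
static void temp_cool(struct file_temp *t, uint64_t now_ns)
{
	uint64_t ns_per_degree, lost;

	if (!t->last_blocks || !t->total_blocks || now_ns <= t->last_end_ns)
		return;

	/*
	 * (duration / updated blocks) * total blocks, multiplying first
	 * to limit integer truncation. Overflow is ignored in this sketch.
	 */
	ns_per_degree = t->last_duration_ns * t->total_blocks / t->last_blocks;
	if (!ns_per_degree)
		ns_per_degree = 1; /* assumption: clamp instead of dividing by 0 */

	lost = (now_ns - t->last_end_ns) / ns_per_degree;
	if (lost > temp_degrees(t))
		t->updated_blocks = 0;
	else
		t->updated_blocks -= lost * t->total_blocks;

	/* Keep the idle time that did not add up to a whole degree. */
	t->last_end_ns += lost * ns_per_degree;
}
```

With a 100-block file, one update of 300 blocks that takes 3000 ns
gives 3 degrees and a cooling period of 1000 ns per degree, so 2000 ns
of idle time brings the file back down to 1 degree.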