Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Amir Goldstein <amir73il@gmail.com>
To: Omar Sandoval <osandov@osandov.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@infradead.org>,
	Dave Chinner <david@fromorbit.com>, Jann Horn <jannh@google.com>,
	Aleksa Sarai <cyphar@cyphar.com>,
	Linux API <linux-api@vger.kernel.org>,
	kernel-team@fb.com, Michael Kerrisk <mtk.manpages@gmail.com>,
	linux-man <linux-man@vger.kernel.org>
Subject: Re: [PATCH man-pages v5] Document encoded I/O
Date: Fri, 21 Aug 2020 12:24:48 +0300	[thread overview]
Message-ID: <CAOQ4uxgEpYqQ9MeuS=76tOtjFCrL8urkDoPoHxu+A5s4C2HGRA@mail.gmail.com> (raw)
In-Reply-To: <64cc229872230dc6998a3dbf2264513870a8a6f6.1597994017.git.osandov@osandov.com>

On Fri, Aug 21, 2020 at 10:38 AM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This adds a new page, encoded_io(7), providing an overview of encoded
> I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> reference it.
>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: linux-man <linux-man@vger.kernel.org>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---

Omar,

Thanks for making the clarifications. Some questions below.

[...]

> +.PP
> +As the filesystem page cache typically contains decoded data,
> +encoded I/O bypasses the page cache.
> +.SS Extent layout
> +By using
> +.IR len ,
> +.IR unencoded_len ,
> +and
> +.IR unencoded_offset ,
> +it is possible to refer to a subset of an unencoded extent.
> +.PP
> +In the simplest case,
> +.I len
> +is equal to
> +.I unencoded_len
> +and
> +.I unencoded_offset
> +is zero.
> +This means that the entire unencoded extent is used.
> +.PP
> +However, suppose we read 50 bytes into a file
> +which contains a single compressed extent.
> +The filesystem must still return the entire compressed extent
> +for us to be able to decompress it,
> +so
> +.I unencoded_len
> +would be the length of the entire decompressed extent.
> +However, because the read was at offset 50,
> +the first 50 bytes should be ignored.
> +Therefore,
> +.I unencoded_offset
> +would be 50,
> +and
> +.I len
> +would accordingly be
> +.IR unencoded_len\ -\ 50 .
> +.PP
> +Additionally, suppose we want to create an encrypted file with length 500,
> +but the file is encrypted with a block cipher using a block size of 4096.
> +The unencoded data would therefore include the appropriate padding,
> +and
> +.I unencoded_len
> +would be 4096.
> +However, to represent the logical size of the file,
> +.I len
> +would be 500
> +(and
> +.I unencoded_offset
> +would be 0).
> +.PP
> +Similar situations can arise in other cases:
> +.IP * 3
> +If the filesystem pads data to the filesystem block size before compressing,
> +then compressed files with a size unaligned to the filesystem block size will
> +end with an extent with
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +Extents cloned from the middle of a larger encoded extent with
> +.B FICLONERANGE
> +may have a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.IR unencoded_len .
> +.IP *
> +If the middle of an encoded extent is overwritten,
> +the filesystem may create extents with a non-zero
> +.I unencoded_offset
> +and/or
> +.I len
> +<
> +.I unencoded_len
> +for the parts that were not overwritten.

So in this case, would the reader be getting extents "out of unencoded order"?
e.g. unencoded range 0..4096 and then unencoded range 10..20?
Or would reader be reading the encoded full block twice, once for
ragne 0..10 and once for range 20..4096?



> +.SS Security
> +Encoded I/O creates the potential for some security issues:
> +.IP * 3
> +Encoded writes allow writing arbitrary data which the kernel will decode on
> +a subsequent read. Decompression algorithms are complex and may have bugs
> +which can be exploited by maliciously crafted data.
> +.IP *
> +Encoded reads may return data which is not logically present in the file
> +(see the discussion of
> +.I len
> +vs.
> +.I unencoded_len
> +above).
> +It may not be intended for this data to be readable.
> +.PP
> +Therefore, encoded I/O requires privilege.
> +Namely, the
> +.B RWF_ENCODED
> +flag may only be used when the file was opened with the
> +.B O_ALLOW_ENCODED
> +flag to
> +.BR open (2),
> +which requires the
> +.B CAP_SYS_ADMIN
> +capability.
> +.B O_ALLOW_ENCODED
> +may be set and cleared with
> +.BR fcntl (2).
> +Note that it is not cleared on
> +.BR fork (2)
> +or
> +.BR execve (2);
> +one may wish to use
> +.B O_CLOEXEC
> +with
> +.BR O_ALLOW_ENCODED .
> +.SS Filesystem support
> +Encoded I/O is supported on the following filesystems:
> +.TP
> +Btrfs (since Linux 5.10)
> +.IP
> +Btrfs supports encoded reads and writes of compressed data.
> +The data is encoded as follows:
> +.RS
> +.IP * 3
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZLIB ,
> +then the encoded data is a single zlib stream.
> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_LZO ,
> +then the encoded data is compressed page by page with LZO1X
> +and wrapped in the format documented in the Linux kernel source file
> +.IR fs/btrfs/lzo.c .

:-/ So maybe call it ENCODED_IOV_COMPRESSION_BTRFS_LZO?

I understand why you want the encoding format not to be opaque, but
I imagine the encoded data is not going to be migrated as is between
different filesystems. So just call it for what it is - a private
filesystem encoding
format. If you have a format that is standard and other filesystems are likely
to use, fine, but let's not make an API that discourages using
"private" encoding, just for the sake of it and make life harder for no good
reason.

All the reader of this man page may be interested to know is which
filesystems are expected to support which encoding types and a general
description of what they mean (as you did).
Making this page wrongly appear as a standard for encoding formats is not
going to play out well...

> +.IP *
> +If
> +.I compression
> +is
> +.BR ENCODED_IOV_COMPRESSION_ZSTD ,
> +then the encoded data is a single zstd frame compressed with the
> +.I windowLog
> +compression parameter set to no more than 17.

Even that small detail is a bit limiting to filesystems and should
therefore be tagged as a private btrfs encoding IMO.

Thanks,
Amir.

  reply	other threads:[~2020-08-21  9:25 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-21  7:38 [PATCH v5 0/9] fs: interface for directly reading/writing compressed data Omar Sandoval
2020-08-21  7:38 ` [PATCH man-pages v5] Document encoded I/O Omar Sandoval
2020-08-21  9:24   ` Amir Goldstein [this message]
2020-08-24 18:15     ` Omar Sandoval
2020-08-21  7:38 ` [PATCH v5 1/9] iov_iter: add copy_struct_from_iter() Omar Sandoval
2020-08-24 18:52   ` Josef Bacik
2020-08-24 21:09     ` Omar Sandoval
2020-08-21  7:38 ` [PATCH v5 2/9] fs: add O_ALLOW_ENCODED open flag Omar Sandoval
2020-08-24 18:28   ` Josef Bacik
2020-08-24 21:11     ` Omar Sandoval
2020-08-21  7:38 ` [PATCH v5 3/9] fs: add RWF_ENCODED for reading/writing compressed data Omar Sandoval
2020-08-21  8:47   ` Amir Goldstein
2020-08-24 23:49     ` Omar Sandoval
2020-08-25  8:25       ` Amir Goldstein
2020-08-25 17:20         ` Omar Sandoval
2020-08-24 19:07   ` Josef Bacik
2020-08-21  7:38 ` [PATCH v5 4/9] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Omar Sandoval
2020-08-24 19:17   ` Josef Bacik
2020-08-21  7:38 ` [PATCH v5 5/9] btrfs: add ram_bytes and offset to btrfs_ordered_extent Omar Sandoval
2020-08-24 19:23   ` Josef Bacik
2020-08-21  7:38 ` [PATCH v5 6/9] btrfs: support different disk extent size for delalloc Omar Sandoval
2020-08-24 19:26   ` Josef Bacik
2020-08-21  7:38 ` [PATCH v5 7/9] btrfs: optionally extend i_size in cow_file_range_inline() Omar Sandoval
2020-08-24 19:33   ` Josef Bacik
2020-08-21  7:38 ` [PATCH v5 8/9] btrfs: implement RWF_ENCODED reads Omar Sandoval
2020-08-24 19:54   ` Josef Bacik
2020-08-24 21:23     ` Omar Sandoval
2020-08-21  7:38 ` [PATCH v5 9/9] btrfs: implement RWF_ENCODED writes Omar Sandoval
2020-08-24 20:30   ` Josef Bacik
2020-08-24 21:30     ` Omar Sandoval

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAOQ4uxgEpYqQ9MeuS=76tOtjFCrL8urkDoPoHxu+A5s4C2HGRA@mail.gmail.com' \
    --to=amir73il@gmail.com \
    --cc=cyphar@cyphar.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jannh@google.com \
    --cc=kernel-team@fb.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-man@vger.kernel.org \
    --cc=mtk.manpages@gmail.com \
    --cc=osandov@osandov.com \
    --cc=viro@zeniv.linux.org.uk \
    --subject='Re: [PATCH man-pages v5] Document encoded I/O' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).