Linux-Fsdevel Archive on lore.kernel.org
From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
To: tytso@mit.edu, david@fromorbit.com, bpm@sgi.com, olaf@sgi.com
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	kernel@lists.collabora.co.uk, alvaro.soliverez@collabora.co.uk,
	Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Subject: [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4
Date: Fri, 12 Jan 2018 05:12:21 -0200
Message-ID: <20180112071234.29470-1-krisman@collabora.co.uk>

Hi,

In the past few months, I've been working to support case-insensitive
lookups of UTF-8 encoded strings, primarily for EXT4, and then for other
filesystems.  This RFC uses the awesome UTF-8 normalization
implementation done by the SGI guys in 2014, namely Olaf Weber and Ben
Myers, which, unfortunately, never went upstream.  That SGI effort
consisted of 3 versions of an RFC submitted to this list, and the last
version is archived below:

https://www.spinics.net/lists/xfs/msg30069.html

For normalization support, I basically rebased those patches and
addressed the issues that were raised on the list at that time.  I also
implemented an extension to do some testing of the exported functions in
kernelspace, to make sure we can catch regressions early.  Obviously,
more tests are needed, particularly for Hangul algorithmic decomposition.
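
To illustrate what such a test has to cover: Hangul syllables are
decomposed arithmetically rather than through a table lookup.  The
userspace sketch below follows the arithmetic given in the Unicode
Standard.  The helper name hangul_decompose() and the macro names are
made up for this example and are not taken from the patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable decomposition. */
#define HANGUL_SBASE  0xAC00u
#define HANGUL_LBASE  0x1100u
#define HANGUL_VBASE  0x1161u
#define HANGUL_TBASE  0x11A7u
#define HANGUL_VCOUNT 21u
#define HANGUL_TCOUNT 28u
#define HANGUL_NCOUNT (HANGUL_VCOUNT * HANGUL_TCOUNT)	/* 588 */
#define HANGUL_SCOUNT (19u * HANGUL_NCOUNT)		/* 11172 */

/*
 * Hypothetical helper: decompose one precomposed Hangul syllable into
 * its jamo.  Returns the number of code points written to out[] (2 or
 * 3), or 0 if cp is not a precomposed syllable.
 */
static size_t hangul_decompose(uint32_t cp, uint32_t out[3])
{
	uint32_t sindex, tindex;

	if (cp < HANGUL_SBASE || cp >= HANGUL_SBASE + HANGUL_SCOUNT)
		return 0;

	sindex = cp - HANGUL_SBASE;
	out[0] = HANGUL_LBASE + sindex / HANGUL_NCOUNT;
	out[1] = HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT;

	tindex = sindex % HANGUL_TCOUNT;
	if (tindex == 0)
		return 2;	/* LV syllable, no trailing consonant */

	out[2] = HANGUL_TBASE + tindex;
	return 3;
}
```

A regression test would want both an LV syllable (two jamo) and an LVT
syllable (three jamo), plus the first and last code points of the
syllable block.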

Like the original submission from Ben, I excluded the commit that
includes the generated header file and unicode files because they are
too big and would bounce the list.  Instead, instructions on fetching
and generating the files are documented in the commit message.

An important difference from the original SGI patches is that I have
introduced a midlayer API between the low-level normalization code and
the filesystem code.  The goal is to hide implementation details
behind a simpler interface of strncmp()/strcasecmp()-like functions,
as well as a more specific casefold() operation, which implements the
behavior defined by the Unicode spec.  This reduces filesystem changes
to a minimum.  As a quick example, the fs code can load a struct
charset, which is selected by the encoding mount parameter or sb
information, and then call the helpers charset_strncmp() or
charset_strncasecmp() when matching names.
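
For readers who have not opened the patches, here is a minimal
userspace sketch of that layering.  Only the names struct charset,
charset_strncmp and charset_strncasecmp come from the text above.  The
argument lists, the ops table, the return convention and the ASCII
backend are assumptions made for illustration, not the signatures used
in the series.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct charset;

/* Assumed per-encoding ops; both return 0 on a match, non-zero otherwise. */
struct charset_ops {
	int (*cmp)(const struct charset *cs, const char *a, size_t alen,
		   const char *b, size_t blen);
	int (*casecmp)(const struct charset *cs, const char *a, size_t alen,
		       const char *b, size_t blen);
};

struct charset {
	const char *name;
	const struct charset_ops *ops;
};

/* Trivial ASCII backend, standing in for a real encoding module. */
static int ascii_cmp(const struct charset *cs, const char *a, size_t alen,
		     const char *b, size_t blen)
{
	(void)cs;
	return alen != blen || memcmp(a, b, alen) != 0;
}

static int ascii_casecmp(const struct charset *cs, const char *a, size_t alen,
			 const char *b, size_t blen)
{
	size_t i;

	(void)cs;
	if (alen != blen)
		return 1;
	for (i = 0; i < alen; i++)
		if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
			return 1;
	return 0;
}

static const struct charset_ops ascii_ops = {
	.cmp = ascii_cmp,
	.casecmp = ascii_casecmp,
};
static const struct charset ascii_charset = { .name = "ascii", .ops = &ascii_ops };

/* Midlayer helpers: the filesystem never sees the backend directly. */
static int charset_strncmp(const struct charset *cs, const char *a, size_t alen,
			   const char *b, size_t blen)
{
	return cs->ops->cmp(cs, a, alen, b, blen);
}

static int charset_strncasecmp(const struct charset *cs, const char *a,
			       size_t alen, const char *b, size_t blen)
{
	return cs->ops->casecmp(cs, a, alen, b, blen);
}

/* What a filesystem's name-matching hook might reduce to (hypothetical). */
static int fs_name_matches(const struct charset *cs, int ignorecase,
			   const char *disk, size_t dlen,
			   const char *lookup, size_t llen)
{
	if (ignorecase)
		return charset_strncasecmp(cs, disk, dlen, lookup, llen) == 0;
	return charset_strncmp(cs, disk, dlen, lookup, llen) == 0;
}
```

With this shape, switching a mount from ASCII to UTF-8 only changes
which struct charset the filesystem holds; the matching code itself
stays the same.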

This implementation has an obvious intersection with the NLS code
already in the kernel.  It has a few differences, though, like
implementing some higher-level functions instead of toupper/tolower
functions, which are not enough for full caseless comparison, and it
also supports versioning of the encoding, which is required to ensure
stability of case-folding operations.  If the community understands we
should merge these changes back to the NLS code, I can work on it, but
it would require some reworking of how the NLS system is implemented.
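
To make the toupper/tolower limitation concrete: Unicode full case
folding maps U+00DF (LATIN SMALL LETTER SHARP S) to the two letters
"ss", which a one-character-in, one-character-out function cannot
express.  The toy below handles only ASCII letters and that single
code point.  toy_casefold() is an invented name and has nothing to do
with the casefold() operation in the patches.

```c
#include <stddef.h>

/*
 * Toy case folding for illustration only.  Folds ASCII letters to
 * lower case and the UTF-8 sequence for U+00DF (0xC3 0x9F) to "ss";
 * every other byte is copied unchanged.
 * Returns the folded length, or -1 if out[] is too small.
 */
static long toy_casefold(const char *in, size_t len, char *out, size_t outsz)
{
	size_t i = 0, o = 0;

	while (i < len) {
		unsigned char c = (unsigned char)in[i];

		if (c == 0xC3 && i + 1 < len &&
		    (unsigned char)in[i + 1] == 0x9F) {
			/* One code point folds to two letters. */
			if (o + 2 > outsz)
				return -1;
			out[o++] = 's';
			out[o++] = 's';
			i += 2;
			continue;
		}

		if (o + 1 > outsz)
			return -1;
		out[o++] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : (char)c;
		i++;
	}
	return (long)o;
}
```

Folding "Straße" and "STRASSE" through this function yields the same
bytes, so a comparison of the folded strings matches even though no
per-character mapping relates the two inputs.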

The charsets code doesn't do any locking on the module or refcount the
registered encoding modules yet.  I was assuming I would be asked to
merge it into NLS, so I would rather discuss this change first than
polish final details in advance.

The ext4 case-insensitive lookup doesn't require any on-disk changes.
It has a performance hit for huge directories, since if the lookup
doesn't use the exact case, we will fall back to a linear search.  This
is a performance problem, but it feels acceptable for now.
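
The trade-off can be sketched with a toy directory in userspace.  The
fast path probes a small hash table keyed on the exact bytes of the
name; only when that misses and case-insensitive matching is enabled
does the lookup scan every entry.  This is not ext4's htree code: the
toy_* names, the open-addressing table and the use of POSIX
strcasecmp() as a stand-in for a casefolded comparison are all
assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strcasecmp() (POSIX), ASCII-only stand-in */

#define TOY_DIR_SLOTS 16

struct toy_dir {
	const char *slot[TOY_DIR_SLOTS];	/* NULL means empty */
};

/* FNV-1a over the exact bytes: different case gives a different hash. */
static unsigned int toy_hash(const char *s)
{
	unsigned int h = 2166136261u;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

static int toy_dir_add(struct toy_dir *d, const char *name)
{
	unsigned int i, h = toy_hash(name);

	for (i = 0; i < TOY_DIR_SLOTS; i++) {
		unsigned int idx = (h + i) % TOY_DIR_SLOTS;

		if (!d->slot[idx]) {
			d->slot[idx] = name;
			return 0;
		}
	}
	return -1;	/* table full */
}

static const char *toy_dir_lookup(const struct toy_dir *d, const char *name,
				  int ignorecase)
{
	unsigned int i, h = toy_hash(name);

	/* Fast path: probe from the hash of the exact name. */
	for (i = 0; i < TOY_DIR_SLOTS; i++) {
		const char *e = d->slot[(h + i) % TOY_DIR_SLOTS];

		if (!e)
			break;
		if (!strcmp(e, name))
			return e;
	}

	if (!ignorecase)
		return NULL;

	/* Slow path: the hash is useless here, so visit every entry. */
	for (i = 0; i < TOY_DIR_SLOTS; i++) {
		const char *e = d->slot[i];

		if (e && !strcasecmp(e, name))
			return e;
	}
	return NULL;
}
```

Lookups that use the stored case never reach the slow path, which is
why the cost only shows up for mismatched-case lookups in large
directories.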

Right now, with the RFC applied, you can mount an existing ext4
filesystem with:

mount -o encoding=utf8-7.0.0 /dev/sdaX /mnt

and perform lookups of compatible sequences (NFKD); the filesystem
should successfully complete the lookup.  If you add 'ignorecase' as a
mount option, casefolding will be performed and caseless matching of
compatible sequences should work.
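
The encoding value in the mount example appears to combine an encoding
name with a Unicode version.  Assuming a "name-major.minor.revision"
layout, which is inferred from that single example and not confirmed
by the patches, splitting it could look like the sketch below.  struct
encoding_spec and parse_encoding() are hypothetical names.

```c
#include <stdio.h>

/* Hypothetical parsed form of an "encoding=" mount option value. */
struct encoding_spec {
	char name[16];
	unsigned int major, minor, rev;
};

/*
 * Split a value such as "utf8-7.0.0" into name and version.
 * Returns 0 on success, -1 if the string does not have the assumed
 * "name-major.minor.rev" shape or has trailing characters.
 */
static int parse_encoding(const char *arg, struct encoding_spec *spec)
{
	char trailing;
	int n;

	n = sscanf(arg, "%15[^-]-%u.%u.%u%c", spec->name,
		   &spec->major, &spec->minor, &spec->rev, &trailing);

	/* Exactly four conversions: name plus three numbers, nothing after. */
	return n == 4 ? 0 : -1;
}
```

Keeping the version as part of the key is what lets a filesystem keep
using the folding rules it was created with, even after the kernel
gains tables for a newer Unicode release.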

Finally, thank you Olaf and Ben for your work on the normalization
patches.  I am really looking forward to having your contributions
merged, so I'd love to hear people's thoughts and suggestions on what is
needed for upstream acceptance.

Gabriel Krisman Bertazi (9):
  charsets: Introduce middle-layer for character encoding
  charsets: ascii: Wrap ascii functions to charsets library
  charsets: utf8: Hook-up utf-8 code to charsets library
  charsets: utf8: Introduce test module for kernel UTF-8 implementation
  ext4: Add ignorecase mount option
  ext4: Include encoding information on the superblock
  fscrypt: Introduce charset-based matching functions
  ext4: Support charset name matching
  ext4: Implement ext4 dcache hooks for custom charsets

Olaf Weber (4):
  charsets: utf8: Add unicode character database files
  scripts: add trie generator for UTF-8
  charsets: utf8: Introduce code for UTF-8 normalization
  charsets: utf8: reduce the size of utf8data[]

 fs/ext4/dir.c                   |   63 +
 fs/ext4/ext4.h                  |    6 +
 fs/ext4/namei.c                 |   27 +-
 fs/ext4/super.c                 |   35 +
 include/linux/charsets.h        |   73 +
 include/linux/fscrypt.h         |    1 +
 include/linux/fscrypt_notsupp.h |   16 +
 include/linux/fscrypt_supp.h    |   27 +
 include/linux/utf8norm.h        |  116 ++
 lib/Kconfig                     |   16 +
 lib/Makefile                    |    2 +
 lib/charsets/Makefile           |   24 +
 lib/charsets/ascii.c            |   98 ++
 lib/charsets/core.c             |   68 +
 lib/charsets/test_ucd.c         |  186 +++
 lib/charsets/ucd/README         |   33 +
 lib/charsets/utf8_core.c        |  178 ++
 lib/charsets/utf8norm.c         |  794 +++++++++
 scripts/Makefile                |    1 +
 scripts/mkutf8data.c            | 3464 +++++++++++++++++++++++++++++++++++++++
 20 files changed, 5219 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/charsets.h
 create mode 100644 include/linux/utf8norm.h
 create mode 100644 lib/charsets/Makefile
 create mode 100644 lib/charsets/ascii.c
 create mode 100644 lib/charsets/core.c
 create mode 100644 lib/charsets/test_ucd.c
 create mode 100644 lib/charsets/ucd/README
 create mode 100644 lib/charsets/utf8_core.c
 create mode 100644 lib/charsets/utf8norm.c
 create mode 100644 scripts/mkutf8data.c

-- 
2.15.1

Thread overview: 24+ messages
2018-01-12  7:12 Gabriel Krisman Bertazi [this message]
2018-01-12  7:12 ` [PATCH RFC 01/13] charsets: Introduce middle-layer for character encoding Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 02/13] charsets: ascii: Wrap ascii functions to charsets library Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 03/13] charsets: utf8: Add unicode character database files Gabriel Krisman Bertazi
2018-01-12 16:59   ` Darrick J. Wong
2018-01-12 20:29     ` Weber, Olaf (HPC Data Management & Storage)
2018-01-13  0:24   ` Theodore Ts'o
2018-01-13  4:28     ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 04/13] scripts: add trie generator for UTF-8 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 05/13] charsets: utf8: Introduce code for UTF-8 normalization Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 06/13] charsets: utf8: reduce the size of utf8data[] Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library Gabriel Krisman Bertazi
2018-01-12 10:38   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-16 16:50     ` Gabriel Krisman Bertazi
2018-01-16 22:19       ` Weber, Olaf (HPC Data Management & Storage)
2018-01-23  3:33         ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 08/13] charsets: utf8: Introduce test module for kernel UTF-8 implementation Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 09/13] ext4: Add ignorecase mount option Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 10/13] ext4: Include encoding information on the superblock Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 11/13] fscrypt: Introduce charset-based matching functions Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 12/13] ext4: Support charset name matching Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom charsets Gabriel Krisman Bertazi
2018-01-12 10:52   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-12 16:56 ` [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Jeremy Allison
