Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
To: "Weber\, Olaf \(HPC Data Management & Storage\)" <olaf.weber@hpe.com>
Cc: "tytso\@mit.edu" <tytso@mit.edu>,
	"david\@fromorbit.com" <david@fromorbit.com>,
	"linux-ext4\@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"linux-fsdevel\@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"kernel\@lists.collabora.co.uk" <kernel@lists.collabora.co.uk>,
	"alvaro.soliverez\@collabora.co.uk"
	<alvaro.soliverez@collabora.co.uk>
Subject: Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library
Date: Tue, 23 Jan 2018 01:33:06 -0200	[thread overview]
Message-ID: <87bmhl8d1p.fsf@collabora.co.uk> (raw)
In-Reply-To: <DF4PR8401MB0812536DFB46D5F762D5D75585EA0@DF4PR8401MB0812.NAMPRD84.PROD.OUTLOOK.COM> (Olaf Weber's message of "Tue, 16 Jan 2018 22:19:24 +0000")

"Weber, Olaf (HPC Data Management & Storage)" <olaf.weber@hpe.com>
writes:


>> One question that I have, on the other hand: Take the version you
>> shared, I want to avoid the -EINVAL for the case when strings s1
>> and s2 should match as equal, but strlen(s1) < strlen (s2).  In this
>> case:
>> 
>> strncmp (s1, s2, strlen (s2)) => Returns 0.  Matches Ok
>> strncmp (s1, s2, strlen (s1)) => Returns -EINVAL
>> 
>> I know -EINVAL doesn't mean they don't match, but this case seems too
>> error prone.
>
> If I understand your question correctly, the case of interest is
>
> 	strncmp(s1, s2, len), where len <= strlen(s1) and len <= strlen(s2)
>
> As far as I can tell the code I sketched above handles that case in the
> way you expect/want, when taking the complications introduced by
> Unicode into account. Using utf8ncursor() ensures we do get an -EINVAL
> if, and only if, we read beyond the end (len) of the source string as part of
> the normalization process. But if we are at an acceptable boundary in the
> source string when we see the end of the string, utf8byte() returns 0,
> indicating a normal/non-error end of the scan.

Hey Olaf,

Sorry for the delay.

It is not quite that scenario.  The version that requires only 1 length
fails when utf8ncursor receives a len that is smaller than one of the
strings, which is a common case when something decomposes to a larger
string:

Take this case, for instance:

s1 = {0xc2, 0xbc, 0x00},   /* 'VULGAR FRACTION ONE QUARTER' decomposes to */
s2 = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00},  /* 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */

If we do strncmp(s1, s2, strlen(s2)), it works fine.  But if we use
strlen(s1) on the third parameter, it fails.  As far as I understand,
the issue happens because utf8lookup will read to the maximum of len
characters, aborting the lookup in the middle of a sequence. Since we
don't hit a leaf for that code-point, it assumes an invalid sequence and
utf8byte aborts.

The easiest way to solve it is by receiving the two lens in strncmp.

> I think it may be worth to write some tests to (hopefully) confirm that
> the code really does what I intended it to do. The most likely case to
> fail would be where you hit the len-imposed end after a codepoint
> with CCC != 0.

The only test I had for this scenario happened to have strlen(s1) ==
strlen(s2).  I added the following, which I think catches this scenario:

/* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK'  decomposes to */
/* 'LETTER A' + 'COMBINING  OGONEK' + 'COMBINING DIAERESIS' */
  s1 = {0xc3, 0xa4, 0xCC, 0xA8, 0x00},
  s2 = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00},

I tested your version, and it works correctly for this scenario too, as
long as we set the len parameter to use the largest string, s2, instead
of s1.

>> I suppose we just could:
>> 
>>  (1) let the caller deal with it, which is error prone.  Or,
>
> The caller does have to do something when it gets -EINVAL. You have to
> define the desired semantics of that case.
>
> In the original XFS-filesystem code my choice was to treat invalid UTF-8
> sequences as binary blobs for the sake of comparisons.
>
>>  (2) Require two lens on strncmp, one for each string, Or,
>
> As a general rule this is certainly correct: each string has its
> own associated maximum length within which it should have
> been null-terminated.  So whether you need one len per
> string depends on the sources of the strings. In the original
> XFS-based code there are some tricks related to this.

I've applied this solution and it is solving every test case correctly,
including those I mentioned above.  Since it looks like the best
approach, I applied the other things you commented, and modified the
comparison functions code to receive 2 lens.  I should submit a v2
shortly, once I'm done with dealing with some changes to the fs part.

Thanks!

-- 
Gabriel Krisman Bertazi

  reply	other threads:[~2018-01-23  3:33 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-12  7:12 [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 01/13] charsets: Introduce middle-layer for character encoding Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 02/13] charsets: ascii: Wrap ascii functions to charsets library Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 03/13] charsets: utf8: Add unicode character database files Gabriel Krisman Bertazi
2018-01-12 16:59   ` Darrick J. Wong
2018-01-12 20:29     ` Weber, Olaf (HPC Data Management & Storage)
2018-01-13  0:24   ` Theodore Ts'o
2018-01-13  4:28     ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 04/13] scripts: add trie generator for UTF-8 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 05/13] charsets: utf8: Introduce code for UTF-8 normalization Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 06/13] charsets: utf8: reduce the size of utf8data[] Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library Gabriel Krisman Bertazi
2018-01-12 10:38   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-16 16:50     ` Gabriel Krisman Bertazi
2018-01-16 22:19       ` Weber, Olaf (HPC Data Management & Storage)
2018-01-23  3:33         ` Gabriel Krisman Bertazi [this message]
2018-01-12  7:12 ` [PATCH RFC 08/13] charsets: utf8: Introduce test module for kernel UTF-8 implementation Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 09/13] ext4: Add ignorecase mount option Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 10/13] ext4: Include encoding information on the superblock Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 11/13] fscrypt: Introduce charset-based matching functions Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 12/13] ext4: Support charset name matching Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom charsets Gabriel Krisman Bertazi
2018-01-12 10:52   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-12 16:56 ` [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Jeremy Allison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bmhl8d1p.fsf@collabora.co.uk \
    --to=krisman@collabora.co.uk \
    --cc=alvaro.soliverez@collabora.co.uk \
    --cc=david@fromorbit.com \
    --cc=kernel@lists.collabora.co.uk \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=olaf.weber@hpe.com \
    --cc=tytso@mit.edu \
    --subject='Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).