[emacs-bidi] Re: Arabic support

Eli Zaretskii

2010-08-27 09:56:47 UTC

> From: Kenichi Handa <***@m17n.org>
> Date: Thu, 26 Aug 2010 10:10:05 +0900
>
> I've just committed changes to trunk for Arabic shaping. If
> there're any Arabic users in this list, please check the
> displaying of Arabic text. On GNU/Linux system, you must
> compile Emacs with libotf and m17n-lib (configure script
> should detect them automatically).

Thanks. However, today's build behaves very strangely in a GUI
session on MS-Windows. For starters, cursor motion seems to jump
across many characters in the "Arabic" line of etc/HELLO. For
example, typing C-f in that line, I first move one character at a time
across "Arabic", as expected, then the cursor jumps to the right paren
of the leftmost parenthesized part, again as expected, and then I see
the following strange behavior:

. C-f moves one character to the left, to buffer position 758, as
expected.

. the next C-f jumps across many characters on the screen and lands
on position 764.

. another C-f jumps to what is reported as position 765, but on the
screen those are several characters, maybe 5 or 6.

. another C-f moves to the left paren at position 766, as expected.

. yet another C-f moves to position 767, but on the screen the
cursor jumps back into one of the characters it jumped across when
it landed on position 765 two C-f keypresses earlier.

. if I type C-b 4 times from this point, I enter a "trap", whereby
typing C-b jumps between two characters, whose buffer positions
are 764 and 765. The only way to get out of the trap is with C-a
or C-e or C-f.

I don't read Arabic, so I cannot really say whether any of this is
expected behavior. (The "trap" with C-b is certainly not the expected
behavior.) Do you see anything similar on X?

Eli Zaretskii

2010-08-28 07:27:32 UTC

> Date: Sat, 28 Aug 2010 03:35:33 +0200 (CEST)
> From: ***@free.fr
>
> Dear Eli and Kenichi,
> I would like to help, but I have no real competence in Lisp.
> I am, however, familiar with Emacs and am an Arabic speaker.
> I tried to build Emacs from trunk on MS-Windows, without success.
>
> Is it possible to find binaries ?

Yes, the Windows binaries of the development code can be found here:

http://alpha.gnu.org/gnu/emacs/windows/

Note, however, that they are updated roughly once a week, and the last
snapshot was from August 16. That is before Handa-san installed his
latest changes, so please wait for the next snapshot, which I hope
will be there shortly.

Thanks.

Amit Aronovitch

2010-08-28 10:32:00 UTC

Technical question about rebuilding emacs:

The build process only warns about .elc files which are older than their
respective .el files.
Is there a way to tell it to recompile them?

The problem is that old .elc files (which, btw, are kept in the source tree
even though I configure a separate build directory) sometimes break the
build.
I currently solve the problem by using "bzr clean-tree --ignored". But this
deletes ALL elc files and the price is lengthy build time...

thanks,
AA

Eli Zaretskii

2010-08-28 11:26:30 UTC

> Date: Sat, 28 Aug 2010 13:32:00 +0300
> From: Amit Aronovitch <***@gmail.com>
> Cc: emacs-***@gnu.org
>
> Technical question about rebuilding emacs:
>
> The build process only warns about .elc files which are older than their
> respective .el files.
> Is there a way to tell it to recompile them?

This happens automatically for me, but I build in the source tree.

Anyway, this should do it if it doesn't happen automatically:

cd lisp && make

Amit Aronovitch

2010-08-28 10:15:44 UTC

On Fri, Aug 27, 2010 at 12:56 PM, Eli Zaretskii <***@gnu.org> wrote:

> > From: Kenichi Handa <***@m17n.org>
> > Date: Thu, 26 Aug 2010 10:10:05 +0900
> >
> > I've just committed changes to trunk for Arabic shaping. If
> > there're any Arabic users in this list, please check the
> > displaying of Arabic text. On GNU/Linux system, you must
> > compile Emacs with libotf and m17n-lib (configure script
> > should detect them automatically).
>
> Thanks. However, today's build behaves very strangely in a GUI
> session on MS-Windows. For starters, cursor motion seems to jump
> across many characters in the "Arabic" line of etc/HELLO. For
> example, typing C-f in that line, I first move one character at a time
> across "Arabic", as expected, then the cursor jumps to the right paren
> of the leftmost parenthesized part, again as expected, and then I see
> the following strange behavior:
>
> . C-f moves one character to the left, to buffer position 758, as
> expected.
>
> . the next C-f jumps across many characters on the screen and lands
> on position 764.
>
> . another C-f jumps to what is reported as position 765, but on the
> screen those are several characters, maybe 5 or 6.
>
> . another C-f moves to the left paren at position 766, as expected.
>
> . yet another C-f moves to position 767, but on the screen the
> cursor jumps back into one of the characters it jumped across when
> it landed on position 765 two C-f keypresses earlier.
>
> . if I type C-b 4 times from this point, I enter a "trap", whereby
> typing C-b jumps between two characters, whose buffer positions
> are 764 and 765. The only way to get out of the trap is with C-a
> or C-e or C-f.
>
> I don't read Arabic, so I cannot really say whether any of this is
> expected behavior. (The "trap" with C-b is certainly not the expected
> behavior.) Do you see anything similar on X?
>
>
1) I confirm that Arabic shaping seems to work fine on my build (27/8/10
rev. 101200, on Linux+X (Debian unstable)).

2) Logical movement with C-f/C-b in the hello file seems fine (I do not see
the trap described above).

3) My Arabic is very basic, and I am not familiar with Arabic computing
(keyboards etc.) - I noticed the following points, but I am not sure what is
the expected behavior (I can only compare to other programs - gedit in this
case):

a) Column numbers (column-number-mode) behave strangely (I suspect that
m17n-lib's invisible markup consumes column numbers). For example, as you move
using C-f in the word "هذا", column numbers go through "0,1,4,5" (i.e. the
second character takes up 3 columns). If I change that to "بهذا", the column
positions are "0,1,4,6,7" (the second and third chars take up 3 and 2
columns resp.?).
In gedit column positions are 1 character per column and do not depend on
the shaping.

b) Arabic keyboard has the ligature "Lam-Alef" (U+FEFB) on the key marked
"B" in qwerty keyboards. When I type this in emacs, I get Lam and Alef
(which are auto-shaped correctly as the proper ligature). C-d when cursor is
on the ligature erases the Alef and another C-d erases the Lam. This seems
like proper behavior to me. However, in gedit, the "B" key produces a
(U+FEFB) which is always displayed as a ligature, deleted in a single Del
press, and never connected to previous character. Cut and pasting this into
emacs, I get a similar behavior there.
The question is: do Arabic users expect to be able to produce this "stiff"
ligature? Is the behavior of gedit a bug? Should the emacs "Lam-Alef" key
behave as it does (i.e. produce two characters)?

thanks,
Amit Aronovitch

James Cloos

2010-08-29 05:13:07 UTC

>>>>> "AA" == Amit Aronovitch <***@gmail.com> writes:

AA> b) Arabic keyboard has the ligature "Lam-Alef" (U+FEFB) on the key marked
AA> "B" in qwerty keyboards. When I type this in emacs, I get Lam and Alef
AA> (which are auto-shaped correctly as the proper ligature).

Emacs' Arabic keyboard is based on a patch I proposed, which in turn is
based on typing the relevant keys with the X11 keyboard set to the
Arabic keyboard from xkeyboard-config. The mapping of U+FEFB, U+FEF9,
U+FEF7 and U+FEF5 to the strings "لا", "لإ", "لأ", "لآ" (I hope those
pasted correctly) comes from a patch to libX11's utf-8 Compose file which
I pushed and which was written and submitted by Khaled Hosny, who has been quite
active in i18n circles.

My guess is that the behaviour you described, given why it occurs, is
the desired behaviour.

But that is only a guess. I don't read or speak any of the languages
which use the Arabic script, I'm just interested in i18n and pan-
language typography and font design.

-JimC
--
James Cloos <***@jhcloos.com> OpenPGP: 1024D/ED7DAEA6
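
The mapping described above, from the Lam-Alef presentation forms to two-character strings, coincides with the Unicode compatibility decompositions of those code points. A minimal sketch in Python, assuming only the standard unicodedata module; the helper name expand_lam_alef is made up for illustration, and this is not how the Emacs input method or the libX11 Compose file is implemented:

```python
import unicodedata

# The four isolated Lam-Alef presentation forms mentioned in the message.
LAM_ALEF_LIGATURES = ["\ufefb", "\ufef9", "\ufef7", "\ufef5"]


def expand_lam_alef(text):
    """Replace Lam-Alef presentation forms by Lam followed by the Alef variant.

    Hypothetical helper: it relies on NFKC compatibility normalization and
    only touches the four code points listed above.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in LAM_ALEF_LIGATURES else ch
        for ch in text
    )


# Show each ligature next to the code points it expands to.
for lig in LAM_ALEF_LIGATURES:
    expanded = expand_lam_alef(lig)
    print("U+%04X -> %s" % (ord(lig), " ".join("U+%04X" % ord(c) for c in expanded)))
```

Text expanded this way is left for the shaping engine to render as a ligature, which matches the behaviour reported for the Emacs keyboard; text containing U+FEFB itself would keep the "stiff" ligature.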

Kenichi Handa

2010-08-30 02:07:38 UTC

In article <AANLkTinFrEnuW=oPeBqg6=wYegbrR+***@mail.gmail.com>, Amit Aronovitch <***@gmail.com> writes:

> 1) I confirm that Arabic shaping seems to work fine on my build (27/8/10
> rev. 101200, on Linux+X (Debian unstable)).

> 2) Logical movement with C-f/C-b in the hello file seems fine (I do not see
> the trap described above).

Thank you for testing them.

> 3) My Arabic is very basic, and I am not familiar with Arabic computing
> (keyboards etc.) - I noticed the following points, but I am not sure what is
> the expected behavior (I can only compare to other programs - gedit in this
> case):

> a) Column numbers (column-number-mode) behave strangely (I suspect that
> m17n-lib's invisible markup consumes column numbers). For example, as you move
> using C-f in the word "هذا", column numbers go through "0,1,4,5" (i.e. the
> second character takes up 3 columns). If I change that to "بهذا", the column
> positions are "0,1,4,6,7" (the second and third chars take up 3 and 2
> columns resp.?).
> In gedit column positions are 1 character per column and do not depend on
> the shaping.

I've just committed a fix for this bug. It's not related to
m17n-lib.

---
Kenichi Handa
***@m17n.org

Amit Aronovitch

2010-08-30 13:42:38 UTC

On Mon, Aug 30, 2010 at 5:07 AM, Kenichi Handa <***@m17n.org> wrote:

> In article <AANLkTinFrEnuW=oPeBqg6=wYegbrR+***@mail.gmail.com>,
> Amit Aronovitch <***@gmail.com> writes:
>
> > 1) I confirm that Arabic shaping seems to work fine on my build (27/8/10
> > rev. 101200, on Linux+X (Debian unstable)).
>
> > 2) Logical movement with C-f/C-b in the hello file seems fine (I do not see
> > the trap described above).
>
> Thank you for testing them.
>
> > 3) My Arabic is very basic, and I am not familiar with Arabic computing
> > (keyboards etc.) - I noticed the following points, but I am not sure what is
> > the expected behavior (I can only compare to other programs - gedit in this
> > case):
>
> > a) Column numbers (column-number-mode) behave strangely (I suspect that
> > m17n-lib's invisible markup consumes column numbers). For example, as you move
> > using C-f in the word "هذا", column numbers go through "0,1,4,5" (i.e. the
> > second character takes up 3 columns). If I change that to "بهذا", the column
> > positions are "0,1,4,6,7" (the second and third chars take up 3 and 2
> > columns resp.?).
> > In gedit column positions are 1 character per column and do not depend on
> > the shaping.
>
> I've just committed a fix for this bug. It's not related to
> m17n-lib.
>
>
Thanks. Much better now :-)

I also checked the diacritics (tashkil): It seems that they do not take up
column number in Emacs.

In gedit, cursor movement is similar, but the vowels there do take up column
number (as for cursor movement, as in emacs: forwards/backwards skips them,
while 'delete' handles them separately). I find this behavior more
consistent with the way both programs handle the lam-alef ligature (one
cursor-movement space, but two column numbers).
However, as I said, I do not know which behavior is the most natural for
Arabic users.

AA

Amit Aronovitch

2010-08-30 14:11:06 UTC

On Mon, Aug 30, 2010 at 4:42 PM, Amit Aronovitch <***@gmail.com> wrote:

>
> On Mon, Aug 30, 2010 at 5:07 AM, Kenichi Handa <***@m17n.org> wrote:
>
>> In article <AANLkTinFrEnuW=oPeBqg6=wYegbrR+***@mail.gmail.com>,
>> Amit Aronovitch <***@gmail.com> writes:
>>
>> > 1) I confirm that Arabic shaping seems to work fine on my build (27/8/10
>> > rev. 101200, on Linux+X (Debian unstable)).
>>
>> > 2) Logical movement with C-f/C-b in the hello file seems fine (I do not see
>> > the trap described above).
>>
>> Thank you for testing them.
>>
>> > 3) My Arabic is very basic, and I am not familiar with Arabic computing
>> > (keyboards etc.) - I noticed the following points, but I am not sure what is
>> > the expected behavior (I can only compare to other programs - gedit in this
>> > case):
>>
>> > a) Column numbers (column-number-mode) behave strangely (I suspect that
>> > m17n-lib's invisible markup consumes column numbers). For example, as you move
>> > using C-f in the word "هذا", column numbers go through "0,1,4,5" (i.e. the
>> > second character takes up 3 columns). If I change that to "بهذا", the column
>> > positions are "0,1,4,6,7" (the second and third chars take up 3 and 2
>> > columns resp.?).
>> > In gedit column positions are 1 character per column and do not depend on
>> > the shaping.
>>
>> I've just committed a fix for this bug. It's not related to
>> m17n-lib.
>>
>>
> Thanks. Much better now :-)
>
> I also checked the diacritics (tashkil): It seems that they do not take up
> column number in Emacs.
>
> In gedit, cursor movement is similar, but the vowels there do take up
> column number (as for cursor movement, as in emacs: forwards/backwards skips
> them, while 'delete' handles them separately). I find this behavior more
> consistent with the way both programs handle the lam-alef ligature (one
> cursor-movement space, but two column numbers).
> However, as I said, I do not know which behavior is the most natural for
> Arabic users.
>
>
Checking the *Hebrew* diacritics (nikkud), I noticed a problem:
In some cases the diacritics are displayed in the wrong position (their
"real" cursor position is correct, which makes the UI *very* confusing).
e.g. if you type " עָלֵינוּ", the Qamatz (first vowel) appears under the
space instead of under the Ain (first letter). If you remove the space, the
Qamatz does not appear at all. The Zeire (second vowel) appears under the
Ain (first letter) instead of the Lamed (second letter). However, the Shuruk
sticks to the Vav (last letter) as it should (though the positioning is too
close and too high IMHO).
I do not know if this issue is specific to my build.
My complete config.log is available here:

http://dl.dropbox.com/u/6960989/dumps/config.log

AA

Eli Zaretskii

2010-08-30 18:50:34 UTC

The Windows binaries of today's development snapshot were just
uploaded to

http://alpha.gnu.org/gnu/emacs/windows/

TIA

Kenichi Handa

2010-09-03 07:35:47 UTC

In article <AANLkTinO+-***@mail.gmail.com>, Amit Aronovitch <***@gmail.com> writes:

> Checking the *Hebrew* diacritics (nikkud), I noticed a problem:
> In some cases the diacritics are displayed in the wrong position (their
> "real" cursor position is correct, which makes the UI *very* confusing).
> e.g. if you type " עָלֵינוּ", the Qamatz (first vowel) appears under the
> space instead of under the Ain (first letter). If you remove the space, the
> Qamatz does not appear at all. The Zeire (second vowel) appears under the
> Ain (first letter) instead of the Lamed (second letter). However, the Shuruk
> sticks to the Vav (last letter) as it should (though the positioning is too
> close and too high IMHO).
> I do not know if this issue is specific to my build.

I think it is specific to your Hebrew font. Please tell me
which font is selected by typing C-u C-x = on some Hebrew
character.

---
Kenichi Handa
***@m17n.org

Amit Aronovitch

2010-09-03 07:54:34 UTC

On Fri, Sep 3, 2010 at 10:35 AM, Kenichi Handa <***@m17n.org> wrote:

> In article <AANLkTinO+-***@mail.gmail.com>,
> Amit Aronovitch <***@gmail.com> writes:
>
> > Checking the *Hebrew* diacritics (nikkud), I noticed a problem:
> > In some cases the diacritics are displayed in the wrong position (their
> > "real" cursor position is correct, which makes the UI *very* confusing).
> > e.g. if you type " עָלֵינוּ", the Qamatz (first vowel) appears under the
> > space instead of under the Ain (first letter). If you remove the space, the
> > Qamatz does not appear at all. The Zeire (second vowel) appears under the
> > Ain (first letter) instead of the Lamed (second letter). However, the Shuruk
> > sticks to the Vav (last letter) as it should (though the positioning is too
> > close and too high IMHO).
> > I do not know if this issue is specific to my build.
>
> I think it is specific to your Hebrew font. Please tell me
> which font is selected by typing C-u C-x = on some Hebrew
> character.

Thanks for your reply.

Here is the output:

character: ל (1500, #o2734, #x5dc)
preferred charset: iso-8859-8 (ISO/IEC 8859/8)
code point: 0xEC
syntax: w which means: word
category: .:Base
buffer code: #xD7 #x9C
file code: #xD7 #x9C (encoded by coding system utf-8-unix)
display: composed to form "לֵ" (see below)

Composed with the following character(s) "ֵ" using this font:
xft:-unknown-DejaVu Sans-normal-normal-normal-*-13-*-*-*-*-0-iso10646-1
by these glyphs:
[0 1 1500 1328 8 1 7 11 0 nil]
[0 1 1461 1299 0 3 6 -1 2 nil]

Character code properties: customize what to show
name: HEBREW LETTER LAMED
general-category: Lo (Letter, Other)
----------------------------
I do not think I have any font-related customization in my .emacs (this is
the default choice on my system).

Any tips on which fonts I should use (and how to set them up)?

Amit

Kenichi Handa

2010-09-01 02:55:38 UTC

In article <***@mail.gmail.com>, Amit Aronovitch <***@gmail.com> writes:

> I also checked the diacritics (tashkil): It seems that they do not take up
> column number in Emacs.

> In gedit, cursor movement is similar, but the vowels there do take up column
> number (as for cursor movement, as in emacs: forwards/backwards skips them,
> while 'delete' handles them separately). I find this behavior more
> consistent with the way both programs handle the lam-alef ligature (one
> cursor-movement space, but two column numbers).
> However, as I said, I do not know which behavior is the most natural for
> Arabic users.

In Emacs, the column number matters when a user types C-n
(or C-p) to go to (roughly) the same x-position on the next
(or previous) line. So, the column number should reflect
the x-position, and for that, zero-width combining
characters should not be counted in the column number. Of
course, when text is displayed in a variable-pitch font,
it is not good to use the column number for such a purpose,
but that is what Emacs has done for a long time.

The lam-alef ligature case should be fixed so that the
ligature glyph is counted as one column.

It seems that gedit uses the actual x-pixel-position to move
the cursor down. It would be better if Emacs did the same in
the future.

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-09-01 04:58:59 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Wed, 01 Sep 2010 11:55:38 +0900
>
> In Emacs, the column number matters when a user types C-n
> (or C-p) to go to (roughly) the same x-position on the next
> (or previous) line. So, the column number should reflect
> the x-position, and for that, zero-width combining
> characters should not be counted in the column number.

But does the implementation of current-column, move-to-column and
friends support that? Perhaps I'm missing something, but my reading
of current_column_1 and its subroutines is that it only supports
display strings, composed characters, and display tables. Do
zero-width characters use any of these mechanisms?

Kenichi Handa

2010-09-01 05:06:55 UTC

In article <E1OqfPX-0008NB-***@fencepost.gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> > In Emacs, the column number matters when a user types C-n
> > (or C-p) to go to (roughly) the same x-position on the next
> > (or previous) line. So, the column number should reflect
> > the x-position, and for that, zero-width combining
> > characters should not be counted in the column number.

> But does the implementation of current-column, move-to-column and
> friends support that? Perhaps I'm missing something, but my reading
> of current_column_1 and its subroutines is that it only supports
> display strings, composed characters, and display tables. Do
> zero-width characters use any of these mechanisms?

Those functions basically use char-width-table to get the
column width of each character, and zero-width combining
characters have 0 in that table.

---
Kenichi Handa
***@m17n.org
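
The rule described above (zero-width combining characters contribute nothing to the column) can be illustrated outside Emacs. The following is a rough sketch in Python, not the Emacs implementation: it uses the Unicode general category as a stand-in for char-width-table, and the function name column_width is hypothetical.

```python
import unicodedata


def column_width(text):
    """Count columns, giving nonspacing and enclosing marks a width of 0.

    Assumption: every other character is one column wide, which ignores
    tabs, wide East Asian characters, and ligatures such as lam-alef.
    """
    return sum(
        0 if unicodedata.category(ch) in ("Mn", "Me") else 1
        for ch in text
    )


# Hebrew word with three combining marks: 8 code points, 5 base letters.
word = "\u05e2\u05b8\u05dc\u05b5\u05d9\u05e0\u05d5\u05bc"
print(len(word), column_width(word))
```

With a rule like this, Hebrew nikkud and Arabic tashkil do not advance the column, which is the behaviour reported for Emacs earlier in the thread.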

Kenichi Handa

2010-09-03 07:17:56 UTC

In article <jwvk4n6aqmk.fsf-monnier+***@gnu.org>, Stefan Monnier <***@iro.umontreal.ca> writes:

> > It seems that gedit uses the actual x-pixel-position to move the
> > cursor down. It would be better if Emacs did the same in the future.

> The future has passed when Emacs-23 was released.

Wow, I didn't notice that.

---
Kenichi Handa
***@m17n.org

Kenichi Handa

2010-08-30 07:47:08 UTC

In article <***@gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> Thanks. However, today's build behaves very strangely in
> a GUI session on MS-Windows. For starters, cursor motion
> seems to jump across many characters in the "Arabic" line
> of etc/HELLO. For example, typing C-f in that line, I
> first move one character at a time across "Arabic", as
> expected, then the cursor jumps to the right paren of the
> leftmost parenthesized part, again as expected, and then I
> see the following strange behavior:

I can't see that strange behaviour on GNU/Linux. Amit
Aronovitch <***@gmail.com> also reported that
rendering and cursor movement are ok on Debian. So, I
suspect that the problem is specific to Windows. In Emacs,
bidi reordering is done by Emacs itself, so the `shape'
method of font backend should not reorder glyphs. But,
perhaps Uniscribe backend reorders Arabic text, right?

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-08-30 14:06:51 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: emacs-***@gnu.org, emacs-***@gnu.org
> Date: Mon, 30 Aug 2010 16:47:08 +0900
>
> I can't see that strange behaviour on GNU/Linux. Amit
> Aronovitch <***@gmail.com> also reported that
> rendering and cursor movement are ok on Debian. So, I
> suspect that the problem is specific to Windows.

Looks like that, yes.

> In Emacs, bidi reordering is done by Emacs itself, so the `shape'
> method of font backend should not reorder glyphs. But, perhaps
> Uniscribe backend reorders Arabic text, right?

No, not AFAIK. We call the ScriptItemize API of Uniscribe with NULL
as the 4th and 5th arguments, which AFAIU should disable reordering.
Perhaps Jason could chime in and tell if I'm right here.

Btw, does the current code support Arabic ligatures and shaping on
GNU/Linux?

Kenichi Handa

2010-09-01 02:17:03 UTC

In article <E1Oq50d-0006YC-***@fencepost.gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> > In Emacs, bidi reordering is done by Emacs itself, so the `shape'
> > method of font backend should not reorder glyphs. But, perhaps
> > Uniscribe backend reorders Arabic text, right?

> No, not AFAIK. We call the ScriptItemize API of Uniscribe with NULL
> as the 4th and 5th arguments, which AFAIU should disable reordering.
> Perhaps Jason could chime in and tell if I'm right here.

I read the function uniscribe_shape roughly. It has this
code:

  for (i = 0; i < nitems; i++)
    {
      int nglyphs, nchars_in_run, rtl = items[i].a.fRTL ? -1 : 1;
      [...]
      if (SUCCEEDED (result))
        {
          int j, nclusters, from, to;

          from = rtl > 0 ? 0 : nchars_in_run - 1;

Doesn't it mean uniscribe_shape reorders glyphs?
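
To make the concern concrete: if the shaping step already reverses a right-to-left run and the bidi display code reverses it again, the two reversals cancel and the text appears in logical order. A toy sketch in Python of that interaction, with made-up function names and no claim about what uniscribe_shape actually does:

```python
def shape_run(chars, rtl, shaper_reorders):
    """Stand-in for a font backend's shape method (hypothetical).

    Returns one "glyph" per character; optionally reverses an RTL run,
    which is the behavior being questioned above.
    """
    glyphs = ["<%s>" % c for c in chars]
    if rtl and shaper_reorders:
        glyphs.reverse()
    return glyphs


def bidi_display(glyphs, rtl):
    """Stand-in for the display engine's reordering of one run."""
    return glyphs[::-1] if rtl else glyphs[:]


logical = "abc"  # pretend these are three RTL characters in buffer order

# Shaper leaves the order alone: the display layer reverses once.
single = bidi_display(shape_run(logical, True, False), True)
# Shaper also reverses: the two reversals cancel out.
double = bidi_display(shape_run(logical, True, True), True)
print(single)
print(double)
```

In this toy model the doubly reordered run looks as if no reordering had happened at all, so the two cases cannot be told apart from the screen alone.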

> Btw, does the current code support Arabic ligatures and shaping on
> GNU/Linux?

I don't know about ligatures, but at least these should be
supported by libotf and m17n-lib with OpenType fonts:

o glyph substitution of consonants depending on where they are:
  at the beginning, middle, or end of a word.
o glyph positioning of vowels
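
As a rough illustration of the first item, the positional form of each letter can be derived from whether it connects to its neighbours. The sketch below is in Python and is only an approximation: the set of right-joining letters is deliberately partial, the names are hypothetical, and real shaping is driven by the font's OpenType tables through libotf and m17n-lib, not by code like this.

```python
# Letters that join only to the preceding letter (partial list, an assumption
# for illustration): alef, dal, thal, reh, zain, waw.
RIGHT_JOINING = set("\u0627\u062f\u0630\u0631\u0632\u0648")


def positional_forms(word):
    """Return 'isolated', 'initial', 'medial' or 'final' for each letter.

    Assumes every character is an Arabic letter and that any letter not in
    RIGHT_JOINING can join on both sides.
    """
    forms = []
    for i, ch in enumerate(word):
        # Connected to the previous letter only if that letter joins forward.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING
        # Connected to the next letter only if this letter joins forward.
        joins_next = i + 1 < len(word) and ch not in RIGHT_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


print(positional_forms("\u0647\u0630\u0627"))        # heh, thal, alef
print(positional_forms("\u0628\u0647\u0630\u0627"))  # beh, heh, thal, alef
```

The shaper needs the whole word to make these decisions, which is why a word cannot be shaped one character at a time.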

---
Kenichi Handa
***@m17n.org

Martin J. Dürst

2010-09-01 03:47:23 UTC

We have made similar observations with what might be double reordering
(or no reordering) on a Windows system. I expect we will report more
details tomorrow.

Regards, Martin.

On 2010/09/01 11:17, Kenichi Handa wrote:
> In article <E1Oq50d-0006YC-***@fencepost.gnu.org>, Eli Zaretskii <***@gnu.org> writes:
>
>>> In Emacs, bidi reordering is done by Emacs itself, so the `shape'
>>> method of font backend should not reorder glyphs. But, perhaps
>>> Uniscribe backend reorders Arabic text, right?
>
>> No, not AFAIK. We call the ScriptItemize API of Uniscribe with NULL
>> as the 4th and 5th arguments, which AFAIU should disable reordering.
>> Perhaps Jason could chime in and tell if I'm right here.
>
> I read the function uniscribe_shape roughly. It has this
> code:
>
>   for (i = 0; i < nitems; i++)
>     {
>       int nglyphs, nchars_in_run, rtl = items[i].a.fRTL ? -1 : 1;
>       [...]
>       if (SUCCEEDED (result))
>         {
>           int j, nclusters, from, to;
>
>           from = rtl > 0 ? 0 : nchars_in_run - 1;
>
> Doesn't it mean uniscribe_shape reorders glyphs?

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

大嶋俊祐

2010-09-02 07:45:02 UTC

Hello, everybody.

The attached picture shows a test of bidi-display-reordering. The upper case is with bidi-display-reordering set to nil; the lower case is with it non-nil. The problem is that the characters are ordered left to right, even though they should be ordered right to left, when bidi-display-reordering is non-nil. This test was done in Emacs 24.0.50.1 (i686-pc-cygwin) on Microsoft Windows XP Professional Version 2002 Service Pack 3.

Shunsuke Oshima ***@hotmail.com

> Date: Wed, 1 Sep 2010 12:47:23 +0900
> From: ***@it.aoyama.ac.jp
> To: ***@m17n.org
> CC: ***@gnu.org; emacs-***@gnu.org; emacs-***@gnu.org; ***@gnu.org
> Subject: Re: [emacs-bidi] Re: Arabic support
>
> We have made similar observations with what might be double reordering
> (or no reordering) on a Windows system. I expect we will report more
> details tomorrow.
>
> Regards, Martin.
>
> On 2010/09/01 11:17, Kenichi Handa wrote:
>> In article <E1Oq50d-0006YC-***@fencepost.gnu.org>, Eli Zaretskii <***@gnu.org> writes:
>>
>>>> In Emacs, bidi reordering is done by Emacs itself, so the `shape'
>>>> method of font backend should not reorder glyphs. But, perhaps
>>>> Uniscribe backend reorders Arabic text, right?
>>
>>> No, not AFAIK. We call the ScriptItemize API of Uniscribe with NULL
>>> as the 4th and 5th arguments, which AFAIU should disable reordering.
>>> Perhaps Jason could chime in and tell if I'm right here.
>>
>> I read the function uniscribe_shape roughly. It has this
>> code:
>>
>>   for (i = 0; i < nitems; i++)
>>     {
>>       int nglyphs, nchars_in_run, rtl = items[i].a.fRTL ? -1 : 1;
>>       [...]
>>       if (SUCCEEDED (result))
>>         {
>>           int j, nclusters, from, to;
>>
>>           from = rtl > 0 ? 0 : nchars_in_run - 1;
>>
>> Doesn't it mean uniscribe_shape reorders glyphs?

Eli Zaretskii

2010-09-02 09:31:54 UTC

> From: 大嶋俊祐 <***@hotmail.com>
> Date: Thu, 2 Sep 2010 16:45:02 +0900
> Cc: emacs-***@gnu.org, emacs-***@gnu.org, ***@gnu.org
>
> The attached picture shows a test of bidi-display-reordering. The upper case is with bidi-display-reordering set to nil; the lower case is with it non-nil. The problem is that the characters are ordered left to right, even though they should be ordered right to left, when bidi-display-reordering is non-nil. This test was done in Emacs 24.0.50.1 (i686-pc-cygwin) on Microsoft Windows XP Professional Version 2002 Service Pack 3.

Thanks. This is a Cygwin build, which is supposed to use (a port of)
libotf. It isn't supposed to use Uniscribe directly.

So this particular problem could be the result of one or more of the
following factors:

. Some problem in the build process (I think Handa-san mentioned some
special actions for a proper build, like link against specific
libraries).

. Some bug in the ported libotf, which screws up bidirectional
display when the text is already reordered.

. Some bug in Emacs related to the display of bidirectional text.

The last one seems extremely unlikely, since AFAIU it works for
Handa-san on GNU/Linux and for me on MS-Windows, with Hebrew text.

To summarize, I don't think this problem is the same one as seen in
the native Windows build.

Martin J. Dürst

2010-09-02 12:58:48 UTC

Hello Eli,

Many thanks for your quick and detailed feedback. Any hints on how we
could go about figuring out which of the factors below is the culprit?
(e.g. check which libraries we linked against,...)

Regards, Martin.

On 2010/09/02 18:31, Eli Zaretskii wrote:
>> From: 大嶋俊祐 <***@hotmail.com>
>> Date: Thu, 2 Sep 2010 16:45:02 +0900
>> Cc: emacs-***@gnu.org, emacs-***@gnu.org, ***@gnu.org
>>
>> The attached picture shows a test of bidi-display-reordering. The upper case is with bidi-display-reordering set to nil; the lower case is with it non-nil. The problem is that the characters are ordered left to right, even though they should be ordered right to left, when bidi-display-reordering is non-nil. This test was done in Emacs 24.0.50.1 (i686-pc-cygwin) on Microsoft Windows XP Professional Version 2002 Service Pack 3.
>
> Thanks. This is a Cygwin build, which is supposed to use (a port of)
> libotf. It isn't supposed to use Uniscribe directly.
>
> So this particular problem could be the result of one or more of the
> following factors:
>
> . Some problem in the build process (I think Handa-san mentioned some
> special actions for a proper build, like link against specific
> libraries).
>
> . Some bug in the ported libotf, which screws up bidirectional
> display when the text is already reordered.
>
> . Some bug in Emacs related to the display of bidirectional text.
>
> The last one seems extremely unlikely, since AFAIU it works for
> Handa-san on GNU/Linux and for me on MS-Windows, with Hebrew text.
>
> To summarize, I don't think this problem is the same one as seen in
> the native Windows build.
>
>

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Eli Zaretskii

2010-09-02 14:13:35 UTC

> Date: Thu, 02 Sep 2010 21:58:48 +0900
> From: "Martin J. Dürst" <***@it.aoyama.ac.jp>
> CC: 大嶋俊祐 <***@hotmail.com>, ***@m17n.org,
> emacs-***@gnu.org, emacs-***@gnu.org, ***@gnu.org
>
> Many thanks for your quick and detailed feedback. Any hints on how we
> could go about figuring out which of the factors below is the culprit?
> (e.g. check which libraries we linked against,...)

I would begin with posting here what the configure script displays at
the end of its run, where it tells which features will be available in
the build.

Handa-san said in a message about a week ago what libraries to link
against to get Arabic support, see:

http://lists.gnu.org/archive/html/emacs-devel/2010-08/msg01157.html

Make sure you do what he says there, and ask questions if you have
them. Knowing the version of libotf you have might also help.

There are a couple of other users of Cygwin here, so perhaps asking
them to try to reproduce your problem would be a good idea. Asking on
the Cygwin mailing list regarding the degree of support for
bidirectional text in the ported libotf and its dependencies might
also help.

Hopefully, one or more of these will bring some ideas, and we could
take it from there.

Eli Zaretskii

2010-09-01 06:11:24 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Wed, 01 Sep 2010 11:17:03 +0900
>
> I read the function uniscribe_shape roughly. It has this
> code:
>
> for (i = 0; i < nitems; i++)
> {
> int nglyphs, nchars_in_run, rtl = items[i].a.fRTL ? -1 : 1;
> [...]
> if (SUCCEEDED (result))
> {
> int j, nclusters, from, to;
>
> from = rtl > 0 ? 0 : nchars_in_run - 1;
>
> Doesn't it mean uniscribe_shape reorders glyphs?

This reorders a single LGSTRING, according to my reading. Isn't an
LGSTRING a single grapheme cluster, rather than several distinct
characters?

Btw, where's the documentation of LGSTRING? The commentary to
uniscribe_shape says to look in font-make-gstring, but I cannot find
that, neither as function nor as variable. In general, everything
about compositions and lgstrings needs a lot more of documentation.

Kenichi Handa

2010-09-01 07:08:50 UTC

In article <E1OqgXc-0001rS-***@fencepost.gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> > Doesn't it mean uniscribe_shape reorders glyphs?

> This reorders a single LGSTRING, according to my reading. Isn't an
> LGSTRING a single grapheme cluster, rather than several distinct
> characters?

No, LGSTRING may contain multiple grapheme clusters. In the
case of Arabic, we make LGSTRING for one Arabic word then
shape it (otherwise, the shaper can't know where in a word a
consonant appears). So, usually LGSTRING contains multiple
grapheme clusters for Arabic. Glyphs constituting a
grapheme cluster have the same values in LGLYPH_FROM (G) and
LGLYPH_TO (G), where G is an LGLYPH given by LGSTRING_GLYPH
(LGSTRING, IDX).

> Btw, where's the documentation of LGSTRING? The commentary to
> uniscribe_shape says to look in font-make-gstring, but I cannot find
> that, neither as function nor as variable. In general, everything
> about compositions and lgstrings needs a lot more of documentation.

I renamed font-make-gstring to composition-get-gstring and
moved the code to composite.c. The above macros for
accessing LGSTRING and LGLYPH are in composite.h.

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-09-02 11:53:15 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Wed, 01 Sep 2010 16:08:50 +0900
>
> LGSTRING may contain multiple grapheme clusters. In the
> case of arabic, we make LGSTRING for one Arabic word then
> shape it (otherwise, the shaper can't know where in a word a
> consonant appears).

Where can I find the code which decides how to break text into
LGSTRINGs? I'd like to see such code for both Arabic and Hebrew,
unless it's the same code.

For example, can characters like digits or other neutrals be included
in the same LGSTRING with Arabic and Hebrew? Or will an LGSTRING
always include characters from one script only?

I'm asking because it's possible that we will need to modify
w32uniscribe.c to reorder R2L characters before we pass them to the
Uniscribe ScriptShape API, to let it see the characters in the logical
order it expects them. That's if it turns out that Uniscribe cannot
otherwise shape them correctly.

TIA

Eli Zaretskii

2010-09-02 12:00:30 UTC

And another question: AFAIU, an LGSTRING specifies characters as
Unicode codepoints, while the Windows Uniscribe APIs expect wchar_t
wide characters, which on Windows means UTF-16. This means we should
encode the codepoints in LGSTRINGs to UTF-16 before passing them to
Uniscribe, rather than passing them unaltered, right? The current
code will break for characters whose Unicode codepoints are beyond the
BMP, right?
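
To make the concern concrete, here is a minimal C sketch of the UTF-16 encoding step in question. It is not code from w32uniscribe.c; the function name encode_utf16 is hypothetical and only illustrates how a codepoint beyond the BMP becomes a surrogate pair, while BMP codepoints pass through as a single 16-bit unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Encode the Unicode codepoint CP as UTF-16 into OUT, which must have
   room for 2 code units.  Return the number of units written, or 0 if
   CP is not a valid Unicode scalar value.  */
size_t
encode_utf16 (uint32_t cp, uint16_t out[2])
{
  /* Reject values above U+10FFFF and lone surrogate codepoints.  */
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  /* BMP codepoints are stored unaltered in one unit.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* Beyond the BMP: split the 20-bit offset into a surrogate pair.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

Under this sketch, an Arabic letter such as U+0627 needs one unit, whereas a character from the 1DXXX block needs two, so passing codepoints unaltered would only be safe within the BMP.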

Eli Zaretskii

2010-09-02 14:29:36 UTC

> From: Jason Rumney <***@gnu.org>
> Cc: ***@m17n.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Thu, 02 Sep 2010 21:09:57 +0800
>
> In practice I don't think any of the scripts that we use the shaping
> engine for lie beyond the BMP.

By "scripts that we use" you mean scripts supported by Uniscribe?
scripts supported by Emacs? scripts available on most Windows systems?
something else?

I do see some useful characters beyond the BMP, e.g. the 1DXXX block
for mathematical symbols, 1F1XX for parenthesized and circled Latin
letters, CJK Ideographs extensions in the 20XXX and 2AXXX ranges, etc.

Thanks.

Kenichi Handa

2010-09-02 13:01:07 UTC

In article <E1Or8Lz-0004if-***@fencepost.gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> Where can I find the code which decides how to break text into
> LGSTRINGs? I'd like to see such code for both Arabic and Hebrew,
> unless it's the same code.

A not-yet-shaped LGSTRING is created by autocmp_chars
(composite.c) from a character sequence matching with a
regular expression PATTERN stored in a
composition-function-table. This pattern is
"[\u0600-\u06FF]+" for Arabic (lisp/language/misc-lang.el),
and a more complicated regex for Hebrew
(lisp/language/hebrew.el).
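
As a rough illustration of what the Arabic pattern selects, the following C sketch finds the maximal run of codepoints in the range U+0600..U+06FF starting at a given index. It is only a model of the regular expression quoted above, not the code in composite.c; the function name arabic_run_length is hypothetical, and the font check described below is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   TEXT holds N Unicode codepoints.  Return the length of the maximal
   run of codepoints in U+0600..U+06FF that begins at index START, or 0
   if TEXT[START] is outside that range.  */
size_t
arabic_run_length (const uint32_t *text, size_t n, size_t start)
{
  size_t i = start;

  while (i < n && text[i] >= 0x0600 && text[i] <= 0x06FF)
    i++;
  return i - start;
}
```

With the buffer "Arabic " followed by the seven codepoints 0x627, 0x644, 0x633, 0x651, 0x644, 0x627, 0x645, the run starting at index 7 has length 7, i.e. the whole word would become one candidate chunk in this model.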

> For example, can characters like digits or other neutrals be included
> in the same LGSTRING with Arabic and Hebrew? Or will an LGSTRING
> always include characters from one script only?

LGSTRING always includes characters of the same font. So,
even if you wrote PATTERN to include the other neutrals, if
a user's font setting (or environment) decides to use a
different font for those neutrals, they are not included in
LGSTRING. By default, Emacs tries to use the same font for
characters in the same script.

In addition, even if you set up fonts to use the same font
for, for instance, Hebrew and those neutrals, the "shape" method
of a font-backend may not support them. In that case, the
composition fails anyway.

> I'm asking because it's possible that we will need to modify
> w32uniscribe.c to reorder R2L characters before we pass them to the
> Uniscribe ScriptShape API, to let it see the characters in the logical
> order it expects them. That's if it turns out that Uniscribe cannot
> otherwise shape them correctly.

??? Currently characters and glyphs in LGSTRING are always
in logical order. A "shape" method should also shape that
LGSTRING in logical order.

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-09-02 14:04:45 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Thu, 02 Sep 2010 22:01:07 +0900
>
> A not-yet-shaped LGSTRING is created by autocmp_chars
> (composite.c) from a character sequence matching with a
> regular expression PATTERN stored in a
> composition-function-table. This pattern is
> "[\u0600-\u06FF]+" for Arabic (lisp/language/misc-lang.el),
> and a more complicated regex for Hebrew
> (lisp/language/hebrew.el).

Thanks. So character compositions are used not only to compose
several characters into one glyph, but also to break text into
individually shaped chunks, is that right?

If so, auto-composition-mode cannot be turned off for scripts that
need this kind of "grouped shaping" without degrading the presentation
of these scripts to the point of illegibility?

> > I'm asking because it's possible that we will need to modify
> > w32uniscribe.c to reorder R2L characters before we pass them to the
> > Uniscribe ScriptShape API, to let it see the characters in the logical
> > order it expects them. That's if it turns out that Uniscribe cannot
> > otherwise shape them correctly.
>
> ??? Currently characters and glyphs in LGSTRING are always
> in logical order.

See my mail from yesterday, where I describe that I see in GDB that
Arabic characters in LGSTRINGs arrive to uniscribe_shape in visual
order:

http://lists.gnu.org/archive/html/emacs-devel/2010-09/msg00029.html

That is why I asked the question in the first place. What am I
missing?

Kenichi Handa

2010-09-03 01:00:02 UTC

In article <E1OrAPF-0000Gn-***@fencepost.gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> > A not-yet-shaped LGSTRING is created by autocmp_chars
> > (composite.c) from a character sequence matching with a
> > regular expression PATTERN stored in a
> > composition-function-table. This pattern is
> > "[\u0600-\u06FF]+" for Arabic (lisp/language/misc-lang.el),
> > and a more complicated regex for Hebrew
> > (lisp/language/hebrew.el).

> Thanks. So character compositions are used not only to compose
> several characters into one glyph, but also to break text into
> individually shaped chunks, is that right?

Yes.

> If so, auto-composition-mode cannot be turned off for scripts that
> need this kind of "grouped shaping" without degrading the presentation
> of these scripts to the point of illegibility?

Yes. And auto-composition-mode cannot be turned off for any
scripts that it is not enough to display glyphs
corresponding to characters; they are all Indics, some East
Asians, Arabic, Hebrew, etc. In this respect, Arabic is not
special. Even for some Indics, LGSTRING may contain
multiple grapheme clusters.

> > > I'm asking because it's possible that we will need to modify
> > > w32uniscribe.c to reorder R2L characters before we pass them to the
> > > Uniscribe ScriptShape API, to let it see the characters in the logical
> > > order it expects them. That's if it turns out that Uniscribe cannot
> > > otherwise shape them correctly.
> >
> > ??? Currently characters and glyphs in LGSTRING are always
> > in logical order.

> See my mail from yesterday, where I describe that I see in GDB that
> Arabic characters in LGSTRINGs arrive to uniscribe_shape in visual
> order:

> http://lists.gnu.org/archive/html/emacs-devel/2010-09/msg00029.html

In this mail, you wrote:

> Also, it looks like uniscribe_shape is repeatedly called from
> font-shape-gstring to shape the same text that is progressively
> shortened. For example, the first call will be with a 7-character
> string whose contents is

> {0x627, 0x644, 0x633, 0x651, 0x644, 0x627, 0x645}

and this character sequence is surely in logical order. So
I don't know why you think uniscribe_shape is given a
LGSTRING of visual order.

> The next call is with a 6-character string whose contents is

> {0x627, 0x644, 0x633, 0x651, 0x644, 0x627}

> then a 5-character string {0x627, 0x644, 0x633, 0x651, 0x644}, etc.

> Note that the first 7-character string is the first word of the Arabic
> greeting, properly bidi-reordered for display.

> Are these series of calls expected?

No. I don't know why that happens on Windows. On Ubuntu,
when I visit a file that contains only these lines:
------------------------------------------------------------
Arabic السّلام
;;; Local Variables:
;;; bidi-display-reordering: t
;;; End:
------------------------------------------------------------
font-shape-gstring is called just once.

As the lgstring is getting shorter each time, it seems that
composition fails each time.

autocmp_chars is mainly called from composition_reseat_it.
Could you please trace the code after the first call of
autocmp_chars, and find out why Emacs decides that a
composition fails?

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-09-03 09:16:44 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Fri, 03 Sep 2010 10:00:02 +0900
>
> > If so, auto-composition-mode cannot be turned off for scripts that
> > need this kind of "grouped shaping" without degrading the presentation
> > of these scripts to the point of illegibility?
>
> Yes. And auto-composition-mode cannot be turned off for any
> scripts that it is not enough to display glyphs
> corresponding to characters; they are all Indics, some East
> Asians, Arabic, Hebrew, etc.

Are you sure Hebrew belongs to this list? What Hebrew characters need
to be shaped together, but still displayed as separate glyphs (as
opposed to the diacriticals which are composed into the same glyph
with the base character)?

> > The next call is with a 6-character string whose contents is
>
> > {0x627, 0x644, 0x633, 0x651, 0x644, 0x627}
>
> > then a 5-character string {0x627, 0x644, 0x633, 0x651, 0x644}, etc.
>
> As the lgstring is getting shorter each time, it seems that
> composition fails each time.
>
> autocmp_chars is mainly called from composition_reseat_it.
> Could you please trace the code after the first call of
> autocmp_chars, and find out why Emacs decides that a
> composition fails?

Will do.

Thanks.

David Kastrup

2010-09-03 10:18:11 UTC

Eli Zaretskii <***@gnu.org> writes:

>> From: Kenichi Handa <***@m17n.org>
>> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
>> Date: Fri, 03 Sep 2010 10:00:02 +0900
>>
>> > If so, auto-composition-mode cannot be turned off for scripts that
>> > need this kind of "grouped shaping" without degrading the presentation
>> > of these scripts to the point of illegibility?
>>
>> Yes. And auto-composition-mode cannot be turned off for any
>> scripts that it is not enough to display glyphs
>> corresponding to characters; they are all Indics, some East
>> Asians, Arabic, Hebrew, etc.
>
> Are you sure Hebrew belongs to this list? What Hebrew characters need
> to be shaped together, but still displayed as separate glyphs (as
> opposed to the diacriticals which are composed into the same glyph
> with the base character)?

I'd think that the letter combinations tsvey vovn וו and tsvey yudn יי
(in Yiddish likely represented with their own characters װ and ײ, also
of interest ױ) might call for common shaping in more sophisticated
fonts.

But I have no actual clue.

--
David Kastrup

Kenichi Handa

2010-09-03 11:08:55 UTC

In article <***@gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> > Yes. And auto-composition-mode cannot be turned off for any
> > scripts that it is not enough to display glyphs
> > corresponding to characters; they are all Indics, some East
> > Asians, Arabic, Hebrew, etc.

> Are you sure Hebrew belongs to this list? What Hebrew characters need
> to be shaped together, but still displayed as separate glyphs (as
> opposed to the diacriticals which are composed into the same glyph
> with the base character)?

??? I didn't write such a thing. What I listed are scripts
"that it is not enough to display glyphs corresponding to
characters". More precisely, "... that it is not enough to
display glyphs corresponding to characters at normal
positions suggested by each glyph metrics.".

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-09-03 14:54:21 UTC

> From: Kenichi Handa <***@m17n.org>
> Date: Fri, 03 Sep 2010 20:08:55 +0900
> Cc: emacs-***@gnu.org, emacs-***@gnu.org, ***@gnu.org
>
> In article <***@gnu.org>, Eli Zaretskii <***@gnu.org> writes:
>
> > > Yes. And auto-composition-mode cannot be turned off for any
> > > scripts that it is not enough to display glyphs
> > > corresponding to characters; they are all Indics, some East
> > > Asians, Arabic, Hebrew, etc.
>
> > Are you sure Hebrew belongs to this list? What Hebrew characters need
> > to be shaped together, but still displayed as separate glyphs (as
> > opposed to the diacriticals which are composed into the same glyph
> > with the base character)?
>
> ??? I didn't write such a thing. What I listed are scripts
> "that it is not enough to display glyphs corresponding to
> characters". More precisely, "... that it is not enough to
> display glyphs corresponding to characters at normal
> positions suggested by each glyph metrics.".

Sorry for my misunderstanding.

Eli Zaretskii

2010-09-03 13:25:49 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Fri, 03 Sep 2010 10:00:02 +0900
>
> > > > I'm asking because it's possible that we will need to modify
> > > > w32uniscribe.c to reorder R2L characters before we pass them to the
> > > > Uniscribe ScriptShape API, to let it see the characters in the logical
> > > > order it expects them. That's if it turns out that Uniscribe cannot
> > > > otherwise shape them correctly.
> > >
> > > ??? Currently characters and glyphs in LGSTRING are always
> > > in logical order.
>
> > See my mail from yesterday, where I describe that I see in GDB that
> > Arabic characters in LGSTRINGs arrive to uniscribe_shape in visual
> > order:
>
> > http://lists.gnu.org/archive/html/emacs-devel/2010-09/msg00029.html
>
> In this mail, you wrote:
>
> > Also, it looks like uniscribe_shape is repeatedly called from
> > font-shape-gstring to shape the same text that is progressively
> > shortened. For example, the first call will be with a 7-character
> > string whose contents is
>
> > {0x627, 0x644, 0x633, 0x651, 0x644, 0x627, 0x645}
>
> and this character sequence is surely in logical order. So
> I don't know why you think uniscribe_shape is given a
> LGSTRING of visual order.

Sorry, you are right. I got fooled by the fact that the end of the
string is almost a mirror image of its beginning.

There's something I'm missing in how character compositions and font
shaping work together with bidi reordering. I need to understand that
to figure out what, if anything, needs to be fixed in uniscribe_shape
to get it to work correctly.

So let me describe how the bidi reordering works and my understanding
of how it interacts with character compositions, and ask you to
correct any inaccuracies and fill in the blanks. Thanks in advance.

There are two use-cases that bidi reordering supports. The first one
is reordering in left-to-right paragraphs, containing mostly L2R text
with embedded R2L characters. I will call this "the L2R paragraph"
case.

The other use-case is reordering in right-to-left paragraphs, which
typically almost entirely consist of R2L characters with embedded L2R
letters, digits, and other characters that are displayed left to
right. I call this "the R2L paragraph" case.

For L2R paragraphs, runs of R2L characters are delivered in reverse
order (ignoring for the moment complications caused by directional
override control characters). When the bidi iterator bumps into an
R2L character, it scans forward until the end of the run, then begins
to go back delivering the characters, thus reversing them on display.
When the run of R2L characters is exhausted, the iterator jumps to the
end of the run and resumes its normal forward scan.

For R2L paragraphs, runs of R2L characters are delivered in their
buffer's logical order, without reversing them. L2R characters in
such paragraphs _are_ reversed, by the same process of scanning
forward past them, then delivering them back to front. This produces
a mirror image of the line as it should be displayed, wherein the
character to be displayed the rightmost is the first glyph we produce.
To mirror the line into its correct order, the PRODUCE_GLYPHS macro,
which calls the produce_glyphs method of the terminal-specific
redisplay interface, _prepends_ each new glyph to those already
produced for the glyph row, rather than appending them in the L2R
paragraph case. To illustrate, if we have a buffer with the following
contents (capital letters represent R2L characters):

ABCD foo

then the bidi iterator will produce the characters in this order:

ABCD oof

and then PRODUCE_GLYPHS will mirror them into

foo DCBA

which is the correct visual order.

Note that in both cases, the glyph row generated by the above
procedure is drawn from left to right by the terminal-specific method
that delivers glyphs to the glass. That method draws glyphs one by
one in the order they are stored in the glyph row. No reordering
happens on this level, and in fact this level is totally ignorant
about the text directionality.
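
The two cases can be modelled with a small C sketch. This is a toy model under loud assumptions, not the Emacs bidi iterator: uppercase ASCII letters stand for R2L characters (as in the example above), lowercase letters are L2R, and every other character is treated as a neutral that simply takes the paragraph direction, which is a simplification of UAX#9. The names is_r2l, iterator_order and glyph_row are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy classification: uppercase = R2L, lowercase = L2R, anything else
   is a neutral that takes the paragraph direction (an assumption).  */
static int
is_r2l (char c, int r2l_paragraph)
{
  if (isupper ((unsigned char) c))
    return 1;
  if (islower ((unsigned char) c))
    return 0;
  return r2l_paragraph;
}

/* Store in OUT the order in which characters of BUF are delivered:
   runs whose direction is opposite to the paragraph direction are
   scanned forward to their end and then delivered back to front.
   OUT must have room for strlen (BUF) + 1 bytes.  */
void
iterator_order (const char *buf, int r2l_paragraph, char *out)
{
  size_t n = strlen (buf), i = 0, k = 0;

  while (i < n)
    {
      if (is_r2l (buf[i], r2l_paragraph) == r2l_paragraph)
        out[k++] = buf[i++];
      else
        {
          size_t end = i;

          while (end < n && is_r2l (buf[end], r2l_paragraph) != r2l_paragraph)
            end++;
          for (size_t j = end; j > i; j--)
            out[k++] = buf[j - 1];
          i = end;
        }
    }
  out[k] = '\0';
}

/* Store in OUT the glyph row as drawn left to right: glyphs are
   appended in L2R paragraphs and prepended in R2L paragraphs, which
   amounts to mirroring the delivered sequence.  */
void
glyph_row (const char *buf, int r2l_paragraph, char *out)
{
  iterator_order (buf, r2l_paragraph, out);
  if (r2l_paragraph)
    for (size_t a = 0, b = strlen (out); a + 1 < b; a++, b--)
      {
        char tmp = out[a];

        out[a] = out[b - 1];
        out[b - 1] = tmp;
      }
}
```

In this model, calling iterator_order on "ABCD foo" as an R2L paragraph gives "ABCD oof", and glyph_row then gives "foo DCBA", matching the example above.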

Enter character compositions.

During the buffer scan that delivers characters to PRODUCE_GLYPHS, if
the next character to be delivered is a composed character, then
composition_reseat_it and next_element_from_composition are called.
If they succeed in composing the character with one or more following
characters, the whole sequence of characters that were composed is
recorded in the glyph row as a single element of type IT_COMPOSITION.
This single element is expanded into the actual font glyphs when the
glyph row is drawn by the terminal-specific draw_glyphs method. The
bidi reordering treats this single element as if it were a single
glyph, and thus does not reorder its font glyphs. So this single
element winds up in the glyph row in the position corresponding to the
first character of the composed sequence.

The question is: in what order should the font glyphs be held in the
LGSTRING returned by the font driver's `shape' method? Let's take an
example. Suppose we have a L2R paragraph in a buffer with this
contents:

foobar ABCDE

and suppose that "ABCDE" will be shaped by the font driver's `shape'
method into a logical-order sequence of glyphs "XYZ". Since this is a
L2R paragraph, and since no reordering will happen to "XYZ" when it is
delivered to the glass, it must be stored in the LGSTRING in the
visual order, i.e. "ZYX", with X being the first character to be read
and the rightmost to display, Y the second, etc.

Now suppose we have a R2L paragraph:

ABCDE foobar

The mirroring of the glyph row in PRODUCE_GLYPHS will now produce

foobar XYZ

because it treats "XYZ" as a single element. Again, no reordering
will happen to "XYZ" when it is drawn on the terminal. So again, we
need "XYZ" to be stored in visual order, i.e. "ZYX".

You say that the contents of LGSTRING passed to the `shape' method are
in logical order. The conclusion from the above seems to be that we
need to have the `shape' method reorder the shaped glyphs into visual
order. Is that what happens with the libotf driver? Does it indeed
reorder R2L glyphs it returns after reshaping? If not, how does a
reshaped sequence of glyphs wind up correctly on display?

Even if everything I said above is correct, there are complications.
ABCDE could be inside an embedding with left to right override, like
this:

foobar RLO ABCDE PDF

This should be displayed as

foobar ABCDE

i.e., "ABCDE" is not reordered, but displayed in the logical order, as
forced by RLO. Therefore, the reshaped "XYZ" should also be displayed
left to right:

foobar XYZ

But, if I understand correctly how composition works, the
auto-composed sequence in this case will still be just "XYZ", without
the RLO and PDF control characters. So the `shape' method of the font
driver will still see just "XYZ" in the LGSTRING, without the control
characters, and will reorder "XYZ", which is incorrect.

If we need the `shape' method to reorder glyphs, then in order for it
to do its job correctly, we need to give it the entire bidi context of
the string we are asking it to reshape. In the above example, we need
to tell it about the override directive, i.e. pass it "ABCDE" with
surrounding RLO and PDF controls. This flies in the face of the
current design, which separates reordering from glyph shaping.

So the conclusion is that we need the `shape' method to return the
reshaped glyphs in the logical order, and then reorder them
afterwards. If this is correct, we need to make 2 changes:

. change the interface to the `shape' method, so that the reshaped
LGSTRING holds glyphs in the logical order

. modify fill_gstring_glyph_string to reorder glyphs when it puts
them into a glyph_string structure

Am I missing something?

Amit Aronovitch

2010-09-03 14:32:33 UTC

On Fri, Sep 3, 2010 at 4:25 PM, Eli Zaretskii <***@gnu.org> wrote:

<- Snipped text ->

> Even if everything I said above is correct, there are complications.
> ABCDE could be inside an embedding with left to right override, like
> this:
>
> foobar RLO ABCDE PDF
>
> This should be displayed as
>
> foobar ABCDE
>
> i.e., "ABCDE" is not reordered, but displayed in the logical order, as
> forced by RLO. Therefore, the reshaped "XYZ" should also be displayed
> left to right:
>
> foobar XYZ
>
> But, if I understand correctly how composition works, the
> auto-composed sequence in this case will still be just "XYZ", without
> the RLO and PDF control characters. So the `shape' method of the font
> driver will still see just "XYZ" in the LGSTRING, without the control
> characters, and will reorder "XYZ", which is incorrect.
>
> If we need the `shape' method to reorder glyphs, then in order for it
> to do its job correctly, we need to give it the entire bidi context of
> the string we are asking it to reshape. In the above example, we need
> to tell it about the override directive, i.e. pass it "ABCDE" with
> surrounding RLO and PDF controls. This flies in the face of the
> current design, which separates reordering from glyph shaping.
>
>
Is that a typo? I think you mean LRO (U+202D) rather than RLO in all cases
above.
(Just commenting, to avoid further misunderstandings).

Amit

Eli Zaretskii

2010-09-03 14:43:37 UTC

> Date: Fri, 3 Sep 2010 17:32:33 +0300
> From: Amit Aronovitch <***@gmail.com>
> Cc: Kenichi Handa <***@m17n.org>, emacs-***@gnu.org, emacs-***@gnu.org, ***@gnu.org
>
> Is that a typo? I think you mean LRO (U+202D) rather than RLO in all cases
> above.

Yes, sorry. I meant LRO.

Eli Zaretskii

2010-09-04 07:13:47 UTC

> Date: Fri, 03 Sep 2010 16:25:49 +0300
> From: Eli Zaretskii <***@gnu.org>
> Cc: emacs-***@gnu.org, emacs-***@gnu.org, ***@gnu.org
>
> Am I missing something?

I think I found what I was missing. This part:

> During the buffer scan that delivers characters to PRODUCE_GLYPHS, if
> the next character to be delivered is a composed character, then
> composition_reseat_it and next_element_from_composition are called.
> If they succeed in composing the character with one or more following
> characters, the whole sequence of characters that were composed is
> recorded in the glyph row as a single element of type IT_COMPOSITION.
> This single element is expanded into the actual font glyphs when the
> glyph row is drawn by the terminal-specific draw_glyphs method. The
> bidi reordering treats this single element as if it were a single
> glyph, and thus does not reorder its font glyphs. So this single
> element winds up in the glyph row in the position corresponding to the
> first character of the composed sequence.

is inaccurate, and therefore leads to incorrect conclusions. A
(hopefully) more correct description is this:

During the buffer scan that delivers characters to PRODUCE_GLYPHS,
if the next character to be delivered is a composed character, then
composition_reseat_it and next_element_from_composition are called.
If they succeed in composing the character with one or more following
characters, the whole sequence of characters that were composed is
recorded in the `struct composition_it' object that is part of the
buffer iterator. The composed sequence could produce one or more
font glyphs (called "grapheme clusters") on the screen. Each of
these grapheme clusters is then delivered to PRODUCE_GLYPHS in the
direction corresponding to the current bidi scan direction. In
particular, if the bidi iterator currently scans the buffer
backwards, the grapheme clusters are delivered back to front. This
reorders the grapheme clusters as appropriate for the current bidi
context.

If this is correct, then the conclusion is that the font driver's
`shape' method should return the grapheme clusters in LGSTRING in
logical order; they will be reordered correctly by
next_element_from_composition, composition_reseat_it, and
set_iterator_to_next, as described above.

Did I get it right this time?

Kenichi Handa

2010-09-06 06:04:46 UTC

In article <***@gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> I think I found what I was missing. This part:
[...]
> is inaccurate, and therefore leads to incorrect conclusions. A
> (hopefully) more correct description is this:

> During the buffer scan that delivers characters to PRODUCE_GLYPHS,
> if the next character to be delivered is a composed character, then
> composition_reseat_it and next_element_from_composition are called.
> If they succeed in composing the character with one or more following
> characters, the whole sequence of characters that were composed is
> recorded in the `struct composition_it' object that is part of the
> buffer iterator. The composed sequence could produce one or more
> font glyphs (called "grapheme clusters") on the screen. Each of
> these grapheme clusters is then delivered to PRODUCE_GLYPHS in the
> direction corresponding to the current bidi scan direction. In
> particular, if the bidi iterator currently scans the buffer
> backwards, the grapheme clusters are delivered back to front. This
> reorders the grapheme clusters as appropriate for the current bidi
> context.

> If this is correct,

Yes.

> then the conclusion is that the font driver's
> `shape' method should return the grapheme clusters in LGSTRING in
> logical order; they will be reordered correctly by
> next_element_from_composition, composition_reseat_it, and
> set_iterator_to_next, as described above.

> Did I get it right this time?

Yes.

One additional comment. A grapheme cluster in LGSTRING may
contain multiple glyphs, and the order of those glyphs
depends on a font backend, or even on a font, and are given
to `draw' method of a font backend without reordering. It's
the responsibility of 'shape' method to produce those glyphs
in the order that 'draw' method expects.

For instance, if LGSTRING has these LGLYPHS in this order:

G0: (glyph for a char at position N)
G1: (first glyph for chars at position N+1 to N+2)
G2: (second glyph for chars at position N+1 to N+2)
G3: (first glyph for chars at position N+3 to N+4)
G4: (second glyph for chars at position N+3 to N+4)
G5: (third glyph for chars at position N+3 to N+4)
G6: (glyph for a char at position N+5)

and we are producing glyphs backward, 'draw' method is given
glyphs in this order:

glyphs from G6 to G6 (both inclusive)
glyphs from G3 to G5 (both inclusive)
glyphs from G1 to G2 (both inclusive)
glyphs from G0 to G0 (both inclusive)
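
The backward delivery above can be sketched in C as follows. This is only an illustration under stated assumptions: struct toy_glyph is a hypothetical stand-in for an LGLYPH, carrying just the values that LGLYPH_FROM and LGLYPH_TO would return, and clusters_backward is a hypothetical name, not a function in composite.c. Glyphs are grouped into one cluster when their FROM and TO values agree, and the clusters are visited from last to first while the glyphs inside each cluster keep their order.

```c
#include <stddef.h>

/* Hypothetical stand-in for an LGLYPH: only FROM and TO are modelled.  */
struct toy_glyph { int from, to; };

/* Inclusive range of glyph indices forming one grapheme cluster.  */
struct glyph_range { size_t first, last; };

/* G holds N glyphs in logical order.  Store in OUT, which must have
   room for N entries, the clusters in the order they would be handed
   over when producing glyphs backward.  Return the cluster count.  */
size_t
clusters_backward (const struct toy_glyph *g, size_t n,
                   struct glyph_range *out)
{
  size_t count = 0, end = n;

  while (end > 0)
    {
      size_t start = end - 1;

      /* Extend the cluster leftward over glyphs with equal FROM/TO.  */
      while (start > 0
             && g[start - 1].from == g[end - 1].from
             && g[start - 1].to == g[end - 1].to)
        start--;
      out[count].first = start;
      out[count].last = end - 1;
      count++;
      end = start;
    }
  return count;
}
```

For the seven glyphs G0..G6 in the example, with N taken as 0, this yields the ranges 6-6, 3-5, 1-2 and 0-0, in that order.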

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-09-04 15:29:13 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Fri, 03 Sep 2010 10:00:02 +0900
>
> As the lgstring is getting shorter each time, it seems that
> composition fails each time.

No, the composition was succeeding. The problem was that
uniscribe_shape tried to reorder the grapheme clusters returned by
ScriptShape and ScriptPlace, and the result was that the FROM and TO
members of the LGSTRING object were not as set_iterator_to_next
expected. This caused the iterator to fail to skip the characters
that were already composed; it would instead move only one character
ahead.

Thanks to all the hints and useful information in this thread, I think
I succeeded in fixing the code in uniscribe_shape, so now the display of
Arabic looks okay to me. Arabic input also seems to work; at least
Emacs no longer crashes. People who actually speak Arabic please
check the latest development code to see that it indeed works
correctly.

Eli Zaretskii

2010-09-02 14:49:43 UTC

> From: Jason Rumney <***@gnu.org>
> Cc: Kenichi Handa <***@m17n.org>, emacs-***@gnu.org, emacs-***@gnu.org
> Date: Thu, 02 Sep 2010 21:48:29 +0800
>
> Eli Zaretskii <***@gnu.org> writes:
>
> > No, not AFAIK. We call the ScriptItemize API of Uniscribe with NULL
> > as the 4th and 5th arguments, which AFAIU should disable reordering.
> > Perhaps Jason could chime in and tell if I'm right here.
>
> The documentation seems to imply that, but it looks like items[i].a.fRTL
> is being set anyway according to how uniscribe thinks the direction
> should be.

My interpretation of this is that the fRTL flag is set according to
the explicit directionality of the character deduced solely from its
codepoint, e.g. it is TRUE for Hebrew and Arabic letters and FALSE for
the rest. By contrast, a "full Unicode bidirectional analysis" that
ScriptItemize is advertised to perform when these arguments are
non-NULL includes the full implementation of UAX#9, under which
embeddings and implicit levels can affect the fRTL flag for characters
whose inherent attributes would say otherwise.

But that's a guess; the MS documentation is not very explicit on this,
to say the least.

> As well as removing the code that takes notice of the rtl flag and tries
> to reverse the output, you will probably have to set
> items[i].a.fLogicalOrder to 1 before calling ScriptShape to ensure
> logical order output from ScriptShape.

Right, thanks for the hint. However, given what Handa-san wrote, I'm
now utterly confused regarding the issue of ordering between Emacs and
Uniscribe.

m***@free.fr

2010-08-31 01:41:56 UTC

Dear all,
I have tried the MS-Windows (XP) build of 30/08.
The hello message is not correctly displayed: it is written from left to right.
Moreover, Emacs crashes when I try to write in Arabic.
I'm very sorry I couldn't do more tests.

----- Original Message -----
From: "Amit Aronovitch" <***@gmail.com>
To: "Kenichi Handa" <***@m17n.org>
Cc: emacs-***@gnu.org, emacs-***@gnu.org
Sent: Monday, 30 August 2010 16:11:06 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: [emacs-bidi] Re: Arabic support

On Mon, Aug 30, 2010 at 4:42 PM, Amit Aronovitch < ***@gmail.com > wrote:

On Mon, Aug 30, 2010 at 5:07 AM, Kenichi Handa < ***@m17n.org > wrote:

In article <AANLkTinFrEnuW=oPeBqg6= wYegbrR+***@mail.gmail.com >, Amit Aronovitch < ***@gmail.com > writes:

> 1) I confirm that Arabic shaping seems to work fine on my build (27/8/10
> rev. 101200, on Linux+X (Debian unstable)).

> 2) Logical movement with C-f/C-b in the hello file seems fine (I do not see
> the trap described above).

Thank you for testing them.

> 3) My Arabic is very basic, and I am not familiar with Arabic computing
> (keyboards etc.) - I noticed the following points, but I am not sure what is
> the expected behavior (I can only compare to other programs - gedit in this
> case):

> a) Column numbers (column-number-mode) behave strangely (I suspect that
> m17n-lib's invisible markup consume column numbers). For example as you move
> using C-f in the word "هذا" column numbers go through "0,1,4,5" (i.e. the
> second character takes up 3 columns). If I change that to "بهذا", the column
> positions are "0,1,4,6,7" (the second and third chars take up 3 and 2
> columns resp.?).
> In gedit column positions are 1 character per column and do not depend on
> the shaping.

I've just committed a fix for this bug. It's not related to
m17n-lib.

Thanks. Much better now :-)

I also checked the diacritics (tashkil): it seems that they do not take up a column number in Emacs.

In gedit, cursor movement is similar (as in Emacs, moving forwards/backwards skips the vowels, while 'delete' handles them separately), but the vowels there do take up a column number. I find this behavior more consistent with the way both programs handle the lam-alef ligature (one cursor-movement space, but two column numbers).
However, as I said, I do not know which behavior is the most natural for Arabic users.

Checking the *Hebrew* diacritics (nikkud), I noticed a problem:
In some cases the diacritics are displayed in the wrong position (their "real" cursor position is correct, which makes the UI *very* confusing). E.g. if you type "‫ עָלֵינוּ‬", the Qamatz (first vowel) appears under the space instead of under the Ain (first letter). If you remove the space, the Qamatz does not appear at all. The Zeire (second vowel) appears under the Ain (first letter) instead of the Lamed (second letter). However, the Shuruk sticks to the Vav (last letter) as it should (though the positioning is too close and too high IMHO).
I do not know if this issue is specific to my build.
My complete config.log is available here:

http://dl.dropbox.com/u/6960989/dumps/config.log

AA

Eli Zaretskii

2010-08-31 16:51:27 UTC

> Date: Tue, 31 Aug 2010 03:41:56 +0200 (CEST)
> From: ***@free.fr
> Cc: emacs-***@gnu.org, Kenichi Handa <***@m17n.org>, emacs-***@gnu.org
>
> I have tried the MS-Windows (XP) version of 30/08.
> The hello message is not correctly displayed, it is written from left to right,

Thanks for testing. Could you please send an image of how the current
greeting should be displayed in correct visual order on the screen?
It's possible that the order of the characters in the HELLO file is
incorrect, and that's one reason why it is displayed incorrectly.

Thanks.

Thamer Mahmoud

2010-09-06 13:45:00 UTC

>> From: Kenichi Handa <***@m17n.org>
>> Date: Thu, 26 Aug 2010 10:10:05 +0900
>>
>> I've just committed changes to trunk for Arabic shaping. If
>> there're any Arabic users in this list, please check the
>> displaying of Arabic text. On GNU/Linux system, you must
>> compile Emacs with libotf and m17n-lib (configure script
>> should detect them automatically).
>

Thanks for working on this. Here is my take:

* Attached are two screenshots showing the Arabic line from the HELLO
file rendered by gedit and Emacs using the same font (Nazli-20 from
ttf-farsiweb). Notice that in Emacs not all fonts have their LAM and
ALIF properly replaced by the LAM-ALIF ligature. Also, the diacritic
(SHADDA) appears lower and less legible for the same font.

* The third attachment shows that when highlighting a region of an
Arabic word, the cursor at the edges of the visible selection "breaks"
the shaping and reshapes the characters around it into their isolated
form. This creates a wave-effect of moving characters with some
visible artifacts and bad indentation issues.

* While the cursor is at a composed character (e.g., SEEN+SHADDA),
pressing C-p moves point unexpectedly to the beginning of the current
line.

* I do at least see one "trap" with C-p, although it is hard to
reproduce. You can try moving 4 or 5 lines below the Arabic line in
the HELLO file, then move upward using 4-5 C-p and get the cursor at
the SEEN+SHADDA. After that, any further C-p jumps between SEEN and
LAM-ALIF, never going to the previous line.

* For those using Debian (Squeeze), I had to install not just the
libm17n and libm17n-dev packages, but also m17n-db. It seems that the
configure script doesn't detect or know about the status of (the
Debian-specific) m17n-db.

Thanks again,
Thamer

TAKAHASHI Naoto

2010-09-07 04:22:50 UTC

Thamer Mahmoud writes:

> * Attached are two screenshots showing the Arabic line from the HELLO
> file rendered by gedit and Emacs using the same font (Nazli-20 from
> ttf-farsiweb). Notice that in Emacs not all fonts have their LAM and
> ALIF properly replaced by the LAM-ALIF ligature. Also, the diacritic
> (SHADDA) appears lower and less legible for the same font.

This problem is caused by the combination of nazli.ttf, which has
rather unusual OTF tables, and a bug in m17n-db. I will try to find a
workaround.

--
TAKAHASHI Naoto
***@m17n.org

m***@free.fr

2010-09-07 00:59:48 UTC

I have tested the binaries on Windows XP, and here are some points:

- The hello message is OK.
- I tried some simple Arabic text and it seems to work as expected, at least for me (I'm not using tashkeel).
- The second remark of Mahmoud is not an issue for me; see the attached picture (coupure).
- I have problems with copy/paste of Arabic text, but I guess this may be a coding issue.

Best regards,
Mohamed


Eli Zaretskii

2010-09-07 03:03:07 UTC

> Date: Tue, 7 Sep 2010 02:59:48 +0200 (CEST)
> From: ***@free.fr
> Cc: emacs-***@gnu.org, emacs-***@gnu.org
>
> - I have problems with copy/paste arabic text but I guess this may be a coding issue.

Thanks for testing.

Would you please describe the problems you have with copy/paste?
There should be no problems with encoding on Windows, as Emacs uses
UTF-16 automatically to copy/paste text on Windows.

m***@free.fr

2010-09-07 03:34:04 UTC

Thanks, Eli, for all your efforts!

It is simple. When copying a text in Arabic (see pictures) and pasting it into the Emacs buffer,
the result is a series of "?" characters (see resu.jpg).
I tried decode-coding-region (with utf-8, utf-16, and latin-1) without success.


Eli Zaretskii

2010-09-07 04:39:53 UTC

> Date: Tue, 7 Sep 2010 05:34:04 +0200 (CEST)
> From: ***@free.fr
> Cc: emacs-***@gnu.org, emacs-***@gnu.org
>
> It is simple. When copying a text in arabic (see pictures) and
> pasting in the emacs buffer, the result is a series of "?"
> characters (see resu.jpg).

m***@free.fr

2010-09-07 15:08:01 UTC

You are right!

After invoking emacs -q I could do the following:

- copy from Emacs to another application (it worked).
- copy Arabic text from two different applications to Emacs:
it works correctly, except that the tashkeel seems lost when the source includes it.
But after verification, if I try to mark the region in question, the tashkeel appears :)

In my .emacs I found what may be the cause of my problem.

'(selection-coding-system (quote utf-8-dos))
'(unify-8859-on-decoding-mode t)
'(unify-8859-on-encoding-mode t)

Thanks


Eli Zaretskii

2010-09-13 06:40:14 UTC

> Date: Tue, 7 Sep 2010 17:08:01 +0200 (CEST)
> From: ***@free.fr
> Cc: emacs-***@gnu.org, emacs-***@gnu.org
>
> - copy Arabic text from two different applications to Emacs:
> it works correctly, except that the tashkeel seems lost when the source includes it.
> But after verification, if I try to mark the region in question, the tashkeel appears :)

It's probably some bad interaction between compositions and the
handling of faces in the bidirectional display. Perhaps Handa-san
could take a look at xdisp.c:handle_stop_backwards and how it is
called inside next_element_from_buffer -- there might be some bugs
there whereby only part of the composed sequence is redrawn when the
region is extended or contracted.

> In my .emacs I found what may be the cause of my problem.
>
>
> '(selection-coding-system (quote utf-8-dos))

This one is your problem: you should never do that on MS-Windows.

> '(unify-8859-on-decoding-mode t)
> '(unify-8859-on-encoding-mode t)

These are obsolete, as everything is always unified in Emacs 24.

Kenichi Handa

2010-09-16 02:07:18 UTC

In article <***@gnu.org>, Eli Zaretskii <***@gnu.org> writes:

> It's probably some bad interaction between compositions and the
> handling of faces in the bidirectional display.

I found that the problem is that the current composition for
Arabic requires that a whole word must be composed. So, if
there's a face change within a word, Arabic composition
function is given just a partial word, and that results in
the incorrect Arabic shaping. This is a difficult problem,
and I need some time to find a solution.

---
Kenichi Handa
***@m17n.org

Kenichi Handa

2010-09-22 03:54:26 UTC

In article <***@m17n.org>, Kenichi Handa <***@m17n.org> writes:

> In article <***@gnu.org>, Eli Zaretskii <***@gnu.org> writes:
> > It's probably some bad interaction between compositions and the
> > handling of faces in the bidirectional display.

> I found that the problem is that the current composition for
> Arabic requires that a whole word must be composed. So, if
> there's a face change within a word, Arabic composition
> function is given just a partial word, and that results in
> the incorrect Arabic shaping. This is a difficult problem,
> and I need some time to find a solution.

I've just installed a fix to trunk.

---
Kenichi Handa
***@m17n.org

Eli Zaretskii

2010-09-22 07:33:09 UTC

> From: Kenichi Handa <***@m17n.org>
> Cc: ***@gnu.org, emacs-***@gnu.org, ***@free.fr, emacs-***@gnu.org
> Date: Wed, 22 Sep 2010 12:54:26 +0900
>
> In article <***@m17n.org>, Kenichi Handa <***@m17n.org> writes:
>
> > In article <***@gnu.org>, Eli Zaretskii <***@gnu.org> writes:
> > > It's probably some bad interaction between compositions and the
> > > handling of faces in the bidirectional display.
>
> > I found that the problem is that the current composition for
> > Arabic requires that a whole word must be composed. So, if
> > there's a face change within a word, Arabic composition
> > function is given just a partial word, and that results in
> > the incorrect Arabic shaping. This is a difficult problem,
> > and I need some time to find a solution.
>
> I've just installed a fix to trunk.

Thanks.

Thamer Mahmoud

2010-09-22 12:27:54 UTC

Kenichi Handa <***@m17n.org> writes:
>> I found that the problem is that the current composition for
>> Arabic requires that a whole word must be composed. So, if
>> there's a face change within a word, Arabic composition
>> function is given just a partial word, and that results in
>> the incorrect Arabic shaping. This is a difficult problem,
>> and I need some time to find a solution.
>
> I've just installed a fix to trunk.

I can confirm that the issue with unshaped glyphs while highlighting
words is now fixed. Thanks.

However, long Arabic strings still have unshaped middle parts and bad
margin. See the attached screenshot which is the output of M-30-<BAA>
in an empty buffer.

Also the following code produces duplicate strings, compared to when
auto-composition-mode is off.

(let ()
(setq bidi-display-reordering t)
(insert "\n\n")
(insert "ÙÙÙØª")
(insert "ØšØšØšØšØšØšØšØšØšØšØšØšØšØšØšØšØšØšØš"))

Kenichi Handa

2010-09-27 05:56:29 UTC

In article <***@zemblan.newkuwait.org>, Thamer Mahmoud <***@gmail.com> writes:

> However, long Arabic strings still have unshaped middle parts and bad
> margin. See the attached screenshot which is the output of M-30-<BAA>
> in an empty buffer.

Ah, I found what is wrong. In "struct glyph", we now have
only 4 bits to store indices into an LGSTRING.

    struct
    {
      /* Flag to tell if the composition is automatic or not. */
      unsigned automatic : 1;
      /* ID of the composition. */
      unsigned id : 23;
      /* Start and end indices of glyphs of the composition. */
      unsigned from : 4;
      unsigned to : 4;
    } cmp;

So, we could handle at most 16 glyphs in one composition.
I've just installed a fix to remove that restriction
(theoretically we still have a restriction of at most
0x7FFFFFFF glyphs in one composition).

---
Kenichi Handa
***@m17n.org
