[emacs-bidi] Suboptimal display-reordering in minibuffer

Date: Sat, 26 Jun 2010 10:38:54 -0400
Suppose I'm typing Hebrew and have bidi-display-reordering set globally
(i.e., by setq-default). I type a Hebrew control character, say
control-bet, which has no command binding. I quite properly get the
message "control-bet is undefined" in the minibuffer. But because it
starts with a Hebrew character, it gets shoved against the right margin,
which is wrong.

Why is it wrong?

And isn't the intent to have it on by default eventually?

Yes. In fact, Stefan (one of the two head maintainers) already asked
me to do that, but I'm stalling ;-)

Since system minibuffer are always in English, maybe the minibuffer
should never be in display reorder mode.

What do you mean by ``minibuffer are always in English''? The
language and the paragraph direction are not necessarily related.

(defun forceLTR () (setq bidi-paragraph-direction 'left-to-right))

(add-hook 'minibuffer-setup-hook 'forceLTR)

but this doesn't work. Anybody know what I'm doing wrong?

I think this is because the minibuffer and the echo area are not the
same thing. They just use the same portion of the Emacs display. The
minibuffer is in use when Emacs prompts you for something, which is
not the case here.

Does it work to set bidi-display-reordering in two buffers named
" *Echo Area 0*" and " *Echo Area 1*"? (Note the blank at the
beginning of the names of these buffers: it is important.) You can
switch to these two buffers after entering Emacs, with the "C-x b"
command.

Another way to handle the case of an English paragraph that starts with
a Hebrew character is to insert an LRM. In this case that would need to
be done by the code that finds character bindings. I think that code
should indeed be sensitive to the fact that the unbound character it's
about to echo might set display direction. Perhaps this is better than
setting the minibuffer's mode as a whole. Or maybe both are necessary.

The main point here is deciding whether echo area messages should be
displayed with left-to-right paragraph direction forced on the display
engine. Are we sure this is the case? Cannot there be echo area
messages that we want to display with the right-to-left direction?

If/when the decision is made, it's very easy to make Emacs behave
either way.

Finally, an even more subtle (and unimportant) issue: The actual
message I see looks like this: "is undefined ^ב”. But I would have
expected "is undefined ב^”, no? Shouldn't control-bet be written with
the uparrow on the right when in RTL mode?

I don't know which one is the correct one. Do we have any "prior art"
in that some other applications display Ctrl-modified Hebrew
characters?

Again, once we decide what to do with that, it's a very simple matter
to make Emacs behave according to that decision.

Comments and opinions are welcome.

P.S. Thanks for starting these discussions. Sometimes I think that no
one is using the bidi features, which makes me wonder why I worked on
them so hard. Your and Amit's messages mean a lot to me.

Eli Zaretskii

2010-06-26 16:18:01 UTC

Date: Sat, 26 Jun 2010 18:20:15 +0300
Does it work to set bidi-display-reordering in two buffers named
" *Echo Area 0*" and " *Echo Area 1*"?

Sorry, I meant bidi-paragraph-direction, of course.

Larry Denenberg

2010-06-26 19:17:01 UTC

Suppose . . . I type a Hebrew control character, say control-bet,
which has no command binding. I quite properly get the message
"control-bet is undefined" in the minibuffer. But because it starts
with a Hebrew character, it gets shoved against the right margin,
which is wrong.

Why is it wrong?

I suppose I should be hesitant since I've been quite properly rebuked
for insufficiently reflective use of the words "quite properly" in
another thread, but I'll take a shot anyway: It's wrong because this is
an LTR sentence that happens to start with an RTL character, so the bidi
code comes to the wrong conclusion about directionality. If the message
were instead "Key control-bet is undefined" we'd agree that it's LTR
with a single inserted RTL, right? Just because the RTL character is at
the beginning of the sentence doesn't make the sentence RTL.

Aren't problems like this the entire raison d'etre of the invisible RLM
and LRM characters?

Since system minibuffer are always in English, maybe the minibuffer
should never be in display reorder mode.

What do you mean by ``minibuffer are always in English''? The
language and the paragraph direction are not necessarily related.

More precisely, I meant "the messages displayed by the minibuffer were
written in English with intended left-to-right logic; RTL characters in
these messages are implicitly quoted, carrying no semantic meaning".
Are there examples of minibuffer messages that we agree should be RTL?

Post by Eli Zaretskii
I think this is because the minibuffer and the echo area are not the
same thing. They just use the same portion of the Emacs display.

Absolutely correct. Learn something new every day.

Post by Eli Zaretskii
Does it work to set bidi-display-reordering in two buffers named
" *Echo Area 0*" and " *Echo Area 1*"?

Absolutely correct again! So now I can have it if I really want it.

Another way to handle the case of an English paragraph that starts with
a Hebrew character is to insert an LRM. In this case that would need to
be done by the code that finds character bindings. I think that code
should indeed be sensitive to the fact that the unbound character it's
about to echo might set display direction.

The main point here is deciding whether echo area messages should be
displayed with left-to-right paragraph direction forced on the display
engine. Are we sure this is the case? Cannot there be echo area
messages that we want to display with the right-to-left direction?

Maybe. I ask again, do we have an example? What I'm saying here is
that certain parts of Emacs should be more careful in the face of bidi.
When Emacs wants to write "X is undefined" in the echo area with X
variable, maybe it should carefully put an LRM before the X because of
new potential side effects. This is something like a web programmer
being super cautious about sanitizing values that users type in.
cf. http://xkcd.com/327/
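
The effect of a leading LRM can be illustrated outside Emacs. The sketch below is Python, not Emacs code, and only approximates the first-strong-character rule using the standard unicodedata module; it ignores explicit embedding controls. The helper names base_direction and undefined_key_message are hypothetical, chosen for this illustration only.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, bidi class "L"


def base_direction(text, default="LTR"):
    """Return "LTR" or "RTL" from the first strong character (simplified).

    Explicit embedding and isolate controls are ignored here; the full
    Unicode algorithm treats them specially.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


def undefined_key_message(key_description):
    """Hypothetical helper: build the message with a protective LRM."""
    return LRM + key_description + " is undefined"


bet = "\u05d1"  # HEBREW LETTER BET
plain = "^" + bet + " is undefined"
guarded = undefined_key_message("^" + bet)

print(base_direction(plain))    # caret is neutral, so bet decides
print(base_direction(guarded))  # LRM is the first strong character
```

Under these assumptions the unguarded message is classified as right-to-left and the guarded one as left-to-right, which is the behaviour the LRM suggestion is after.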

Finally, an even more subtle (and unimportant) issue: The actual
message I see looks like this: "is undefined ^ב”. But I would have
expected "is undefined ב^”, no? Shouldn't control-bet be written with
the uparrow on the right when in RTL mode?

I don't know which one is the correct one. Do we have any "prior art"
in that some other applications display Ctrl-modified Hebrew
characters?

Beats me. I just learned on the other thread that Windows may not even
admit the existence of such characters. I don't seem to be able to
insert them into a buffer. Probably they don't have Unicode codepoints.
Maybe they don't make sense at all. I will think about this further.

If there is such a thing as control-bet, then I think it should be
displayed as "ב^" in RTL text, and as "^ב" in LTR text.

Post by Eli Zaretskii
P.S. Thanks for starting these discussions. Sometimes I think that no
one is using the bidi features, which makes me wonder why I worked on
them so hard.

You worked on them for the joy of solving the problem, I hope. If you
think lots of people will be using them, I'm afraid you'll be sadly
disappointed.

I started using Emacs around 30 years ago, after grudgingly converting
from vi, which I grudgingly converted to from TECO. I still read mail
and write in Emacs---it's what I know. But I get mail in Hebrew and I
can't read it, nor answer without other tools. I've been waiting for
emacs bidi for years. I check around every few months, and only a
couple of weeks ago did I see that my wishes were finally fulfilled. It's
now a joy to read and answer mail. I'm very very grateful.

But I'm a dinosaur. Are there really any new emacs users these days?
I'd be very, very surprised.

/Larry Denenberg
***@denenberg.com
http://larry.denenberg.com/

Eli Zaretskii

2010-06-26 20:24:39 UTC

Date: Sat, 26 Jun 2010 15:17:01 -0400
It's wrong because this is an LTR sentence that happens to start
with an RTL character, so the bidi code comes to the wrong
conclusion about directionality.

The issue here is not why this particular sentence is rendered
incorrectly. The issue is whether we can safely force the echo area
messages to _always_ be rendered with left-to-right paragraph
direction. This is what you are suggesting, right?

Such general conclusions cannot be reached by looking at a single
isolated example.

Aren't problems like this the entire raison d'etre of the invisible RLM
and LRM characters?

There's no argument that we _can_ display this message with L2R
paragraph direction. The question is: should we?

Are there examples of minibuffer messages that we agree should be RTL?

That is the important question. If someone could go over at least
some of the myriad of calls to `message' in Emacs and see if they all
tend to be L2R, I would agree that we should by default force L2R
paragraph direction on the echo area.

Post by Eli Zaretskii
Does it work to set bidi-display-reordering in two buffers named
" *Echo Area 0*" and " *Echo Area 1*"?

Absolutely correct again! So now I can have it if I really want it.

Btw, there's something I overlooked before: why exactly is ^ב
considered a strong R2L character? Could you please go to it in the
" *Echo Area 0/1*" buffer, type "C-u C-x =", and show what Emacs tells
about that character?

If there is such a thing as control-bet, then I think it should be
displayed as "ב^" in RTL text, and as "^ב" in LTR text.

If this is the consensus, I'm okay with it.

I check around every few months, and only a couple of weeks ago did I
see that my wishes were finally fulfilled. It's now a joy to read
and answer mail. I'm very very grateful.

Thanks.

Larry Denenberg

2010-06-28 02:14:18 UTC

The issue is whether we can safely force the echo area messages to
_always_ be rendered with left-to-right paragraph direction. This is
what you are suggesting, right?

Well, I made several suggestions, of which this was one. Let me flesh
it out a little bit.

I think the best solution is this: Echo buffers and the minibuffer
should permit but not enforce bidirectionality, and anyone who writes to
them must be sensitive to this fact and appropriately careful. If you
want to echo "X is not defined" as an LTR sentence with X variable, it's
your job to be sure that X doesn't set the direction to something you
didn't intend.

The trouble is that zillions of messages were written without bidi in
mind, and (as we've seen) at least one doesn't do the right thing. So
what should we do? Check every message and fix all the offenders?
You first, Indy. In the meantime, what is a reasonable alternative?

And here I will stand by my suggestion for forcing LTR. The reasoning
is something like this: Emacs messages are written in English, which is
LTR. They may contain arbitrary text, but that text---even if displayed
RTL---is essentially in quotes and can't change the directionality.

But what is the language of a message that includes mixed Hebrew and
English words or letters?
Emacs allows you to mix several scripts (a.k.a. "languages") in the
same buffer, so it is no longer clear in what "language" the document
is written.

In my opinion, you are correct, but this fact is irrelevant to the
problem at hand. It may well be impossible to figure out the language
of a particular message by examining the text. But we're not trying to.
We know independently (or are trying to convince ourselves) that these
messages were written in English by English speakers with intended LTR
directionality.

If we accept that Emacs messages are intended as English, that's enough
to say that forcing LTR (to fix problems not foreseen by the writers) is
the right thing. Insofar as this problem is important enough to solve.

Here's another possibility. bidi-paragraph-direction is purely an Emacs
thing, right? It's not in the Unicode bidi standard. Is it absolute?
That is, can it be overridden by LRM or LRO characters? If so, we could
put these buffers in bidi mode with LTR default paragraph direction, but
anyone who really wants RTL can still force it. But I'm increasingly
skeptical that RTL is *ever* the right thing, unless you're writing a
completely new non-English Emacs. Can you give me an example of any
message in an English Emacs that should be RTL?

If someone could go over at least some of the myriad of calls to
`message' in Emacs and see if they all tend to be L2R, I would agree
that we should by default force L2R paragraph direction on the echo
area.

This is easy: Just look at your *Messages* buffer. Tell me if you see
anything RTL. Even "Wrote <filename>" with a long Hebrew filename is
LTR. You ask "if they all tend to be LTR"; I'm fairly convinced that
there isn't even a single one that's RTL.

Btw, there's something I overlooked before: why exactly is ^ב
considered a strong R2L character? Could you please go to it in the "
*Echo Area 0/1*" buffer, type "C-u C-x =", and show what Emacs tells
about that character?

First of all, I don't think your procedure works. You can make the
message appear, and with care you can get a cursor on top of it, but
typing C-u (or most anything else) changes the buffer contents---it's
not called the Echo Area for nothing! To get your hands on the
character you'd have to write a function that grabs the contents of the
buffer and bind it to a key, or in some other way avoid echoing.

But there's no point in trying. The buffer can't possibly contain an
actual ^ב. No buffer can. Buffers and strings can contain only those
characters encodable in 22 bits. If your input facilities permit, you
can prove this by typing ^Q ^ב; Emacs refuses to insert such a character
(Wrong type argument: char-or-string-p, 67110353).

Amit Aronovitch

2010-06-28 06:14:36 UTC

The issue is whether we can safely force the echo area messages to
_always_ be rendered with left-to-right paragraph direction. This is
what you are suggesting, right?

Well, I made several suggestions, of which this was one. Let me flesh
it out a little bit.
I think the best solution is this: Echo buffers and the minibuffer
should permit but not enforce bidirectionality, and anyone who writes to
them must be sensitive to this fact and appropriately careful. If you
want to echo "X is not defined" as an LTR sentence with X variable, it's
your job to be sure that X doesn't set the direction to something you
didn't intend.
The trouble is that zillions of messages were written without bidi in
mind, and (as we've seen) at least one doesn't do the right thing. So
what should we do? Check every message and fix all the offenders?
You first, Indy. In the meantime, what is a reasonable alternative?
And here I will stand by my suggestion for forcing LTR. The reasoning
is something like this: Emacs messages are written in English, which is
LTR. They may contain arbitrary text, but that text---even if displayed
RTL---is essentially in quotes and can't change the directionality.

My own suggestion was to set the direction according to the language in
LC_MESSAGES ("system-messages-locale" in emacs) if the proper translation is
installed, LTR otherwise. However, a quick search seems to indicate that at
the moment there is no i18n for emacs at all (there is a new i18n project,
but I did not find any code there:
http://savannah.nongnu.org/projects/emacs-i18n ).
Given the above, my suggestion becomes identical to yours: set
directionality to LTR in system messages, possibly leaving an option for the
user to force to RTL in specific messages.

[snipped some arguments, to which I fully agree]

Btw, there's something I overlooked before: why exactly is ^ב
considered a strong R2L character? Could you please go to it in the "
*Echo Area 0/1*" buffer, type "C-u C-x =", and show what Emacs tells
about that character?

How exactly did you get the ^x?
The echo messages that I see here look like C-x.
Also, I was able to get it in the minibuffer by using the interactive
global-set-key command. Seems like what was inserted in the buffer was
actually "C","-","א".

Post by Larry Denenberg
But there's no point in trying. The buffer can't possibly contain an
actual ^ב. No buffer can. Buffers and strings can contain only those
characters encodable in 22 bits. If your input facilities permit, you
can prove this by typing ^Q ^ב; Emacs refuses to insert such a character

Larry Denenberg

2010-06-28 11:43:31 UTC

How exactly did you get the ^x?
The echo messages that I see here look like C-x.

I set the Macintosh-wide input method to Hebrew (with apple-space, which
Emacs doesn't see at all) and then type control-a. And I see, against
the right margin: "is undefined א^".

Also, I was able to get it in the minibuffer by using the interactive
global-set-key command. Seems like what was inserted in the buffer was actually
"C","-","א".

I thought of this too, but it doesn't work for me. Interactive global
set key doesn't put anything in my minibuffer. I wonder why the
discrepancy.

I did better by doing global-set-key while defining a keyboard macro,
then doing edit-last-kbd-macro. The relevant line in that buffer is
C-\2720 ;; keyboard-quit
with all characters just standard visible ones.

Here's what I think is happening: The code that complains about
undefined characters handles uninsertable characters (things like ^ב and
meta-control-mouse-down) by translating them to visible representation.
So the message contains a real caret followed by ב. That is, the first
character has no strong directionality, and the directionality is set by
the second character, a non-control ב.
But then the reordering algorithm would have rendered it ב^ and not ^ב as you
saw.

Absolutely correct. But I do see ב^, not ^ב! And look at my first
paragraph above where I claim to see א^. I went back to my original
post, expecting to confirm this, and saw:

Finally, an even more subtle (and unimportant) issue: The actual
message I see looks like this: "is undefined ^ב”. But I would have
expected "is undefined ב^”, no? Shouldn't control-bet be written
with the uparrow on the right when in RTL mode?

So my report was incorrect. I can't reproduce it. I'm at a loss to
understand how I could have made such a mistake---I'd think I would have
checked such a subtlety a zillion times, especially when having the
audacity to complain about it. Complaint retracted. Apologies to all.

/Larry Denenberg
***@denenberg.com
http://larry.denenberg.com/

Larry Denenberg

2010-06-28 12:10:50 UTC

Post by Amit Aronovitch
Also, I was able to get it in the minibuffer by using the interactive
global-set-key command. Seems like what was inserted in the buffer was actually
"C","-","א".

I thought of this too, but it doesn't work for me. Interactive global
set key doesn't put anything in my minibuffer. I wonder why the
discrepancy.

Oops; just noticed my misreading. I did global-set-key and looked in
the *Messages* buffer, not the minibuffer. Of course global-set-key
puts stuff in the minibuffer.

/Larry Denenberg
***@denenberg.com
http://larry.denenberg.com/

Eli Zaretskii

2010-06-28 19:31:22 UTC

Date: Mon, 28 Jun 2010 09:14:36 +0300

Post by Eli Zaretskii
Btw, there's something I overlooked before: why exactly is ^ב
considered a strong R2L character? Could you please go to it in the "
*Echo Area 0/1*" buffer, type "C-u C-x =", and show what Emacs tells
about that character?

First of all, I don't think your procedure works. You can make the
message appear, and with care you can get a cursor on top of it, but
typing C-u (or most anything else) changes the buffer contents---it's
not called the Echo Area for nothing! To get your hands on the
character you'd have to write a function that grabs the contents of the
buffer and bind it to a key, or in some other way avoid echoing.

How exactly did you get the ^x?
The echo messages that I see here look like C-x.

That's exactly what bothers me.

maybe the handling of uninsertables is done AFTER reordering

No. Reordering is always _after_ any "handling". In Emacs, redisplay
is generally done only when Emacs is idle. And bidi reordering is
part of redisplay.

so from the POV of the reordering
algorithm it is considered a single character as Eli said.

Yes, but which character? What I want to know is, if it wasn't ב
followed or preceded by a caret, how come Emacs decided to render this
paragraph right-to-left? The bidirectional properties of characters
are derived from the UnicodeData.txt file, the Unicode database of
character properties, so no character that is not in that database can
ever be considered strong R. (In fact, any character not in that
database will cause Emacs to abort.)

Larry Denenberg

2010-06-28 23:07:16 UTC

What I want to know is, if it wasn't ב followed or preceded by a caret,
how come Emacs decided to render this paragraph right-to-left?

I'm saying that it *was* ב preceded by a caret.

Post by Larry Denenberg
But there's no point in trying. The buffer can't possibly contain an
actual ^ב. No buffer can. Buffers and strings can contain only those
characters encodable in 22 bits. If your input facilities permit, you
can prove this by typing ^Q ^ב; Emacs refuses to insert such a character
(Wrong type argument: char-or-string-p, 67110353).

[Character] ^ב is just another Emacs display feature, like ^C. Emacs
has special code in its display engine to produce such two-character
combinations to display an otherwise unprintable character as a string
that any terminal will show without any problem. But Emacs still knows
that these two characters stand for a single character, and "C-u C-x ="
will tell you which one.

Sorry for being unclear. Let me rephrase more precisely.

The character ^ב exists. It is a ב with the 27-th bit set, which is to
say, 1489 + 2^26 = 67110353. This character can't appear in any Emacs
string or buffer (cf. error message above).
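
The arithmetic can be checked with a few lines of Python (used here only as a calculator; this is not Emacs code). The names CTRL_BIT and MAX_CHAR are labels introduced for this sketch, and MAX_CHAR simply encodes the 22-bit limit stated above.

```python
BET = ord("\u05d1")          # HEBREW LETTER BET, code point 1489
CTRL_BIT = 1 << 26           # the 27th bit, counting from 1
MAX_CHAR = (1 << 22) - 1     # largest value that fits in 22 bits

control_bet = BET | CTRL_BIT

print(control_bet)               # 67110353, the number in the error message
print(control_bet > MAX_CHAR)    # True: too large to be a buffer character
print(control_bet & ~CTRL_BIT)   # 1489: stripping the bit recovers plain bet
```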

You are correct that sometimes Emacs carefully displays an unprintable
character with a multi-char combination, when there's really only a
single character in the buffer. That's not happening here. Here, Emacs
has a character that can't be inserted into a buffer. So it carefully
inserts a multi-char combination instead. But in this case, it's *not*
a single character, and C-u C-x = will tell us nothing. Said another
way: It's the inserter, not the displayer, that's being careful.

I apologized in a previous note for taking us all down this "display of
control-bet" red herring. Let me repeat the apology: I incorrectly
reported bad behavior, and indeed there is no issue. It doesn't matter
what the directionality of ^ב is, because the bidi routines, which are
responsible for displaying buffer contents, will never see such a
character. And since the buffer actually contains first caret then ב,
we automatically get the correct behavior in both RTL and LTR modes.

The only conceivable question is whether the ^ב that I see is wrong
because it should be C-ב. I promise to forget this if you will.

/Larry Denenberg
***@denenberg.com
http://larry.denenberg.com/

Eli Zaretskii

2010-06-29 03:05:55 UTC

Date: Mon, 28 Jun 2010 19:07:16 -0400
The character ^ב exists. It is a ב with the 27-th bit set, which is to
say, 1489 + 2^26 = 67110353.

If what Emacs saw was a ב with the 27-th bit set, it should have
displayed C-ב, not ב^ with the message that says it is undefined. At
least, that's my reading of the code.

Eli Zaretskii

2010-06-28 19:13:17 UTC

Date: Sun, 27 Jun 2010 22:14:18 -0400
bidi-paragraph-direction is purely an Emacs thing, right? It's not
in the Unicode bidi standard.

bidi-paragraph-direction is one of the Emacs-specific aspects of what
UAX#9 calls ``higher protocols'' for determining the base direction of
paragraphs.

Is it absolute? That is, can it be overridden by LRM or LRO
characters?

No, not currently. The code that determines base paragraph direction
looks at the value of bidi-paragraph-direction, and if that's non-nil,
it doesn't bother looking for the first strong directional character
in the paragraph.

But this is Emacs: Lisp code that wants to override the default value
of bidi-paragraph-direction can always let-bind it to any value it
wants, including nil. Then LRM etc. will have their normal effect.
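
The decision just described can be modelled in a few lines. This is a Python sketch of the logic only, not the Emacs implementation: paragraph_direction and its forced parameter are hypothetical stand-ins for the display code and for the value of bidi-paragraph-direction, and the first-strong search is simplified (explicit embedding controls are ignored).

```python
import unicodedata


def paragraph_direction(text, forced=None):
    """Model of the rule: a non-None `forced` value wins outright.

    `forced` stands in for bidi-paragraph-direction; None plays the
    role of nil, which falls back to the first strong character.
    """
    if forced is not None:
        return forced
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "left-to-right"
        if bidi_class in ("R", "AL"):
            return "right-to-left"
    return "left-to-right"  # assumed default when no strong character


lrm_then_hebrew = "\u200e\u05d1\u05d2"

# A forced direction ignores the leading LRM entirely.
print(paragraph_direction(lrm_then_hebrew, forced="right-to-left"))
# With no forced value, the LRM is the first strong character.
print(paragraph_direction(lrm_then_hebrew))
```

In this model an LRM only matters when no direction is forced, which mirrors the point that let-binding the variable to nil restores the normal effect of LRM.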

Can you give me an example of any message in an English Emacs that
should be RTL?

I would need to wade through the many uses of `message' to see if
there are any. In general, any echo-area message that shows just
portions of buffer text (as opposed to a message generated by Emacs to
convey some information to the user) might need RTL paragraphs if the
text comes from a buffer written in some bidirectional script. But I
don't know off the top of my head which features use that, although
I'm pretty much sure there are such features.

Btw, there's something I overlooked before: why exactly is ^ב
considered a strong R2L character? Could you please go to it in the "
*Echo Area 0/1*" buffer, type "C-u C-x =", and show what Emacs tells
about that character?

First of all, I don't think your procedure works. You can make the
message appear, and with care you can get a cursor on top of it, but
typing C-u (or most anything else) changes the buffer contents---it's
not called the Echo Area for nothing! To get your hands on the
character you'd have to write a function that grabs the contents of the
buffer and bind it to a key, or in some other way avoid echoing.

See my other message for how I would do that.

But there's no point in trying. The buffer can't possibly contain an
actual ^ב. No buffer can. Buffers and strings can contain only those
characters encodable in 22 bits. If your input facilities permit, you
can prove this by typing ^Q ^ב; Emacs refuses to insert such a character
(Wrong type argument: char-or-string-p, 67110353).

^ב is just another Emacs display feature, like ^C. Emacs has special
code in its display engine to produce such two-character combinations
to display an otherwise unprintable character as a string that any
terminal will show without any problem. But Emacs still knows that
these two characters stand for a single character, and "C-u C-x ="
will tell you which one.

Here's what I think is happening: The code that complains about
undefined characters handles uninsertable characters (things like ^ב and
meta-control-mouse-down) by translating them to visible representation.
So the message contains a real caret followed by ב. That is, the first
character has no strong directionality, and the directionality is set by
the second character, a non-control ב.

That'd be my guess as well, but I'd like to be sure. One thing that
puzzles me is where does that caret come from: the function which
displays the "X is undefined" is supposed to use the C- notation for
control-modified characters, not the ^ notation.

Amit Aronovitch

2010-06-27 03:30:27 UTC

Hi people,

First, thanks Eli and all contributors for the remarkable effort, and all
the recent progress!

Note that there are two separate issues:
(1) Directionality (I'll use here B to represent hebrew Bet):
Should the message be displayed "is undefined B^" (RTL paragraph dir) or
"^B is undefined" (LTR paragraph dir)

(2) Alignment (to right or left margin) - where that message is to be
displayed. It makes sense to align to the "start" direction (i.e. right
for RTL and left for LTR), but AFAIK this is a matter of style and not
within the scope of the unicode standard.

(2) is a relatively minor problem, while (1) could be a real source for
confusion to the reader.

Post by Eli Zaretskii
Why is it wrong?

True. There is no way to determine the correct direction of a sentence
out of context with 100% certainty. That is why the Unicode standard
leaves a "higher-level protocol" free to set it
(http://unicode.org/reports/tr9/ HL1).
When such information is not available, a simple default algorithm is
described by the standard (rules P2, P3). This is implemented by common bidi
reordering libs, and I guess this is the reason for what you see here.

Post by Larry Denenberg
Aren't problems like this the entire raison d'etre of the invisible RLM
and LRM characters?

One of the main reasons. True. But, depending on the bidi reordering
function used, the application might be able to achieve the results by
providing this "higher level choice" itself. With libfribidi, the
"pbase_dir" input parameter can be used for that.

Since system minibuffer are always in English, maybe the minibuffer
should never be in display reorder mode.

What do you mean by ``minibuffer are always in English''? The
language and the paragraph direction are not necessarily related.

IMO, since the echo messages are typically one-liners, their directionality
should be defined by their language.
In Unix, if the message is translated to an RTL language (i.e. if
LC_MESSAGES is Arabic/Hebrew/Persian and the proper entry exists in the
translation file), then dir should be RTL.

Otherwise (as in the case you reported indeed), it should be set LTR.

I think this should work correctly in 99% of cases (in fact, at the moment,
I cannot think of any realistic case where it would fail).

Post by Eli Zaretskii
I think this is because the minibuffer and the echo area are not the
same thing. They just use the same portion of the Emacs display.

Absolutely correct. Learn something new every day.

Post by Eli Zaretskii
Does it work to set bidi-display-reordering in two buffers named
" *Echo Area 0*" and " *Echo Area 1*"?

Absolutely correct again! So now I can have it if I really want it.

There could be, if the messages themselves are in Hebrew (via LC_MESSAGES
and translation files).
I do not know if Emacs really has Arabic/Hebrew translations, but there is
no reason why it should not be translated, if that has not been done already.

Post by Larry Denenberg
Maybe. I ask again, do we have an example? What I'm saying here is
that certain parts of Emacs should be more careful in the face of bidi.
When Emacs wants to write "X is undefined" in the echo area with X
variable, maybe it should carefully put an LRM before the X because of
new potential side effects. This is something like a web programmer
being super cautious about sanitizing values that users type in.
cf. http://xkcd.com/327/

Finally, an even more subtle (and unimportant) issue: The actual
message I see looks like this: "is undefined ^ב”. But I would have
expected "is undefined ב^”, no? Shouldn't control-bet be written with
the uparrow on the right when in RTL mode?

Don't know about "should" (because as you said, both of them look "wrong").
However, if you let the standard Unicode algorithm reorder the logical string
"^B is undefined" with the default auto-detected directionality, it really
does result in what you seem to expect (the circumflex (0x5e) is a
neutral, and gets the directionality of the run). Maybe this is not really a
circumflex, or maybe some other magic is at work here.

Post by Eli Zaretskii
I don't know which one is the correct one. Do we have any "prior art"
in that some other applications display Ctrl-modified Hebrew
characters?

Not AFAIK. Unicode is about plaintext, not keyboard codes.
I do not know of any keyboard codes to ctrl-hebrew chars either - details

Amit Aronovitch

2010-06-27 03:48:01 UTC

Sorry, seems like I forgot to mention explicitly:
The following excerpt from my previous message tries to differentiate
between the codes produced when you press "ctrl-t" and what you get when you
press the same keys while in Hebrew mode (reported results refer to this
specific scenario, which was chosen as an example).

On Sun, Jun 27, 2010 at 6:30 AM, Amit Aronovitch <***@gmail.com> wrote:

Eli Zaretskii

2010-06-27 17:25:58 UTC

Date: Sun, 27 Jun 2010 06:30:27 +0300
First, thanks Eli and all contributors for the remarkable effort, and all
the recent progress!

You're most welcome.

Should the message be displayed "is undefined B^" (RTL paragraph dir) or
"^B is undefined" (LTR paragraph dir)

(2) Alignment (to right or left margin) - where that message is to be
displayed. It makes sense to align to the "start" direction (i.e. right
for RTL and left for LTR), but AFAIK this is a matter of style and not
within the scope of the unicode standard.

(2) is a relatively minor problem, while (1) could be a real source for
confusion to the reader.

In Emacs, (2) is entirely determined by (1): a L2R paragraph is
displayed flushed all the way to the left margin of the window, while
R2L paragraphs are flushed to the right margin.

I don't see any reason to have the paragraph and alignment be
independent. Every bidi-aware word processor I've seen behaves like I
described above, and I'm quite sure users expect that.

True. There is no way to the determine 100% surely the correct direction of
a sentence out of context. That is why the unicode standard leaves the
freedom for "higher level protocol" to set that
(http://unicode.org/reports/tr9/ HL1).
When such information is not available, a simple default algorithm is
described by the standard (rules P2, P3). This is implemented by common bidi
reordering libs, and I guess this is the reason for what you see here.

Emacs doesn't use any reordering libraries, but it does implement
UAX#9 to the letter, including determining the paragraph direction
from its first strong directional character.
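
The first-strong-character rule mentioned here (UAX#9 rules P2 and P3) can be sketched in a few lines. This is a simplified Python model, not the Emacs implementation: the function name `paragraph_direction` is made up for the example, and it ignores explicit embedding and override characters.

```python
import unicodedata


def paragraph_direction(text: str) -> str:
    """Return "RTL" or "LTR" from the first strong character of TEXT.

    Simplified reading of UAX#9 P2/P3: scan for the first character whose
    bidi class is L, R, or AL; fall back to LTR when there is none.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return "LTR"  # no strong character at all


# "^" is a neutral, so the Hebrew letter bet decides the direction.
echo_direction = paragraph_direction("^\u05d1 is undefined")

# An LRM in front supplies a strong L before the Hebrew letter is reached.
marked_direction = paragraph_direction("\u200e^\u05d1 is undefined")
```

Under this model the unmarked echo-area message comes out as a right-to-left paragraph, which matches the behavior reported at the start of the thread.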

Aren't problems like this the entire raison d'etre of the invisible RLM
and LRM characters?

One of the main reasons. True. But, depending on the bidi reordering
function used, the application might be able to achieve the results by
providing this "higher level choice" itself. With libfribidi, the
"pbase_dir" input parameter can be used for that.

In Emacs, we have the bidi-paragraph-direction variable, which
overrides the direction determined by the first strong character.

IMO, since the echo messages are typically one-liners, their directionality
should be defined by their language.

But what is the language of a message that includes mixed Hebrew and
English words or letters?

Emacs allows you to mix several scripts (a.k.a. "languages") in the
same buffer, so it is no longer clear in what "language" the document
is written.

Don't know about "should" (because as you said, both of them look "wrong").
However if you let the standard unicode algorithm reorder the logical string
"^B is undefined" with the default auto-detected directionality, it really
does result with what you seem to expect (the circumflex (0x5e) is a
neutral, and gets the directionality of the run). Maybe this is not really a
circumflex, or maybe some other magic is at work here.

If "^" were a normal character, I'd agree (and Emacs would then render
them automatically per UAX#9 anyway). But this is not the case.
Here, the string ^B or B^ is a display feature; the display engine
produces these two characters as a single display element, and cursor
motion treats them both as a single atomic entity. The question is:
within that atomic entity, how should we display the "^" part?

Don't get me wrong: if the consensus is that we should display this as
if we had 2 distinct characters, using UAX#9 reordering rules, I'm
okay with that.

Amit Aronovitch

2010-06-28 00:23:34 UTC

Date: Sun, 27 Jun 2010 06:30:27 +0300
First, thanks Eli and all contributors for the remarkable effort, and all
the recent progress!

You're most welcome.

Should the message be displayed "is undefined B^" (RTL paragraph dir) or
"^B is undefined" (LTR paragraph dir)
(2) Alignment (to right or left margin) - where that message is to be
displayed. It makes sense to align to the "start" direction (i.e. right
for RTL and left for LTR), but AFAIK this is a matter of style and not
within the scope of the unicode standard.
(2) is a relatively minor problem, while (1) could be a real source for
confusion to the reader.

In Emacs, (2) is entirely determined by (1): a L2R paragraph is
displayed flushed all the way to the left margin of the window, while
R2L paragraphs are flushed to the right margin.

This is perfectly acceptable. I just wanted to point out the problem more
clearly, as the OP named the *alignment* as being wrong (which is correlated
to, but not exactly the actual problem).

Post by Eli Zaretskii
I don't see any reason to have the paragraph and alignment be
independent. Every bidi-aware word processor I've seen behaves like I
described above, and I'm quite sure users expect that.

Of course. This is the most reasonable default.
However, word processors typically also have an option for selectively
modifying the alignment without affecting the directionality (toolbars have
separate buttons for directionality and alignment), and this gets them out
of sync.
(Such explicit alignment information might not be saved in plain-text files,
but might be useful for "rich" formats - maybe in w3 mode etc.)
One example where this might be useful is when you have a list of items
(names, addresses, cited references), some of which are RTL and some LTR, and
you wish the whole list to align to a single margin, to avoid a ragged
appearance. Another example is within tables.

Post by Eli Zaretskii
True. There is no way to the determine 100% surely the correct direction of
a sentence out of context. That is why the unicode standard leaves the
freedom for "higher level protocol" to set that
(http://unicode.org/reports/tr9/ HL1).
When such information is not available, a simple default algorithm is
described by the standard (rules P2, P3). This is implemented by common bidi
reordering libs, and I guess this is the reason for what you see here.

Emacs doesn't use any reordering libraries, but it does implement
UAX#9 to the letter, including determining the paragraph direction
from its first strong directional character.

Would be nice if we would be able to specify the direction explicitly
(manually) for selected paragraphs in the buffer. This can be stored in the
same way that other metadata (font sizes? color? images?) is being handled.
(p.s. If the buffer is plaintext, this information would probably be lost
when we save it. Still it might serve as a "manual override" to help
readability as long as the buffer is open).

Post by Larry Denenberg
Aren't problems like this the entire raison d'etre of the invisible RLM
and LRM characters?

In Emacs, we have the bidi-paragraph-direction variable, which
overrides the direction determined by the first strong character.

Is that per-buffer? What if you want to control directionality of specific
paragraphs? (you should be able to do that to properly show bidi text e.g.
in w3 mode).

IMO, since the echo messages are typically one-liners, their directionality
should be defined by their language.

But what is the language of a message that includes mixed Hebrew and
English words or letters?

In all cases I can think of, the language of the message (the messages to be
displayed in the echo area) should be as specified by the locale
(LC_MESSAGES). This is because if the locale is English, the message itself
(the informative wrapper, the template) is actually meant to be in English,
and any Hebrew parts come from quoted characters etc. (template data,
runtime variables). Vice versa for the case where LC_MESSAGES=he .

(Explanation for readers who are not familiar with the terms: Typically, for
i18n support in Unix apps, you write default messages in English, and print
them using e.g. GNU gettext (3). If a translation file (provided by relevant
translation team) exists for the language specified by the user's locale,
this causes the message to be printed in that language. The translated
message itself may be merely a template, which includes placeholders for
inserting runtime data).
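
A rough sketch of this proposal, in Python rather than Emacs Lisp: the direction comes from the locale of the message template, never from the runtime data substituted into it. Everything named here (`RTL_LANGUAGES`, `language_of`, `message_direction`, `render`) is hypothetical, and the language set is illustrative rather than complete.

```python
# Languages assumed to be written right to left (illustrative, not exhaustive).
RTL_LANGUAGES = {"he", "ar", "fa", "ur", "yi"}

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def language_of(locale_name: str) -> str:
    """Extract the language code, e.g. "he_IL.UTF-8" -> "he"."""
    base = locale_name.split(".")[0].split("_")[0].split("@")[0]
    return base.lower()


def message_direction(lc_messages: str) -> str:
    """Direction of the message template, decided by the locale alone."""
    return "RTL" if language_of(lc_messages) in RTL_LANGUAGES else "LTR"


def render(template: str, lc_messages: str, **data: str) -> str:
    """Fill TEMPLATE and prefix the mark that matches the template language."""
    mark = RLM if message_direction(lc_messages) == "RTL" else LRM
    return mark + template.format(**data)


english = render("{key} is undefined", "en_US.UTF-8", key="^\u05d1")
```

The point of the design is that the quoted character is treated as template data: it can be Hebrew without pulling an English message to the right margin, and vice versa for a Hebrew template quoting a Latin key name.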

Post by Eli Zaretskii
Emacs allows you to mix several scripts (a.k.a. "languages") in the
same buffer, so it is no longer clear in what "language" the document
is written.

Don't know about "should" (because as you said, both of them look "wrong").
However if you let the standard unicode algorithm reorder the logical string
"^B is undefined" with the default auto-detected directionality, it really
does result with what you seem to expect (the circumflex (0x5e) is a
neutral, and gets the directionality of the run). Maybe this is not really a
circumflex, or maybe some other magic is at work here.

If "^" were a normal character, I'd agree (and Emacs would then render
them automatically per UAX#9 anyway). But this is not the case.
Here, the string ^B or B^ is a display feature; the display engine
produces these two characters as a single display element, and cursor
motion treats them both as a single atomic entity. The question is:
within that atomic entity, how should we display the "^" part?

OK, that kind of "other magic" then :-)

Post by Eli Zaretskii
Don't get me wrong: if the consensus is that we should display this as
if we had 2 distinct characters, using UAX#9 reordering rules, I'm
okay with that.

To me at least, it does seem better to show it as in UAX#9.
However, it seems that I cannot reproduce the scenario at the moment (see
below).

Eli Zaretskii

2010-06-28 19:00:09 UTC

Date: Mon, 28 Jun 2010 03:23:34 +0300
Would be nice if we would be able to specify the direction explicitly
(manually) for selected paragraphs in the buffer.

You have that already, see below.

This can be stored in the
same way that other metadata (font sizes? color? images?) is being handled.

Emacs is primarily a text editor, and as such, it works with plain
text files. It doesn't store any metadata when it saves files that
use various faces on display. Instead, it recreates those faces anew
each time the file is visited and displayed.

The bidi support follows the same basic philosophy of working with
plain text files. That's why I didn't implement any means of saving
bidi formatting info with the file. The way of doing what you want is
very simple: insert an LRM or RLM character in front of the
paragraph. This has the advantage of producing the same effect in any
other bidi-aware application, while being invisible on display, at
least in Emacs. (And these two characters even have ISO-8859-8
encoding, so you don't even need UTF-8 support in those other apps.)

One of the main reasons. True. But, depending on the bidi reordering
function used, the application might be able to achieve the results by
providing this "higher level choice" itself. With libfribidi, the
"pbase_dir" input parameter can be used for that.

In Emacs, we have the bidi-paragraph-direction variable, which
overrides the direction determined by the first strong character.

Is that per-buffer?

Yes.

What if you want to control directionality of specific paragraphs?

See above.

(you should be able to do that to properly show bidi text e.g. in
w3 mode).

Emacs does not yet handle HTML, XML, and other similar markup formats
with bidi text. Before we add such support, we need to design it.
UAX#9 is of no help here; we need to come up with our own solution,
preferably one that is based on existing Emacs features. This is a
significant job, so I put it aside for now.

Btw, I think we should support bidirectional text in comments and
strings in program sources _before_ we support bidirectional HTML.
After all, Emacs is a programmer's editor. The current plain-text
UAX#9 approach does not work well with, say, C sources that use
bidirectional text in strings and comments.

Post by Eli Zaretskii
Instead of looking in the code, it is much easier to put the cursor on
the ctrl-א thing, and type "C-u C-x =". Then Emacs will tell you what
it thinks about this character, including its codepoint.
Could you please do this? I need to know that in order to understand
why Emacs treats this "character" as strong R. I cannot produce this
strange character on MS-Windows, or else I'd do this myself.

Not sure how to do that. It only appears in the echo area and I cannot
insert it in a buffer (the message disappears if I try to click the
minibuffer or move the cursor there using keyboard shortcuts). By the way,
the message that I see is "C-א not defined", not ^א as Larry described.

I tried binding the key to self-insert-command, and then I get a regular א
inserted into the buffer.

You could either (a) look in "*Messages*" or (b) in the two echo-area
buffers, " *Echo Area 0*" and " *Echo Area 1*". Sorry, I thought it
was clear, given the previous discussions.

Actually, while typing the above, I realized that while I was trying to bind
the key, I had C-א appearing in the mini-buffer. Checking, I saw that in
that scenario I can actually move the cursor around to it, and use C-u C-x
=. However, this reveals that the C-א displayed there is actually three
characters (C, -, א)...

That is what I would expect, but the original issue was with it being
displayed as א^ or ^א. To me, that says that Emacs did not recognize
this character as having the Ctrl modifier, because then it would have
displayed C-א.

Martin J. Dürst

2010-06-29 08:26:56 UTC

Hello everybody,

Date: Mon, 28 Jun 2010 03:23:34 +0300
(you should be able to do that to properly show bidi text e.g. in
w3 mode).

I think I have mentioned this before, but we have been doing some work
in the area of rendering HTML/XML source with bidi text. Please see
http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi/ and the links from
there. (it still has quite a few problems, mostly related to editing Web
pages and JavaScript,...)

It also looks like we might get around to work on transposing our
solutions to Emacs this year. It's too early to promise anything, but
we'll try our best. And we will certainly be glad to get help and advice
from this list.

Post by Eli Zaretskii
Btw, I think we should support bidirectional text in comments and
strings in program sources _before_ we support bidirectional HTML.
After all, Emacs is a programmer's editor. The current plain-text
UAX#9 approach does not work well with, say, C sources that use
bidirectional text in strings and comments.

The problem of properly (or let's say decently or reasonably) displaying
(e.g. C) source code with bidi characters is in my estimation quite a
bit simpler than the problem for HTML or XML. So we might do it for
HTML/XML first and then cut down the solution to programming languages,
or the other way round.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Eli Zaretskii

2010-06-29 17:34:47 UTC

Date: Tue, 29 Jun 2010 17:26:56 +0900
Emacs does not yet handle HTML, XML, and other similar markup formats
with bidi text. Before we add such support, we need to design it.
UAX#9 is of no help here; we need to come up with our own solution,
preferably one that is based on existing Emacs features. This is a
significant job, so I put it aside for now.

I think I have mentioned this before, but we have been doing some work
in the area of rendering HTML/XML source with bidi text. Please see
http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi/ and the links from
there. (it still has quite a few problems, mostly related to editing Web
pages and JavaScript,...)

It also looks like we might get around to work on transposing our
solutions to Emacs this year. It's too early to promise anything, but
we'll try our best. And we will certainly be glad to get help and advice
from this list.

Thanks. You did mention this before, and I've read those pages.

However, it is difficult to judge their applicability to Emacs,
because virtually nothing is said regarding the implementation, except
that it "uses overlays".

So it would be good to discuss your suggested solution for Emacs, to
make sure that it fits. Please note that the "emacs-bidi" you used
for your work is radically different in the way it implements bidi
reordering from what was eventually designed and implemented in the
current Emacs development sources. So what worked with "emacs-bidi"
will not necessarily work well with the current mainline code.

The problem of properly (or let's say decently or reasonably) displaying
(e.g. C) source code with bidi characters is in my estimation quite a
bit simpler than the problem for HTML or XML.

So maybe we should start with a simpler problem ;-)

The important thing is to establish whether we need some
infrastructure in Emacs that is currently missing, because that would
need to be coded first, before any user-visible progress can be made.

Amit Aronovitch

2010-06-29 23:04:30 UTC

Date: Tue, 29 Jun 2010 17:26:56 +0900

Post by Eli Zaretskii
Emacs does not yet handle HTML, XML, and other similar markup formats
with bidi text. Before we add such support, we need to design it.
UAX#9 is of no help here; we need to come up with our own solution,
preferably one that is based on existing Emacs features. This is a
significant job, so I put it aside for now.

I think I have mentioned this before, but we have been doing some work
in the area of rendering HTML/XML source with bidi text. Please see
http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi/ and the links from
there. (it still has quite a few problems, mostly related to editing Web
pages and JavaScript,...)
It also looks like we might get around to work on transposing our
solutions to Emacs this year. It's too early to promise anything, but
we'll try our best. And we will certainly be glad to get help and advice
from this list.

Thanks. You did mention this before, and I've read those pages.

This is new to me. Thanks for mentioning this again :-)

Post by Eli Zaretskii
However, it is difficult to judge their applicability to Emacs,
because virtually nothing is said regarding the implementation, except
that it "uses overlays".
So it would be good to discuss your suggested solution for Emacs, to
make sure that it fits. Please note that the "emacs-bidi" you used
for your work is radically different in the way it implements bidi
reordering from what was eventually designed and implemented in the
current Emacs development sources. So what worked with "emacs-bidi"
will not necessarily work well with the current mainline code.

The problem of properly (or let's say decently or reasonably) displaying
(e.g. C) source code with bidi characters is in my estimation quite a
bit simpler than the problem for HTML or XML.

So maybe we should start with a simpler problem ;-)

Please have a look at the following draft Israeli standard (by SII):
http://db.tt/rc21Gd
(note that this is a draft that is being revised, it did not get to public
review stage yet, so I will not keep this link up for long, but any comments
are welcome even at this stage).
It tries to both define general guidelines and describe specific examples
for various syntaxes, from simple filenames, to computer languages and XML
documents. (It handles display alone. UI stuff like cursor movement is
covered by a different standard, which is planned to be revised in 2011, I
think).

Post by Eli Zaretskii
The important thing is to establish whether we need some
infrastructure in Emacs that is currently missing, because that would
need to be coded first, before any user-visible progress can be made.

I believe that the required infrastructure has a lot in common with the
coloring (font-lock) system. The basic requirement is recognizing
syntactical elements (such as identifiers, strings and comments).
I'd appreciate your input on this document.

AA

Larry Denenberg

2010-06-30 00:25:22 UTC

[much discussion of rendering markup text excised]

I don't have anything further to say on this topic; I just wish to
suggest that it's time to change the subject line.

/Larry Denenberg
***@denenberg.com
http://larry.denenberg.com/

Eli Zaretskii

2010-06-30 17:55:58 UTC

Date: Wed, 30 Jun 2010 02:04:30 +0300
Please have a look at the following draft Israeli standard (by SII):
http://db.tt/rc21Gd

Thanks for the reference. I didn't find there anything new to me, but
it will certainly be useful to have all that handy when working on the
implementation of this for Emacs.

Post by Eli Zaretskii
The important thing is to establish whether we need some
infrastructure in Emacs that is currently missing, because that would
need to be coded first, before any user-visible progress can be made.

I believe that the required infrastructure has a lot in common with the
coloring (font-lock) system.

It's not that simple. The way bidi reordering is designed and
implemented in Emacs, the reordering itself happens _before_ faces,
overlays, and other display features are considered. The bidi
reordering engine is totally oblivious to text properties, overlays,
images, etc.; it just controls which character will be considered next
for delivering it to the display, and all the rest, i.e. calculation
of the face of that character, its display metrics, etc. -- all this
happens _after_ reordering, in code that calls the reordering engine.

What we need is a way of telling the reordering engine to reorder only
portions of buffer text. This is the infrastructure I was thinking
about, because I don't think we have anything like that at this time.
And I'm not sure it is a good idea to base the implementation on text
properties or overlays, at least not text properties of the kind used
for fontification.

Martin J. Dürst

2010-07-01 01:50:30 UTC

Hello Eli, Amit,

Date: Wed, 30 Jun 2010 02:04:30 +0300

Post by Eli Zaretskii
The important thing is to establish whether we need some
infrastructure in Emacs that is currently missing, because that would
need to be coded first, before any user-visible progress can be made.

I believe that the required infrastructure has a lot in common with the
coloring (font-lock) system.

I think what Amit meant is that there is quite some similarity between
the code that analyzes e.g. a C file to find out which parts are string
constants,... for coloring and the code that we will need to find parts
such as string constants,... for improved bidi display. I fully agree
with this. Ideally, the additional emacs-lisp code that we will need for
the various modes for each programming language or format will be just
minor additions to what's already there for syntax coloring,...

That should not be in conflict with the actual implementation of bidi,
coloring,... in the display engine, which happens at a quite different
level, and in a different order, as described above by Eli.

Post by Eli Zaretskii
What we need is a way of telling the reordering engine to reorder only
portions of buffer text. This is the infrastructure I was thinking
about, because I don't think we have anything like that at this time.
And I'm not sure it is a good idea to base the implementation on text
properties or overlays, at least not text properties of the kind used
for fontification.

Please note that we already have a way for telling the reordering engine
to "reorder only portions of buffer text", or alternatively, to "reorder
portions of buffer text differently", in the Unicode Bidi algorithm.
It's called Overrides and Embeddings. According to our experience
implementing better display for HTML and XML in HTML, being able to add
Overrides and Embeddings (and RLM/LRM marks) in a way that does not
affect the actual text in a buffer (e.g. as it would be stored to a
file) should be sufficient for getting the job done. I think we should
first explore this path and only if it fails should we start to create
additional infrastructure.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Eli Zaretskii

2010-07-01 03:14:43 UTC

Date: Thu, 01 Jul 2010 10:50:30 +0900
Please note that we already have a way for telling the reordering engine
to "reorder only portions of buffer text", or alternatively, to "reorder
portions of buffer text differently", in the Unicode Bidi algorithm.
It's called Overrides and Embeddings. According to our experience
implementing better display for HTML and XML in HTML, being able to add
Overrides and Embeddings (and RLM/LRM marks) in a way that does not
affect the actual text in a buffer (e.g. as it would be stored to a
file) should be sufficient for getting the job done. I think we should
first explore this path and only if it fails should we start to create
additional infrastructure.

You are suggesting to insert bidirectional format characters into the
buffer text in order to affect the display. That's a no-no, IMO:
Emacs should not modify buffer text for display purposes. If we do
what you propose, we will open a can of worms, whereby bugs or crashes
will cause Emacs to produce a file that is different from input, even
if you didn't edit the file at all.

Martin J. Dürst

2010-07-01 06:37:35 UTC

Hello Eli,

Post by Eli Zaretskii
You are suggesting to insert bidirectional format characters into the

I agree that that's a no-no, for the reasons you give below. But that's
not what I was suggesting or thinking about. What I was suggesting
(actually, the idea is originally from Kenichi Handa and/or Naoto
Takahashi) is that these bidirectional formatting characters go into the
text only 'virtually', e.g. in the before-string or after-string
properties of an overlay (see
http://www.gnu.org/software/emacs/elisp/html_node/Overlay-Properties.html#Overlay-Properties).
In that way, in my understanding, they are not part of the text buffer,
and will not be saved when saving the file.
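
The idea of "virtual" formatting characters can be pictured with a small model. This is Python, not Emacs Lisp, and it does not use the real overlay API: the `Overlay` class and `display_stream` function are stand-ins invented for the example. It only shows that the characters handed to a reordering step can differ from the characters that would be written to the file.

```python
from dataclasses import dataclass

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


@dataclass
class Overlay:
    """Toy stand-in for an overlay covering text[start:end]."""
    start: int
    end: int
    before: str = ""  # plays the role of a before-string
    after: str = ""   # plays the role of an after-string


def display_stream(text: str, overlays: list) -> str:
    """Return the characters a display step would see; TEXT is not modified."""
    out = []
    for pos in range(len(text) + 1):
        # Close overlays ending here before opening the ones starting here.
        for ov in overlays:
            if ov.end == pos:
                out.append(ov.after)
        for ov in overlays:
            if ov.start == pos:
                out.append(ov.before)
        if pos < len(text):
            out.append(text[pos])
    return "".join(out)


buffer_text = 'a "\u05d0\u05d1" b'             # what would be saved to disk
embedding = Overlay(3, 5, before=RLE, after=PDF)  # covers the two Hebrew letters
shown = display_stream(buffer_text, [embedding])
```

Saving would write `buffer_text` unchanged, while `shown` carries the embedding characters around the Hebrew run. Whether the real display engine can consume such strings during reordering is exactly the open question raised in the next paragraph.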

Of course, if the characters in the overlay properties before-string and
after-string are not currently taken into account when running the bidi
algorithm, then that approach may not work very easily.

In any way, I think it's better to use the concepts already available in
the Unicode Bidi algorithm (override, embedding, marks) for improving
the display of XML, HTML, and other structured data and program source,
rather than to invent completely new concepts. Whether these concepts
then get transferred to the bidi algorithm via the (faked) insertion of
characters or via some other way (one could imagine to have properties
such as LRO/RLO/LRE/RLE on overlays,...) may be a secondary issue.

There are in my view two reasons for why it is better to reuse the concepts:
1) it is easier for "application-level" emacs-lisp programmers who work
on updating modes to improve bidi display.
2) it is easier for the core implementer(s), i.e. you, because they have
to work with only one algorithm.

Regards, Martin.

Post by Eli Zaretskii
Emacs should not modify buffer text for display purposes. If we do
what you propose, we will open a can of worms, whereby bugs or crashes
will cause Emacs to produce a file that is different from input, even
if you didn't edit the file at all.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Eli Zaretskii

2010-07-01 17:45:58 UTC

Date: Thu, 01 Jul 2010 15:37:35 +0900
Hello Eli,

Post by Eli Zaretskii
You are suggesting to insert bidirectional format characters into the

I agree that that's a no-no, for the reasons you give below. But that's
not what I was suggesting or thinking about.

Sorry.

What I was suggesting
(actually, the idea is originally from Kenichi Handa and/or Naoto
Takahashi) is that these bidirectional formatting characters go into the
text only 'virtually', e.g. in the before-string or after-string
properties of an overlay (see
http://www.gnu.org/software/emacs/elisp/html_node/Overlay-Properties.html#Overlay-Properties).
In that way, in my understanding, they are not part of the text buffer,
and will not be saved when saving the file.

Got it.

Of course, if the characters in the overlay properties before-string and
after-string are not currently taken into account when running the bidi
algorithm, then that approach may not work very easily.

You are right: they aren't taken into account. I have yet to code
support for reordering text in display strings. To add this feature,
I will need to solve quite a few problems. Until I do, I won't know
whether what you suggest is even doable with a reasonable effort.

I also think that, even if doable, this is a somewhat hackish
solution. I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

In any way, I think it's better to use the concepts already available in
the Unicode Bidi algorithm (override, embedding, marks) for improving
the display of XML, HTML, and other structured data and program source,
rather than to invent completely new concepts. Whether these concepts
then get transferred to the bidi algorithm via the (faked) insertion of
characters or via some other way (one could imagine to have properties
such as LRO/RLO/LRE/RLE on overlays,...) may be a secondary issue.

I think the upcoming Unicode 6.0 is already headed in that direction.
See http://www.unicode.org/reports/tr9/proposed.html#HL1. The text
above this explicitly says that these provisions are for XML, HTML,
and other structured text.

So I think we will be fine doing it in Emacs.

1) it is easier for "application-level" emacs-lisp programmers who work
on updating modes to improve bidi display.
2) it is easier for the core implementer(s), i.e. you, because they have
to work with only one algorithm.

I don't intend to change the bidi reordering engine in any significant
way, to support these features. All that's needed is a possibility to
tell it "restrict yourself to region between buffer positions P1 and
P2". Actually, it just descended on me that I can easily do that with
`narrow-to-region', since the reordering engine already honors that,
it never goes out of the accessible portion of text.

Martin J. Dürst

2010-07-02 01:04:35 UTC

Hello Eli,

Date: Thu, 01 Jul 2010 15:37:35 +0900
Hello Eli,

Post by Eli Zaretskii
You are suggesting to insert bidirectional format characters into the

I agree that that's a no-no, for the reasons you give below. But that's
not what I was suggesting or thinking about.

Sorry.

No problem. I should have been clearer.

What I was suggesting
(actually, the idea is originally from Kenichi Handa and/or Naoto
Takahashi) is that these bidirectional formatting characters go into the
text only 'virtually', e.g. in the before-string or after-string
properties of an overlay (see
http://www.gnu.org/software/emacs/elisp/html_node/Overlay-Properties.html#Overlay-Properties).
In that way, in my understanding, they are not part of the text buffer,
and will not be saved when saving the file.

Got it.

Of course, if the characters in the overlay properties before-string and
after-string are not currently taken into account when running the bidi
algorithm, then that approach may not work very easily.

You are right: they aren't taken into account. I have yet to code
support for reordering text in display strings. To add this feature,
I will need to solve quite a few problems. Until I do, I won't know
whether what you suggest is even doable with a reasonable effort.
I also think that, even if doable, this is a somewhat hackish
solution.

One thing that we should think about is what people want to happen if
there is actual displayable text in some of these strings. I don't have
much of an idea where this is used, but I can imagine that at least in
some usage scenarios, one might want the text added via an overlay to be
rendered in exactly the same way as the text in the buffer. In that
case, it's about user requirements, even if the solution might involve
some hacks.

Post by Eli Zaretskii
I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

It's definitely also a viable solution, although there also might be
some tricky issues. Say you have a property defining an embedding from
characters 10 to 30, and another such property from characters 20 to 40.
What exactly is that supposed to mean?

In any case, I think it's better to use the concepts already available in
the Unicode Bidi algorithm (override, embedding, marks) for improving
the display of XML, HTML, and other structured data and program source,
rather than to invent completely new concepts. Whether these concepts
then get transferred to the bidi algorithm via the (faked) insertion of
characters or via some other way (one could imagine having properties
such as LRO/RLO/LRE/RLE on overlays,...) may be a secondary issue.

HL1 is indeed being reworked, but even without that rework, it already
provides the necessary leeway for what we want to do.

And please note that if we find out that something in 4.3, Higher-Level
Protocols, doesn't work for us, we can always ask for an addition or
clarification/correction. For example, in the context of programming
languages or HTML/XML, the sentence at the end of 4.3, "When text using
a higher-level protocol is to be converted to Unicode plain text, for
consistent appearance formatting codes should be inserted to ensure that
the order matches that of the higher-level protocol.", may be extremely
counterproductive. I already have written to the relevant Unicode
mailing list.

Post by Eli Zaretskii
So I think we will be fine doing it in Emacs.

1) it is easier for "application-level" emacs-lisp programmers who work
on updating modes to improve bidi display.
2) it is easier for the core implementer(s), i.e. you, because they have
to work with only one algorithm.

I don't intend to change the bidi reordering engine in any significant
way, to support these features. All that's needed is a possibility to
tell it "restrict yourself to region between buffer positions P1 and
P2". Actually, it just descended on me that I can easily do that with
`narrow-to-region', since the reordering engine already honors that,
it never goes out of the accessible portion of text.

I'm not sure I understand, but if it means that the bidi algorithm is
just applied piecewise, that won't be enough. It may be enough for some
simple cases, such as C programs, where the main concern is to keep text
within string constants together, and the rest is ASCII only and
therefore goes LTR. However, on the other hand, with some XML markup
with e.g. element and attribute names in Hebrew, in our experience
actual nestings (i.e. embeddings in terms of the bidi algorithm) are
highly desirable.

I think there are also other ways of attacking the problem. What about,
for example, a property on characters that increases the embedding level
in a certain way? Or a property that changes the bidi category of a
character?

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Eli Zaretskii

2010-07-02 10:38:38 UTC

Date: Fri, 02 Jul 2010 10:04:35 +0900

Note that I've changed the Subject line. It's time.

One thing that we should think about is what people want to happen if
there is actual displayable text in some of these strings. I don't have
much of an idea where this is used, but I can imagine that at least in
some usage scenarios, one might want the text added via an overlay to be
rendered in exactly the same way as the text in the buffer.

Can you explain what you mean by the last sentence? Perhaps an
example will clarify that.

Post by Eli Zaretskii
I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

It's definitely also a viable solution, although there also might be
some tricky issues. Say you have a property defining an embedding from
characters 10 to 30, and another such property from characters 20 to 40.
What exactly is that supposed to mean?

This cannot happen in Emacs, because each property can have only one
value for each character. In effect, ranges of buffer positions of the
same text property cannot overlap.
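
The point about one value per character can be illustrated with a small model. This is only a sketch in Python, not Emacs code: the names `put_property` and `runs` are hypothetical, and the dictionary stands in for a single text property over half-open buffer ranges. It assumes that setting the property on a range simply replaces whatever value was there before.

```python
def put_property(values, start, end, value):
    """Set one property on positions start..end-1, replacing older values."""
    for pos in range(start, end):
        values[pos] = value


def runs(values):
    """Collapse per-position values into (start, end, value) runs."""
    out = []
    for pos in sorted(values):
        if out and out[-1][1] == pos and out[-1][2] == values[pos]:
            # Same value continues: extend the current run.
            out[-1] = (out[-1][0], pos + 1, values[pos])
        else:
            out.append((pos, pos + 1, values[pos]))
    return out


embedding = {}
put_property(embedding, 10, 30, "first")
put_property(embedding, 20, 40, "second")
print(runs(embedding))
```

In this model the second call does not create an overlap; it shortens the first range, so positions 20 to 29 end up carrying only the second value.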

In any case, if this were possible, it would first and foremost have
to be solved for the unidirectional display.

Post by Eli Zaretskii
I don't intend to change the bidi reordering engine in any significant
way, to support these features. All that's needed is a possibility to
tell it "restrict yourself to region between buffer positions P1 and
P2". Actually, it just descended on me that I can easily do that with
`narrow-to-region', since the reordering engine already honors that,
it never goes out of the accessible portion of text.

I'm not sure I understand, but if it means that the bidi algorithm is
just applied piecewise, that won't be enough. It may be enough for some
simple cases, such as C programs, where the main concern is to keep text
within string constants together, and the rest is ASCII only and
therefore goes LTR. However, on the other hand, with some XML markup
with e.g. element and attribute names in Hebrew, in our experience
actual nestings (i.e. embeddings in terms of the bidi algorithm) are
highly desirable.

Again, an example would go a long way towards explaining what you
mean. In general, what I wrote does not eliminate the possibility
that embeddings might be used within the reordered parts, nor that the
text outside of the markup is LTR only.

I just meant to say that, technically, reordering of just a portion of
text can be achieved by temporarily narrowing the buffer to that
portion, for as long as the display engine is processing that portion.

I think there are also other ways of attacking the problem. What about,
for example, a property on characters that increases the embedding level
in a certain way?

This idea was actually discussed some 10 years ago, as one of the
possible means of maintaining the reordering information as part of
the buffer. It was rejected because, as I explained above, text
properties cannot overlap, so maintaining this information would be a
pain when the buffer is edited: you would need to split and join
properties' ranges when embedding format codes are added or deleted.

Or a property that changes the bidi category of a character?

This can be done if we need it, but I still don't see use-cases that
would benefit from such a feature.

Martin J. Dürst

2010-07-06 07:18:17 UTC

Hello Eli, others,

Sorry for being late in replying.

Date: Fri, 02 Jul 2010 10:04:35 +0900

Note that I've changed the Subject line. It's time.

Thanks!

Can you explain what you mean by the last sentence? Perhaps an
example will clarify that.

Well, let's assume that there is some arcane file format with settings,
and there is some Emacs lisp that adds additional text with overlays to
make it easier to understand the format. I'm sure there are other use
cases for such strings, otherwise, why would there be before-string and
after-string properties for overlays. Anyway, if there are both RTL and
LTR characters in one of these properties, these texts also need bidi
treatment. Even if there's only RTL, it has to be reordered for display.
Also, in some cases, the texts in the overlay properties may form units
that are best treated as embeddings (or similar), but in other cases,
they may better be treated as part of the overall text, and that overall
text should be processed with the bidi algorithm.

Post by Eli Zaretskii
I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

This cannot happen in Emacs, because each property can have only one
value for each character. In effect, ranges of buffer positions of the
same text property cannot overlap.

I see. But then that would make it rather difficult to define
embeddings, wouldn't it, because you have to include the number of
current embeddings and their orientation in the property. E.g.
something like (the characters a-g are just so that there's something
between the formatting codes):

a RLE b LRE c RLE d POP e POP f POP g

would translate into (writing each character on a separate line)

a
b RLE
c RLE LRE
d RLE LRE RLE
e RLE LRE
f RLE
g

Unless you add quite a bit of intermediate library code, this will be
rather inconvenient to handle for an end user.
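
The translation above, from inline formatting codes to a per-character list of open embeddings, can be sketched as follows. This is only an illustration in Python under stated assumptions: `embedding_stacks` is a hypothetical helper, tokens are separated by spaces, and `POP` is the notation used in the example for the code that closes the innermost embedding.

```python
def embedding_stacks(tokens):
    """Pair each ordinary character with the embeddings open at that point."""
    stack = []   # currently open embeddings, outermost first
    result = []  # (character, tuple of open embeddings)
    for tok in tokens:
        if tok in ("RLE", "LRE"):
            stack.append(tok)
        elif tok == "POP":
            if stack:  # an unmatched POP is ignored in this sketch
                stack.pop()
        else:
            result.append((tok, tuple(stack)))
    return result


example = "a RLE b LRE c RLE d POP e POP f POP g".split()
for char, open_embeddings in embedding_stacks(example):
    print(char, *open_embeddings)
```

The printed lines match the table above, which shows how much state a single property value would have to carry for each character.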

Post by Eli Zaretskii
In any case, if this were possible, it would first and foremost have
to be solved for the unidirectional display.

You mean overlapping properties? In that case, I agree. But if
properties cannot overlap, maybe we should use overlays. As far as I
understand, they can overlap.

Post by Eli Zaretskii
I don't intend to change the bidi reordering engine in any significant
way, to support these features. All that's needed is a possibility to
tell it "restrict yourself to region between buffer positions P1 and
P2". Actually, it just descended on me that I can easily do that with
`narrow-to-region', since the reordering engine already honors that,
it never goes out of the accessible portion of text.

Okay. In the prototype and in the Web-based editor that we have worked
on to display HTML, we typically used embeddings for:
- Elements (incl. start tag and end tag) that have a dir attribute
(which indicates an embedding in the Web page view). These can of course
be nested.
- Start tags (and end tags)
- Attribute/attribute value combinations

Not all of these may be necessary in all cases, but it would be too
complicated to try and figure exactly which ones might be left out in
any particular case, and even this wouldn't eliminate the need for
nested embeddings. And it is at least currently unclear to me how you
could achieve nested embeddings with a possibility to tell the rendering
engine "restrict yourself to this region".

Post by Eli Zaretskii
I just meant to say that, technically, reordering of just a portion of
text can be achieved by temporarily narrowing the buffer to that
portion, for as long as the display engine is processing that portion.

Yes, if reordering of only a portion of text is sufficient to address
some problem, then this will be enough.

I think there are also other ways of attacking the problem. What about,
for example, a property on characters that increases the embedding level
in a certain way?

Or a property that changes the bidi category of a character?

This can be done if we need it, but I still don't see use-cases that
would benefit from such a feature.

Making the characters that define XML syntax, such as <, >, ", ', =,...
strong LTR would solve a lot (but not all) of the display anomalies for
XML (incl. HTML).

It might solve all display anomalies for programming languages like C to
define " (for strings) and comment start/end as LTR (at least as long as
there are no RTL identifiers).
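
As a rough illustration of what "changing the bidi category" would mean, here is a Python sketch using the standard `unicodedata` module. The override table `LTR_SYNTAX` and the helper `bidi_class` are hypothetical names invented for this example; they are not an Emacs interface. In the Unicode Character Database the five XML syntax characters listed are all neutrals (class ON), which is why their position depends on the surrounding text.

```python
import unicodedata

XML_SYNTAX = "<>\"'="

# Hypothetical override table: treat XML syntax characters as strong LTR.
LTR_SYNTAX = {ch: "L" for ch in XML_SYNTAX}


def bidi_class(ch, overrides=None):
    """Return the bidi class of ch, consulting the override table first."""
    if overrides and ch in overrides:
        return overrides[ch]
    return unicodedata.bidirectional(ch)


for ch in XML_SYNTAX:
    print(repr(ch), bidi_class(ch), "->", bidi_class(ch, LTR_SYNTAX))
```

Characters outside the table, such as Hebrew letters, keep their normal class, so only the markup delimiters would behave differently.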

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Eli Zaretskii

2010-07-07 10:59:03 UTC

Date: Tue, 06 Jul 2010 16:18:17 +0900

One thing that we should think about is what people want to happen if
there is actual displayable text in some of these strings. I don't have
much of an idea where this is used, but I can imagine that at least in
some usage scenarios, one might want the text added via an overlay to be
rendered in exactly the same way as the text in the buffer.

Can you explain what you mean by the last sentence? Perhaps an
example will clarify that.

Well, let's assume that there is some arcane file format with settings,
and there is some Emacs lisp that adds additional text with overlays to
make it easier to understand the format. I'm sure there are other use
cases for such strings, otherwise, why would there be before-string and
after-string properties for overlays. Anyway, if there are both RTL and
LTR characters in one of these properties, these texts also need bidi
treatment. Even if there's only RTL, it has to be reordered for display.

There's no argument that text in display strings should be reordered.
I just didn't yet write code to handle that, but it's on my todo.

Also, in some cases, the texts in the overlay properties may form units
that are best treated as embeddings (or similar), but in other cases,
they may better be treated as part of the overall text, and that overall
text should be processed with the bidi algorithm.

I don't see any situation that RLE/LRE or RLO/LRO, as part of the
display string itself, won't be able to handle. Do you?

Post by Eli Zaretskii
I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

It's definitely also a viable solution, although there also might be
some tricky issues. Say you have a property defining an embedding from
characters 10 to 30, and another such property from characters 20 to 40.
What exactly is that supposed to mean?

This cannot happen in Emacs, because each property can have only one
value for each character. In effect, ranges of buffer positions of the
same text property cannot overlap.

I see. But then that would make it rather difficult to define
embeddings, wouldn't it, because you have to include the number of
current embeddings and their orientation in the property.

We may need to specify the base paragraph direction for each such
portion of buffer text, yes. But that is all; I don't see why we
would need to specify embedding level -- this can be handled with the
existing characters, RLE, RLO, etc.

IOW, what I thought about was that most of the text would be not
reordered (which is okay, since outside strings and comments, the rest
is strict L2R, mostly even 7-bit ASCII, text). Only the portions that
have the special property on them will be reordered, and that
reordering will be according to the normal UAX#9 rules. I still don't
see which use-cases will need something more than this. And I mean
specific practical use-cases, not hypothetical ones.

E.g. something like (the characters a-g are just so that there's something
between the formatting codes):

a RLE b LRE c RLE d POP e POP f POP g

would translate into (writing each character on a separate line)

a
b RLE
c RLE LRE
d RLE LRE RLE
e RLE LRE
f RLE
g

Unless you add quite a bit of intermediate library code, this will be
rather inconvenient to handle for an end user.

I don't understand why this would be needed. Could you please present
a detailed example where this is needed?

You mean overlapping properties? In that case, I agree. But if
properties cannot overlap, maybe we should use overlays. As far as I
understand, they can overlap.

Overlays don't scale up well; having lots of them in a buffer slows
down redisplay to an annoyingly low speed. So I'd rather we didn't,
if we can find another solution. Again, I still don't see why we
would need this one, and what problems it is supposed to solve.

I'm not sure I understand, but if it means that the bidi algorithm is
just applied piecewise, that won't be enough. It may be enough for some
simple cases, such as C programs, where the main concern is to keep text
within string constants together, and the rest is ASCII only and
therefore goes LTR. However, on the other hand, with some XML markup
with e.g. element and attribute names in Hebrew, in our experience
actual nestings (i.e. embeddings in terms of the bidi algorithm) are
highly desirable.

Again, an example would go a long way towards explaining what you
mean. In general, what I wrote does not eliminate the possibility
that embeddings might be used within the reordered parts, nor that the
text outside of the markup is LTR only.

Okay. In the prototype and in the Web-based editor that we have worked
on to display HTML, we typically used embeddings for:
- Elements (incl. start tag and end tag) that have a dir attribute
(which indicates an embedding in the Web page view). These can of course
be nested.
- Start tags (and end tags)
- Attribute/attribute value combinations

Not all of these may be necessary in all cases, but it would be too
complicated to try and figure exactly which ones might be left out in
any particular case, and even this wouldn't eliminate the need for
nested embeddings. And it is at least currently unclear to me how you
could achieve nested embeddings with a possibility to tell the rendering
engine "restrict yourself to this region".

Please show an actual fragment of HTML/XML which needs nesting or
embeddings.

Or a property that changes the bidi category of a character?

This can be done if we need it, but I still don't see use-cases that
would benefit from such a feature.

Making the characters that define XML syntax, such as <, >, ", ', =,...
strong LTR would solve a lot (but not all) of the display anomalies for
XML (incl. HTML).

If it doesn't solve all the problems, I'd rather try first to find a
solution that does. We probably won't want to change the bidi
properties of a character for the entire buffer (because it could be
used elsewhere in the buffer, like in a comment, where we would want
it to be reordered normally). So this means we would need to use
different tables of bidi properties for different portions of the
text. Switching bidi properties during display, as it walks the
buffer, is doable, but is somewhat tricky and can raise some hard
problems. The fact that it is not a comprehensive solution makes me
even more reluctant to use it.

It might solve all display anomalies for programming languages like C to
define " (for strings) and comment start/end as LTR (at least as long as
there are no RTL identifiers).

But quotes can appear in the comments as well, so I think here, too,
we won't be able to use the same properties for the entire buffer.

Covering each string, excluding its quotes, with a special text
property, and the same with a comment (excluding the comment
start/end) sounds a simpler solution.
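
Finding the ranges that such a property would cover might look roughly like this. It is a Python sketch, not Emacs code: `string_content_ranges` is a hypothetical helper, the pattern only knows C-style double-quoted strings with backslash escapes, and it makes no attempt to skip quotes that appear inside comments.

```python
import re

# A double-quoted literal; group 1 is the content without the quotes.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')


def string_content_ranges(source):
    """Return (start, end) offsets of string contents, quotes excluded."""
    return [match.span(1) for match in STRING_LITERAL.finditer(source)]


code = 'x = "abc"; y = "de";'
for start, end in string_content_ranges(code):
    print(start, end, code[start:end])
```

Each returned range excludes the delimiters, so the quotes themselves would stay in the surrounding LTR text while only the contents are reordered.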

Martin J. Dürst

2010-07-15 10:49:23 UTC

Hello Eli,

Sorry to be late with my reply.

Date: Tue, 06 Jul 2010 16:18:17 +0900

Post by Martin J. Dürst
One thing that we should think about is what people want to happen if
there is actual displayable text in some of these strings. I don't have
much of an idea where this is used, but I can imagine that at least in
some usage scenarios, one might want the text added via an overlay to be
rendered in exactly the same way as the text in the buffer.

Can you explain what you mean by the last sentence? Perhaps an
example will clarify that.

There's no argument that text in display strings should be reordered.
I just didn't yet write code to handle that, but it's on my todo.

Okay.

Also, in some cases, the texts in the overlay properties may form units
that are best treated as embeddings (or similar), but in other cases,
they may better be treated as part of the overall text, and that overall
text should be processed with the bidi algorithm.

I don't see any situation that RLE/LRE or RLO/LRO, as part of the
display string itself, won't be able to handle. Do you?

It depends on where we allow the corresponding PDFs to go. (a) do the
PDFs need to be in the same piece of text (or, if a PDF is missing, do
we just close the embedding anyway at the end of that piece of text), or
(b) are embeddings (and overrides) allowed to span several of these text
pieces?

If you mean (b), then we should most probably be covered. If you mean
(a), I'm not so sure about it.

Post by Martin J. Dürst

Post by Eli Zaretskii
I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

This cannot happen in Emacs, because each property can have only one
value for each character. In effect, ranges of buffer positions of the
same text property cannot overlap.

I see. But then that would make it rather difficult to define
embeddings, wouldn't it, because you have to include the number of
current embeddings and their orientation in the property.

This again depends on the answer to (a)/(b) above.

Post by Eli Zaretskii
IOW, what I thought about was that most of the text would be not
reordered (which is okay, since outside strings and comments, the rest
is strict L2R, mostly even 7-bit ASCII, text). Only the portions that
have the special property on them will be reordered, and that
reordering will be according to the normal UAX#9 rules. I still don't
see which use-cases will need something more than this. And I mean
specific practical use-cases, not hypothetical ones.

I think for something like the C programming language, this view mostly
makes sense. But not all programming languages are that easy. As one
example, in many programming languages (Perl, Ruby, JavaScript,...),
regular expressions are part of the language syntax. There, you can have
complex hierarchies of (e.g. RTL) text and syntactic structure.

Also, in several programming languages, there is string interpolation.
This means that in the middle of a (let's assume RTL) string, one can go
back to code. And then of course in the middle of that code, one can go
back to strings. And string interpolation can also be used in regexps.

And then there is also the whole area of PHP, JSP, ASP,... where you
have by definition program code in the middle of (Web page) text, and of
course that program code can contain text again.

Not myself using any RTL language, I can only guess how users may want
to have such constructs displayed. But assuming that the examples in
Unicode TR 9 have some actual use, I'd assume that at least some of the
people involved, at least in some cases, would prefer structured
reordering rather than just piecewise reordering at the lowest level.

This also applies to XML/HTML. Let's take the following example from TR 9:
logical, with some LRE/RLE/PDF: DID YOU SAY ’he said “car MEANS CAR”‘?
With HTML markup:
DID YOU SAY ’he said
“car MEANS
CAR”‘?

To take just the innermost part here, would a user want to see
car RAC SNAEM
or would she like to see
RAC SNAEM car
or would she like to see
RAC SNAEM car<lang='en' span>
which looks confusing, but maybe not so much if the element name is in
RTL, too, which would then give something like
RAC SNAEM <NAPS/>car<lang='en' NAPS>
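
The first two alternatives differ only in the base direction used for the fragment. The Python sketch below is a toy model, not the UAX#9 algorithm and not what Emacs implements: `toy_reorder` is a hypothetical name, uppercase letters stand for RTL characters and lowercase for LTR (the convention used in the example), everything else is treated as a neutral, and explicit embeddings, numbers and mirroring are ignored.

```python
def toy_reorder(text, base_rtl=False):
    """Toy visual reordering: uppercase is RTL, lowercase is LTR."""
    n = len(text)
    base_type = "R" if base_rtl else "L"
    # Strong types only; spaces and punctuation are neutral (None).
    types = ["R" if c.isupper() else "L" if c.islower() else None for c in text]

    # A run of neutrals takes the direction of its neighbours when both
    # sides agree, otherwise the base direction.
    resolved = list(types)
    i = 0
    while i < n:
        if types[i] is not None:
            i += 1
            continue
        j = i
        while j < n and types[j] is None:
            j += 1
        before = types[i - 1] if i > 0 else base_type
        after = types[j] if j < n else base_type
        fill = before if before == after else base_type
        resolved[i:j] = [fill] * (j - i)
        i = j

    # Even levels are LTR, odd levels are RTL.
    if base_rtl:
        levels = [1 if t == "R" else 2 for t in resolved]
    else:
        levels = [0 if t == "L" else 1 for t in resolved]

    # Reverse contiguous runs, from the highest level down to level 1.
    order = list(range(n))
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[order[i]] < level:
                i += 1
                continue
            j = i
            while j < n and levels[order[j]] >= level:
                j += 1
            order[i:j] = order[i:j][::-1]
            i = j
    return "".join(text[k] for k in order)


print(toy_reorder("car MEANS CAR"))
print(toy_reorder("car MEANS CAR", base_rtl=True))
```

With an LTR base the toy gives the first display, and with an RTL base the second; the variants that include markup depend on how the tags themselves are classified, which this sketch does not attempt.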

So for XML, especially with native markup, we can have the whole gamut
from little pieces (if anything) of RTL in a sea of LTR to little pieces
(if anything) of LTR in a sea of RTL, and lots of combinations and
nestings in the middle.

E.g. something like (the characters a-g are just so that there's something
between the formatting codes):
a RLE b LRE c RLE d POP e POP f POP g
would translate into (writing each character on a separate line)
a
b RLE
c RLE LRE
d RLE LRE RLE
e RLE LRE
f RLE
g
Unless you add quite a bit of intermediate library code, this will be
rather inconvenient to handle for an end user.

I don't understand why this would be needed. Could you please present
a detailed example where this is needed?

For the actual text being edited, see above. Of course, this specific
feature would only be needed if there's no other way to affect display
structure.

You mean overlapping properties? In that case, I agree. But if
properties cannot overlap, maybe we should use overlays. As far as I
understand, they can overlap.

To go back to the basics, we need a way (on first approximation, any way
may be okay) to tell the display reordering engine where and how it
should take into account the syntactic structure of the program/markup
being edited.

Whether that can best be done by
1) adding some bidi formatting codes to the text being edited,
2) adding some bidi formatting codes to display text from a property or
overlay (before-string and after-string)
3) adding some bidi-specific properties or overlay properties to
directly influence bidi reordering
4) some other means
is what I think we have been discussing. I continue to agree with you
that 1) is a bad idea. 2) is what Kenichi originally suggested. 3) is
what the example above is about.

I don't really mind too much which way we go, but given that I must
assume that the bidi algorithm has hierarchically nested embeddings for
a reason, and that programming languages and markup languages are in
many ways quickly much more nested than natural language (see examples
above), I don't think we can easily get away with a simplistic model of
"everything is LTR, with an occasional RTL string in it". That might
work for a programming language like C, but not for things like Perl,
Ruby, PHP, JSP, ASP, HTML, and XML.

Post by Martin J. Dürst
I'm not sure I understand, but if it means that the bidi algorithm is
just applied piecewise, that won't be enough. It may be enough for some
simple cases, such as C programs, where the main concern is to keep text
within string constants together, and the rest is ASCII only and
therefore goes LTR. However, on the other hand, with some XML markup
with e.g. element and attribute names in Hebrew, in our experience
actual nestings (i.e. embeddings in terms of the bidi algorithm) are
highly desirable.

Okay. In the prototype and in the Web-based editor that we have worked
on to display HTML, we typically used embeddings for:
- Elements (incl. start tag and end tag) that have a dir attribute
(which indicates an embedding in the Web page view). These can of course
be nested.
- Start tags (and end tags)
- Attribute/attribute value combinations
Not all of these may be necessary in all cases, but it would be too
complicated to try and figure exactly which ones might be left out in
any particular case, and even this wouldn't eliminate the need for
nested embeddings. And it is at least currently unclear to me how you
could achieve nested embeddings with a possibility to tell the rendering
engine "restrict yourself to this region".

Please show an actual fragment of HTML/XML which needs nesting or
embeddings.

See above.

Post by Martin J. Dürst
Or a property that changes the bidi category of a character?

This can be done if we need it, but I still don't see use-cases that
would benefit from such a feature.

Making the characters that define XML syntax, such as <, >, ", ', =,...
strong LTR would solve a lot (but not all) of the display anomalies for
XML (incl. HTML).

If it doesn't solve all the problems, I'd rather try first to find a
solution that does.

Agreed.

Post by Eli Zaretskii
We probably won't want to change the bidi
properties of a character for the entire buffer (because it could be
used elsewhere in the buffer, like in a comment, where we would want
it to be reordered normally). So this means we would need to use
different tables of bidi properties for different portions of the
text. Switching bidi properties during display, as it walks the
buffer, is doable, but is somewhat tricky and can raise some hard
problems.

The table lookup might be done beforehand, with Font lock or some
similar mechanism, and the result may be carried in properties.

Post by Eli Zaretskii
The fact that it is not a comprehensive solution makes me
even more reluctant to use it.

I agree with that. It may be possible to make it a comprehensive
solution if we can use LRE/RLE/.../PDF as a bidi property (i.e. say that
a plain old character also works as an LRE/RLE/.../PDF). But that may
not be enough, we might have to go as far as being able to attach a
sequence of bidi properties to a single character. Not exactly pretty :-(.

It might solve all display anomalies for programming languages like C to
define " (for strings) and comment start/end as LTR (at least as long as
there are no RTL identifiers).

But quotes can appear in the comments as well, so I think here, too,
we won't be able to use the same properties for the entire buffer.

True.

Post by Eli Zaretskii
Covering each string, excluding its quotes, with a special text
property, and the same with a comment (excluding the comment
start/end) sounds a simpler solution.

This works very nicely if there is no nesting. If you can tell me for
sure that nobody working with Perl, Ruby, PHP, JSP, ASP, HTML, XML,...
will prefer nested bidi reordering for some cases, that might solve the
problem. But I wouldn't want to make such an assertion.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:***@it.aoyama.ac.jp

Eli Zaretskii

2010-07-15 13:00:02 UTC

Date: Thu, 15 Jul 2010 19:49:23 +0900
Sorry to be late with my reply.

We are all busy people.

Post by Eli Zaretskii
I don't see any situation that RLE/LRE or RLO/LRO, as part of the
display string itself, won't be able to handle. Do you?

It depends on where we allow the corresponding PDFs to go. (a) do the
PDFs need to be in the same piece of text (or, if a PDF is missing, do
we just close the embedding anyway at the end of that piece of text), or
(b) are embeddings (and overrides) allowed to span several of these text
pieces?

If you mean (b), then we should most probably be covered. If you mean
(a), I'm not so sure about it.

I meant (a). Anything else can be handled by providing the initial
value for the base embedding level, as part of the property, I think.

Also, in several programming languages, there is string interpolation.
This means that in the middle of a (let's assume RTL) string, one can go
back to code. And then of course in the middle of that code, one can go
back to strings. And string interpolation can also be used in regexps.

And then there is also the whole area of PHP, JSP, ASP,... where you
have by definition program code in the middle of (Web page) text, and of
course that program code can contain text again.

Something that a clever enough parser couldn't parse and set the
properties accordingly?

This also applies to XML/HTML. Let's take the following example from TR 9:
logical, with some LRE/RLE/PDF: DID YOU SAY ’he said “car MEANS CAR”‘?

DID YOU SAY ’<span lang='en' dir='ltr'>he said
“car MEANS
CAR”‘?

To take just the innermost part here, would a user want to see
car RAC SNAEM
or would she like to see
RAC SNAEM car
or would she like to see
RAC SNAEM car<lang='en' span>
which looks confusing, but maybe not so much if the element name is in
RTL, too, which would then give something like
RAC SNAEM <NAPS/>car<lang='en' NAPS>

How is this actually displayed by a browser, i.e. when the markup is
removed? That's how we should display it with the markup as well.
IOW, according to the markup rules.

I don't really mind too much which way we go, but given that I must
assume that the bidi algorithm has hierarchically nested embeddings for
a reason

It has them for a reason, but that reason is to allow programs to
produce text where these embeddings are already present by means of
the formatting control characters. If these formatting characters are
not there in the original text, we are not allowed to add them.

Post by Eli Zaretskii
We probably won't want to change the bidi
properties of a character for the entire buffer (because it could be
used elsewhere in the buffer, like in a comment, where we would want
it to be reordered normally). So this means we would need to use
different tables of bidi properties for different portions of the
text. Switching bidi properties during display, as it walks the
buffer, is doable, but is somewhat tricky and can raise some hard
problems.

The table lookup might be done beforehand, with Font lock or some
similar mechanism, and the result may be carried in properties.

You seem to be thinking about performance; that's not the issue. The
issue is that the reordering engine was written under the assumption
that certain information remains static during reordering of a single
level run. If this assumption is violated, I don't know what will
happen. I never analyzed such a possibility.

Post by Eli Zaretskii
Covering each string, excluding its quotes, with a special text
property, and the same with a comment (excluding the comment
start/end) sounds a simpler solution.

This works very nicely if there is no nesting. If you can tell me for
sure that nobody working with Perl, Ruby, PHP, JSP, ASP, HTML, XML, ...
will prefer nested bidi reordering for some cases, that might solve the
problem. But I wouldn't want to make such an assertion.

I'm willing to assume that for the time being, until someone comes and
shows a use-case where this is a limitation. My experience with
Hebrew speaking programmers is that they avoid such mixups precisely
because they are not handled well by existing development tools.
Let's leave something for future extensions ;-)
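
As a rough illustration of the non-nested approach discussed above (one span per string, quotes excluded), here is a small Python sketch that computes the character ranges such a property would cover. It is not Emacs code: the regular expression only handles double-quoted strings with backslash escapes, and `string_body_spans` is a hypothetical name. It deliberately ignores interpolation and nesting, which is exactly the simplifying assumption under discussion.

```python
import re

# Double-quoted string with backslash escapes; group 1 is the body
# without the surrounding quotes.
STRING_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')


def string_body_spans(source):
    """Return (start, end) offsets of each string body, quotes excluded."""
    return [match.span(1) for match in STRING_RE.finditer(source)]


source = 'x = "abc" + "d\\"e"'
for start, end in string_body_spans(source):
    # Each range is where a per-string property would be applied.
    print(start, end, source[start:end])
```

A string containing interpolated code would come out as one flat span here, so the code inside it would be reordered together with the surrounding string text.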

Beni Cherniavsky-Paskin

2010-07-02 10:43:33 UTC

Post by Martin J. Dürst
Of course, if the characters in the overlay properties before-string and
after-string are not currently taken into account when running the bidi
algorithm, then that approach may not work very easily.

You are right: they aren't taken into account. I have yet to code
support for reordering text in display strings. To add this feature,
I will need to solve quite a few problems. Until I do, I won't know
whether what you suggest is even doable with a reasonable effort.
I also think that, even if doable, this is a somewhat hackish
solution. I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

I'm second-guessing Martin here (correct me if I'm wrong!) but I suspect
this proposal is based on:

(A) HTML bidi experience, where CSS already supports before/after-string,
and browsers respect bidi control codes appearing there (?)

(B) a presumption that it's the easiest for you because you planned to
implement it anyway.

However, since it's not implemented yet (and it's not obvious how to),
I want to point out that having proper bidi in strings and comments
is a much more useful goal (to users) than supporting bidi reordering of
before/after-string properties of overlays, so if there is a better way
to get the former, you should feel free to skip the latter.

Post by Martin J. Dürst
Date: Wed, 30 Jun 2010 02:04:30 +0300
I believe that the required infrastructure has a lot in common with the
coloring (font-lock) system.

If I may add some motivation:
In an ideal world, we would have a separate "bidi-lock" engine,
hand-tuned for each syntax in the world, parsing it and endowing
characters with whatever overlays/properties the display engine
needs to show it in perfect order.

But that's out of the question because it'd require huge *manpower*
to maintain, and only a fraction of the world's programmers care
about bidi. A much more realistic (even if less perfect) approach
is to harness the existing polished font-lock engine, defining bidi
in terms of its output (e.g. "characters with font-lock-string-face
should be treated as embeddings and not mix with outside text").

[I'm just saying that'd be cool; see below on feasibility...]

So currently there is no way whatsoever by which lisp code (even the
hypothetical perfect bidi-lock) can affect reordering? (Except for
actually inserting characters into the buffer, which we agree is a no-no)

So to get smarter bidi, the bidi engine will have to inspect *some*
kind of metadata on the characters it handles. Now let's separate
two mostly orthogonal concerns:

1. What bidi metadata exactly do we want?
Emacs has many kinds of metadata (face info, invisibility,
local vars/keymap, hooks...) but we'll need new one(s).
- Virtual control characters?
- An increment to embedding level?
- Overriding a character's Unicode bidi category?
...

It's an interesting discussion, but I don't have much to say now.

2. How will it be attached to the text?
It seems that Emacs has exactly two ways to attach metadata to text:
- Text properties
- Overlays (implemented by pairs of markers?)

Almost all metadata can be attached by both, although a few (e.g.
before/after-string) only work on overlays.

Also, there are several indirection mechanisms, notably faces:
a lot of metadata (such as font size) is not directly attached to
text/overlays but to a face object (which is a property of text/overlay).

Let's concentrate on question 2 here.

Eli Zaretskii

2010-07-02 11:28:56 UTC

Date: Fri, 2 Jul 2010 13:43:33 +0300

I'm second-guessing Martin here (correct me if I'm wrong!) but I suspect
this proposal is based on:

(A) HTML bidi experience, where CSS already supports before/after-string,
and browsers respect bidi control codes appearing there (?)

(B) a presumption that it's the easiest for you because you planned to
implement it anyway.

I'm not following, probably because I don't know enough about CSS.

This isn't possible, because overlays and text properties cannot be
used for storing reordering information. I explained in my previous
message why.

With the current implementation, bidi reordering happens only when a
portion of text is being delivered to the glass. Then the information
about reordering is thrown away (and recomputed as needed each time
the same portion of text is about to be displayed again). Therefore,
it would be useless to try to "bidi-lock" invisible portions of the
buffer -- you will have no place to record the results of this
reordering anyway, and no way of reusing that information even if
recorded somewhere.

A much more realistic (even if less perfect) approach
is to harness the existing polished font-lock engine, defining bidi
in terms of its output (e.g. "characters with font-lock-string-face
should be treated as embeddings and not mix with outside text").

Maybe. But note that this would mean we cannot properly display
comments and strings if font-lock mode is turned off in the buffer (or
globally).
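
A toy Python model of the font-lock idea, and of the caveat just raised, might look like the following. It is only a sketch outside Emacs: the per-character face list is a stand-in for what font-lock would assign, and `isolated_runs` and `ISOLATED_FACES` are hypothetical names. The face names themselves are the standard font-lock ones.

```python
from itertools import groupby

# Faces whose text would be treated as a unit for reordering (assumption).
ISOLATED_FACES = {"font-lock-string-face", "font-lock-comment-face"}


def isolated_runs(faces):
    """Group per-character faces into (start, end, face) runs to isolate."""
    runs = []
    position = 0
    for face, group in groupby(faces):
        length = len(list(group))
        if face in ISOLATED_FACES:
            runs.append((position, position + length, face))
        position += length
    return runs


# Faces for: foo("abc") with font-lock on; the whole string is fontified.
fontified = [None] * 4 + ["font-lock-string-face"] * 5 + [None]
print(isolated_runs(fontified))

# With font-lock off, no character carries a face, so nothing is isolated.
print(isolated_runs([None] * 10))
```

The second call shows the dependency: with no faces present, the approach has nothing to work from, and strings and comments would be reordered like any other text.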

So currently there is no way whatsoever by which lisp code (even the
hypothetical perfect bidi-lock) can affect reordering?

No. If we think this would be needed, a useful first step would be to
identify the use-cases which require that.

(Except for actually inserting characters into the buffer, which we
agree is a no-no)

It is a no-no for display purposes. But for other purposes,
e.g. forcing paragraph direction on specific paragraphs, it's actually
the way to go, because you do want that direction to be in force even
outside Emacs, in any other conforming application.
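
For the paragraph-direction case, a minimal Python sketch of what "inserting characters" amounts to is shown below. It assumes the simple first-strong-character rule for the base direction and ignores any higher-level override or directional isolates; `first_strong_direction` and `force_direction` are hypothetical helper names, not part of any library.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def first_strong_direction(paragraph, default="ltr"):
    """Return 'ltr' or 'rtl' from the first strong character, else default."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


def force_direction(paragraph, direction):
    """Prefix a mark only when the paragraph would otherwise resolve differently."""
    if first_strong_direction(paragraph, default=None) == direction:
        return paragraph
    return (LRM if direction == "ltr" else RLM) + paragraph


message = "\u05d1 is undefined"  # starts with HEBREW LETTER BET
print(first_strong_direction(message))
fixed = force_direction(message, "ltr")
print(first_strong_direction(fixed))
```

Because the mark is an ordinary character stored in the text, any other conforming application that reads the same text resolves the same paragraph direction.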

1. What bidi metadata exactly do we want?
Emacs has many kinds of metadata (face info, invisibility,
local vars/keymap, hooks...) but we'll need new one(s).
- Virtual control characters?
- An increment to embedding level?
- Overriding a character's Unicode bidi category?
...

It's an interesting discussion, but I don't have much to say now.

2. How will it be attached to the text?
It seems that Emacs has exactly two ways to attach metadata to text:
- Text properties
- Overlays (implemented by pairs of markers?)

Almost all metadata can be attached by both, although a few (e.g.
before/after-string) only work on overlays.

Also, there are several indirection mechanisms, notably faces:
a lot of metadata (such as font size) is not directly attached to
text/overlays but to a face object (which is a property of text/overlay).

Let's concentrate on question 2 here.

Again, I think we should start with identifying the use-cases which
would need these features.

Martin J. Dürst

2010-07-06 09:45:43 UTC

Post by Beni Cherniavsky-Paskin

Hello Beni, Eli, others,

Post by Beni Cherniavsky-Paskin

You are right: they aren't taken into account. I have yet to code
support for reordering text in display strings. To add this feature,
I will need to solve quite a few problems. Until I do, I won't know
whether what you suggest is even doable with a reasonable effort.
I also think that, even if doable, this is a somewhat hackish
solution. I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

I'm second-guessing Martin here (correct me if I'm wrong!) but I suspect
this proposal is based on:
(A) HTML bidi experience, where CSS already supports before/after-string,
and browsers respect bidi control codes appearing there (?)

Well, browsers already implementing the bidi algorithm was one of the
motivations for doing all our previous work (simulation, editing
prototype) with HTML in browsers. I also once worked on a variant of the
editing prototype that used CSS before/after selectors, but this is
still not implemented in all editors, so we didn't make this the main
line of our work.

Post by Beni Cherniavsky-Paskin
(B) a presumption that it's the easiest for you because you planned to
implement it anyway.
However, since it's not implemented yet (and it's not obvious how to),
I want to point out that having proper bidi in strings and comments
is a much more useful goal (to users) than supporting bidi reordering of
before/after-string properties of overlays, so if there is a better way
to get the former, you should feel free to skip the latter.

Yes, of course the goal is to support 'proper', non-confusing bidi in
strings and comments in programming languages, and for markup languages
such as XML and HTML, and using before/after-string properties of
overlays would be one means to get there.

Post by Beni Cherniavsky-Paskin

Post by Martin J. Dürst
Date: Wed, 30 Jun 2010 02:04:30 +0300
I believe that the required infrastructure has a lot in common with the
coloring (font-lock) system.

In an ideal world, we would have a separate "bidi-lock" engine,
hand-tuned for each syntax in the world, parsing it and endowing
characters with whatever overlays/properties the display engine
needs to show it in perfect order.
But that's out of the question because it'd require huge *manpower*
to maintain, and only a fraction of the world's programmers care
about bidi. A much more realistic (even if less perfect) approach
is to harness the existing polished font-lock engine, defining bidi
in terms of its output (e.g. "characters with font-lock-string-face
should be treated as embeddings and not mix with outside text").
[I'm just saying that'd be cool; see below on feasibility...]

I agree with this general assessment. If there's already some
commonality, if only for naming, for a set of programming languages, it
would be a pity to not reuse that for bidi.

Post by Eli Zaretskii
It's not that simple. The way bidi reordering is designed and
implemented in Emacs, the reordering itself happens _before_ faces,
overlays, and other display features are considered. The bidi
reordering engine is totally oblivious to text properties, overlays,
images, etc.; it just controls which character will be considered next
for delivering it to the display, and all the rest, i.e. calculation
of the face of that character, its display metrics, etc. -- all this
happens _after_ reordering, in code that calls the reordering engine.

That's fine. But we understand that we need some way to tweak/control
some aspects of bidi in order to get decent and reasonably readable
display for program texts and markup languages. So whether it's some
text properties, some overlay properties, or whatever, we need at least
some hook to influence the basic bidi reordering.

Post by Beni Cherniavsky-Paskin
So currently there is no way whatsoever by which lisp code (even the
hypothetical perfect bidi-lock) can affect reordering? (Except for
actually inserting characters into the buffer, which we agree is a no-no)
So to get smarter bidi, the bidi engine will have to inspect *some*
kind of metadata on the characters it handles. Now let's separate
two mostly orthogonal concerns:
1. What bidi metadata exactly do we want?
Emacs has many kinds of metadata (face info, invisibility,
local vars/keymap, hooks...) but we'll need new one(s).
- Virtual control characters?
- An increment to embedding level?
- Overriding a character's Unicode bidi category?
...
It's an interesting discussion, but I don't have much to say now.
2. How will it be attached to the text?
It seems that Emacs has exactly two ways to attach metadata to text:
- Text properties
- Overlays (implemented by pairs of markers?)
Almost all metadata can be attached by both, although a few (e.g.
before/after-string) only work on overlays.
Also, there are several indirection mechanisms, notably faces:
a lot of metadata (such as font size) is not directly attached to
text/overlays but to a face object (which is a property of text/overlay).
Let's concentrate on question 2 here.

Martin J. Dürst

2010-06-30 01:38:34 UTC

Hello Eli, others,

Date: Tue, 29 Jun 2010 17:26:56 +0900
I think I have mentioned this before, but we have been doing some work
in the area of rendering HTML/XML source with bidi text. Please see
http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi/ and the links from
there. (it still has quite a few problems, mostly related to editing Web
pages and JavaScript,...)
It also looks like we might get around to work on transposing our
solutions to Emacs this year. It's too early to promise anything, but
we'll try our best. And we will certainly be glad to get help and advice
from this list.

Thanks. You did mention this before, and I've read those pages.

Thanks.

Post by Eli Zaretskii
However, it is difficult to judge their applicability to Emacs,
because virtually nothing is said regarding the implementation, except
that it "uses overlays".

Actually, we haven't done an implementation for Emacs yet, and I don't
think we are saying that we did in any place. If we had done an
implementation already, then we sure would have made it available for
others to use.