
Native Bengali here. The ligature used for the last letter of "haTaat" (suddenly) is not the same as the last ligature in "aditya" - the latter doesn't have the circle at the top.

More generally, using the vowel-silencing diacritic (hasanta) along with a separate ligature for the vowel ending - while theoretically correct - does not work because no one writes that way! Not using the proper ligatures makes the text essentially unreadable.
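
For anyone who wants to look at the code points involved, here is a minimal sketch using Python's standard `unicodedata` module. It assumes the letter "with the circle at the top" is the one Unicode names KHANDA TA, which the comment above does not state explicitly; the variable names are made up for illustration.

```python
import unicodedata

def describe(text):
    """Return 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Assumption: these are the two encodings being contrasted.
ta_hasanta = "\u09a4\u09cd"  # letter TA followed by the hasanta (Unicode calls it VIRAMA)
khanda_ta = "\u09ce"         # the dedicated KHANDA TA letter

print(describe(ta_hasanta))
print(describe(khanda_ta))

# Normalization does not turn one sequence into the other.
print(unicodedata.normalize("NFC", ta_hasanta) == khanda_ta)
```

This only shows that the two are stored as different sequences. How either one is drawn still depends on the font and the shaping engine.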



I don't understand Bengali at all. I'm trying to understand your second sentence though. When you say "no one writes that way", do you mean nobody hits the keys for letter, followed by vowel-silencing diacritic, followed by another vowel? Or do you mean the glyph that results from that combination of keystrokes doesn't match how a Bengali speaker would write it on paper?

If it's the latter, isn't that an issue for the text input system to deal with? Unicode does not need to represent how users input text, it merely needs to represent the content of the text. For example, in OS X, if I press option-e for U+0301 COMBINING ACUTE ACCENT, and then type "e", the resulting text is not U+0301 U+0065. It's actually U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which can be decomposed into U+0065 U+0301. And in both forms (NFC and NFD), the Unicode code point sequence does not match the order in which I pressed the keys on the keyboard.
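
The NFC/NFD point can be checked with Python's standard `unicodedata` module. This is just a sketch of the normalization step; it does not model the OS X keyboard handling, and the helper name is my own.

```python
import unicodedata

def codepoints(text):
    """List the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

composed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)

print(codepoints(nfc))  # composed form: a single code point
print(codepoints(nfd))  # decomposed form: base letter, then combining accent

# Going the other way, "e" + combining acute composes back under NFC.
print(unicodedata.normalize("NFC", "e\u0301") == composed)
```

In both forms the accent never comes first, even though the accent key was pressed first.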

So given that, shouldn't this issue be solved for Bengali at the text input level? If it makes sense to have a dedicated keystroke for some ligature, that can be done. Or if it makes sense to have a single keystroke that adds both the vowel-silencing diacritic + vowel ending, that can be done as well.

---

If the previous assumption was wrong and the issue here is that the rendered text doesn't match how the user would have written it on paper, then that's a different issue. But (again, without knowing anything about Bengali so I'm running on a lot of assumptions here) is that still Unicode's fault? Or is it the fault of the font in question for not having a ligature defined that produces the correct glyph for that sequence of codepoints?


It has to do with how the text is rendered. For example, if you see the Bengali text on page 3 of this PDF:

http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf

it is unreadable and incorrect Bengali. ;-)


But Unicode emphatically does not define a rendering, a glyph.

To me it sounds like the rendering needs to be fixed, not Unicode.


Unicode is about _code points_, not what is actually rendered. It recommends what should be shown, but how to render it is up to the font.


On page 3 I see two different pieces of Bengali. One is text, and the other is an image. I assume you're referring to the text? What makes it wrong? And what software are you using to view the PDF? If it's wrong, it's quite possible that the software you're using doesn't render it correctly, rather than the document actually being wrong.


I am referring to the first line of Bengali text.

I have viewed it in Acrobat Reader, epdfview, Chrome's PDF reader, Firefox's PDF reader and my iPhone's PDF reader.

The problems are that joint-letter ligatures are not used, and several vowel signs are placed after the consonant when they should have been placed before it.
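
To illustrate the vowel-sign point: as far as I understand it, Unicode stores a Bengali vowel sign after its consonant even when it is drawn to the left of it, and the renderer is expected to reorder it. A small sketch with Python's standard `unicodedata` module, using the syllable "ki" as my own example (it is not taken from the PDF):

```python
import unicodedata

# Stored (logical) order: consonant first, then the vowel sign.
syllable = "\u0995\u09bf"  # KA followed by VOWEL SIGN I

names = [unicodedata.name(ch) for ch in syllable]
for ch, name in zip(syllable, names):
    print(f"U+{ord(ch):04X} {name}")
```

A renderer that draws these two in stored order, without reordering, would put the vowel sign on the wrong side of the consonant, which matches the symptom described above.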


Your description makes it sound like the text in the PDF is merely incorrect, rather than there being any issue with Unicode.



