topic Re: Text Encoding in .NET Forum

Text Encoding

alex_b — Fri, 08 Aug 2014 13:17:42 GMT

Hi,

I have some legacy drawings whose texts use a legacy .shx font, with extended codes for the Hebrew alphabet.

The codes run from 0x80 through 0x9a, one per Hebrew letter (27 codes, counting the five final forms).

The texts display fine in R2004 and above.

Now if I try extracting the DBText.TextString of the text object, I get gibberish.

Tried everything to convert the string, without success.

I have a lisp function which performs the needed conversion by adding 0x60 to each character value, but, while it works fine in lisp, it fails in C#.

Am I missing something?

BTW, if I change the text font to a Windows font, it will display the same gibberish and only after running the lisp function mentioned above will it display correctly and then, of course, the extracted string will be OK too.

The following is the kind of string DBText.TextString returns on the original text object (it should be Hebrew, but it's obviously not):

316 „ˆ‘…˜‰ ‡” ‰…”‰–

It seems that TextString forces the extended ASCII to Unicode and does a poor job of it.
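
One way to check the hypothesis above is to dump the UTF-16 code units of the returned string. The sketch below is only an illustration: it uses three characters from the sample string, written as escapes, and the class and method names (CodeUnitDump, DumpCodeUnits) are made up for this example and are not part of the AutoCAD API.

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Formats each UTF-16 code unit of s as U+XXXX, separated by spaces.
    public static string DumpCodeUnits(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // The first three characters of the sample, written as escapes so the
        // source file's own encoding cannot interfere.
        string sample = "\u201E\u02C6\u2018";
        Console.WriteLine(DumpCodeUnits(sample)); // U+201E U+02C6 U+2018
    }
}
```

None of these values lies between 0x80 and 0x9a, so a range test applied directly to the char values never matches, even though the glyphs are the ones Windows-1252 assigns to bytes 0x84, 0x88 and 0x91.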

Thanks,

alex

Re: Text Encoding

Anonymous — Mon, 11 Aug 2014 20:51:01 GMT

Hi Alex,

I guess you need to write your own C# code to loop over each char in the string: if its code is >= 128 (0x80) and <= 154 (0x9a), add 96 (0x60) to it.

Bigfont .shx files were introduced in legacy AutoCAD, before Unicode, so their encoding does not match Unicode. We may need a conversion from .shx to .ttf for character codes >= 128. See the test code below:

public static void TestConvertShxToUnicode()
{
    string text = "˜‰ ‡"; // DBText.TextString
    int convertCode = 96;
    text = ConvertShxToUnicode(text, convertCode);
}

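// Adds convertCode to every char whose UTF-16 code is >= startCode.
// Note that there is no upper bound: every char at or above startCode is shifted.
private static string ConvertShxToUnicode(string text, int convertCode, int startCode = 128)
{
    var result = new System.Text.StringBuilder();
    foreach (char c in text)
    {
        char ch = c;
        if (c >= startCode)
        {
            ch = (char)((int)c + convertCode);
        }
        result.Append(ch);
    }
    return result.ToString();
}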

You should provide a simple test drawing with Hebrew letters and their .shx font to test.

Re: Text Encoding

alex_b — Tue, 12 Aug 2014 16:14:59 GMT

Hi Khoa,

This is basically just what I'm doing, and it definitely doesn't work.

I'm trying to just add 0x60 to each char between 0x80 and 0x9a.

Curiously enough, the following lisp code works and the result is as expected:

(defun heb2win (ent / ed textval slen lptr txtout tcr) ; ent: entity name of the text
  (setq ed      (entget ent)
        textval (cdr (assoc 1 ed)) ; group code 1 holds the text string
        slen    (strlen textval)
        lptr    1
        txtout  ""
  )
  (while (<= lptr slen)
    (setq tcr (ascii (substr textval lptr 1)))
    ;; shift the legacy Hebrew range 128..154 up by 96
    (if (and (>= tcr 128) (<= tcr 154)) (setq tcr (+ tcr 96)))
    (setq txtout (strcat txtout (chr tcr)))
    (setq lptr (+ lptr 1))
  ) ;;while
  (setq ed (entget ent)
        ed (subst (cons 1 txtout) (assoc 1 ed) ed)
  )
  (entmod ed)
  (entupd ent)
)

I think the framework does some behind-the-scenes guesswork, perhaps based on the system code page, which alters the string returned by DBText.TextString, while lisp returns the raw chars.

I tried using different encodings, and the byte array obtained from the string varies with the encoding, sometimes erratically.
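
Some variation is to be expected: GetBytes re-encodes the Unicode characters already held in the string, so each encoding yields its own byte sequence. A minimal sketch with three encodings that ship with every .NET runtime, applied to one character from the sample string (the class and method names are illustrative only):

```csharp
using System;
using System.Text;

public static class EncodingBytesDemo
{
    // Returns the bytes of s under the given encoding as hex pairs, e.g. "E2-80-9E".
    public static string Hex(Encoding encoding, string s)
    {
        return BitConverter.ToString(encoding.GetBytes(s));
    }

    public static void Main()
    {
        string s = "\u201E"; // the low double quote from the sample string
        Console.WriteLine(Hex(Encoding.UTF8, s));    // E2-80-9E
        Console.WriteLine(Hex(Encoding.Unicode, s)); // 1E-20 (UTF-16, little-endian)
        Console.WriteLine(Hex(Encoding.ASCII, s));   // 3F ('?', no ASCII equivalent)
    }
}
```

None of the three results is the single byte 0x84 that the legacy font expects; a byte like that can only come back from an encoding that matches the code page used when the string was decoded.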

I attach a sample drawing and the two relevant .shx fonts; the third font used is Arial.

The drawing contains three copies of the same text, each one based on a different font.

As you can see, the original text, based on the old font, is legible; the other two are not because of the missing 0x60 shift.

If you run the lisp routine, it fixes things, while in C# I fail.

Thank you,

alex

Re: Text Encoding

Anonymous — Tue, 12 Aug 2014 20:53:27 GMT

Here is the code to convert your texts from Unicode Latin to Hebrew. The .NET framework does a great job on conversion. Credit to StackOverflow at the link

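using System.Text;
using Autodesk.AutoCAD.ApplicationServices;
using Autodesk.AutoCAD.DatabaseServices;
using Autodesk.AutoCAD.EditorInput;
using Autodesk.AutoCAD.Runtime;

[CommandMethod("ConvertLatinToHebrew")]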
public static void ConvertLatinToHebrew()
{
    Document doc = Application.DocumentManager.MdiActiveDocument;
    Editor editor = doc.Editor;
    Database db = doc.Database;
    using (Transaction trans = db.TransactionManager.StartTransaction())
    {
        var peo = new PromptEntityOptions("Select a text: ");
        peo.SetRejectMessage("\nSelect only text");
        peo.AddAllowedClass(typeof(DBText), true);
        PromptEntityResult result = editor.GetEntity(peo);
        if (result.Status == PromptStatus.OK)
        {
            ObjectId id = result.ObjectId;
            var text = (DBText)trans.GetObject(id, OpenMode.ForWrite);
            string value = text.TextString;

            value = ConvertLatinToHebrew(value);

            text.TextString = value;
            text.DowngradeOpen();
        }
        trans.Commit();
    }
}

[CommandMethod("ConvertLatinToHebrewTest")]
public static void ConvertLatinToHebrewTest()
{
    string latinText = "„ˆ‘…˜‰ ‡” ‰…”‰–";
    string hebrewText = ConvertLatinToHebrew(latinText);
    // hebrewText = "הטסורינ חפ יופיצ";
}

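public static string ConvertLatinToHebrew(string latinText)
{
    // Encoding with Windows-1252 turns the Latin-looking characters back into single bytes.
    Encoding latinEncoding = Encoding.GetEncoding("Windows-1252");
    // Code page 862 places the Hebrew letters at 0x80..0x9A.
    Encoding hebrewEncoding = Encoding.GetEncoding(862); // MS-DOS Hebrew

    byte[] latinBytes = latinEncoding.GetBytes(latinText);

    // Reinterpret the same bytes as Hebrew.
    string hebrewText = hebrewEncoding.GetString(latinBytes);
    return hebrewText;
}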

I tried many different ways to encode the text to Hebrew letters and could not find a rule based on arithmetic. The character codes used in the previous code do not help, because the string holds extended Unicode characters rather than the original byte values. Anyway, .NET makes the conversion easier.

Re: Text Encoding

alex_b — Thu, 14 Aug 2014 09:30:41 GMT

Hi Khoa,

Thank you for the code you posted. It works OK, except it makes the hardcoded assumption that the source string is Unicode Latin.

Among other things, it means that running the code twice in succession on the same text results in gibberish.

As the program I am writing is not interactive, it blindly processes all texts and therefore needs to know beforehand whether a text's encoding has to be converted.

The lisp function I posted does just that (it looks for a certain range of chars).
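
A range test of that kind can be approximated on the .NET side with a lookup table instead of arithmetic. The sketch below is not an AutoCAD API; it assumes, without confirmation from the thread, that TextString decoded the stored bytes with Windows-1252, and the names LegacyHebrew and ToHebrew are hypothetical. It maps only the 27 characters that Windows-1252 assigns to bytes 0x80..0x9A and leaves everything else alone, so a second pass over already converted text changes nothing.

```csharp
using System;
using System.Text;

public static class LegacyHebrew
{
    // Characters that Windows-1252 assigns to bytes 0x80..0x9A, in byte order.
    // Bytes 0x81, 0x8D, 0x8F and 0x90 are undefined in Windows-1252; Windows
    // decoders typically pass them through as the control characters with the
    // same value (assumed here).
    private const string Cp1252Images =
        "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021\u02C6\u2030" +
        "\u0160\u2039\u0152\u008D\u017D\u008F\u0090\u2018\u2019\u201C" +
        "\u201D\u2022\u2013\u2014\u02DC\u2122\u0161";

    // Replaces each character found in the table with the Hebrew letter at the
    // same offset from U+05D0 (alef); all other characters pass through unchanged.
    public static string ToHebrew(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            int offset = Cp1252Images.IndexOf(c);
            sb.Append(offset >= 0 ? (char)(0x05D0 + offset) : c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string once = ToHebrew("316 \u201E\u02C6\u2018");
        Console.WriteLine(once == "316 \u05D4\u05D8\u05E1"); // True
        Console.WriteLine(ToHebrew(once) == once);           // True: second pass is a no-op
    }
}
```

One caveat with this design: a text that legitimately contains one of the table characters, such as an en dash or a curly quote, would be converted as well, so a batch job may also want to check which font the text style uses before calling it.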

The problem is that DBText.TextString always returns a Unicode string, seemingly irrespective of the string's encoding in the AutoCAD database, whereas, to do the same processing the lisp does, I need the ANSI chars as AutoCAD sees them.

Do you know of a way to get the string as AutoCAD sees it, and not converted to Unicode in an encoding I don't control?

One more problem is changing a text's style from a Unicode font to an ANSI one. Again, the .NET function results in problems while the lisp function is OK.

I even tried to run the lisp via P/Invoke, but there were no visible results. How does one debug through a lisp invoked from .NET? Even (print) statements in the lisp wouldn't work, or do they?

Thanks,

alex