Thursday, March 31, 2011

How do I determine file encoding in OSX?

I'm trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn't seem to understand them. Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I've never seen before: an "@" by the file listing:

-rw-r--r--@  1 me      users      2021 Feb 11 18:05 my_file.tex

(And, yes, I'm using \usepackage[utf8]{inputenc} in the LaTeX.)

I've found iconv, but that doesn't seem to be able to tell me what the encoding is -- it'll only convert once I figure it out.

From stackoverflow
  • The @ means that the file has extended file attributes associated with it. You can query them using the getxattr() function.

    There's no definitive way to detect the encoding of a file; this answer explains why.

    There's a command line tool, enca, that attempts to guess the encoding. You might want to check it out.

    James A. Rosen : I was assuming that OSX stored the encoding as meta-data. I understood the file contents were just a cluster of bits and had no inherent encoding.
  • Which LaTeX are you using? When I was using teTeX, I had to manually download the unicode package and add this to my .tex files:

    % UTF-8 stuff
    \usepackage[notipa]{ucs}
    \usepackage[utf8x]{inputenc}
    \usepackage[T1]{fontenc}
    

    Now that I've switched to XeTeX from the TeX Live 2008 distribution, it is even simpler:

    % UTF-8 stuff
    \usepackage{fontspec}
    \usepackage{xunicode}
    

    As for detecting a file's encoding, you could try file(1), though it is rather limited; as someone else said, reliable detection is difficult.

  • A brute-force way to check the encoding is to open the file in a hex editor (or write a program to do the check) and look at the raw bytes. UTF-8 is fairly easy to recognize: all ASCII characters are single bytes with values below 128 (0x80), and multibyte sequences start with a lead byte in the range 0xC2 to 0xF4 followed by one to three continuation bytes in the range 0x80 to 0xBF, as shown in the wiki article.

    If you can find a simpler way to get a program to verify the encoding for you, that's obviously a shortcut, but if all else fails, this would do the trick.

  • Classic 8-bit LaTeX is very restricted in which UTF-8 characters it can use; it's highly dependent on the encoding of the font you're using and which glyphs that font has available.

    Since you don't give a specific example, it's hard to know exactly where the problem is — whether you're attempting to use a glyph that your font doesn't have or whether you're not using the correct font encoding in the first place.

    Here's a minimal example showing how a few UTF-8 characters can be used in a LaTeX document:

    \documentclass{article}
    \usepackage[T1]{fontenc}
    \usepackage{lmodern}
    \usepackage[utf8]{inputenc}
    \begin{document}
    ‘Héllø—thêrè.’
    \end{document}
    

    You may have more luck with the [utf8x] encoding, but be slightly warned that it's no longer supported and has some idiosyncrasies compared with [utf8] (as far as I recall; it's been a while since I've looked at it). But if it does the trick, that's all that matters for you.

  • The @ sign means the file has extended attributes. xattr file shows what attributes it has, xattr -l file shows the attribute values too (which can be large sometimes — try e.g. xattr /System/Library/Fonts/HelveLTMM to see an old-style font that exists in the resource fork).

  • Typing file myfile.tex in a terminal can sometimes tell you the encoding and type of a file; it works from a series of heuristics and magic numbers. It's fairly useful, but its output is a guess, so don't treat it as definitive.

    A Localizable.strings file (found in localised Mac OS X applications) is typically reported to be a UTF-16 C source file.

  • On OS X, the -I (that's a capital i) option of the file command prints the file's MIME type together with its guessed encoding.

    file -I {filename}

    Lo'oris : wrong: -i, not -I
    Casebash : I needed to use -I
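
The disagreement in the comments above looks like a platform difference: as far as I know, the file command shipped with OS X uses -I for MIME output, while the builds found on most Linux systems use -i. Here is a minimal Python sketch that picks the flag by platform. The names mime_flag and guess_encoding are made up for illustration, and the flag mapping is an assumption you should check against man file on your own machine.

```python
import platform
import subprocess


def mime_flag(system: str) -> str:
    """Pick the short option that makes file(1) print MIME type and charset."""
    # Assumption: Apple's file(1) uses -I for this; other common builds use -i.
    return "-I" if system == "Darwin" else "-i"


def guess_encoding(path: str) -> str:
    """Return file(1)'s one-line guess for the given path."""
    # -b asks for brief output, i.e. without the leading file name.
    result = subprocess.run(
        ["file", "-b", mime_flag(platform.system()), path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling guess_encoding("my_file.tex") should return a string of the form "<mime type>; charset=<guess>". It is still only a guess, as the earlier answers point out.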

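The "write a program to check" idea from the hex-editor answer can be sketched in a few lines of Python. This is only a sketch under simple assumptions: it relies on Python's strict UTF-8 decoder to reject malformed byte sequences instead of inspecting the bit patterns by hand, and the function names are hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte is 0x80 or above, i.e. not plain ASCII."""
    return any(b >= 0x80 for b in data)
```

To try it, read the file in binary mode, for example data = open("my_file.tex", "rb").read(), and pass data to both functions. A file for which has_non_ascii is False is plain ASCII and therefore also valid UTF-8. A file that passes looks_like_utf8 and contains non-ASCII bytes is very likely UTF-8, although that is a strong hint rather than proof.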