Unicode Category of Glyph

benedikt

Hey there,

I am looking for the best/easiest way to get the unicode and the unicode category of a glyph. The unicode itself is part of the RGlyph object, but not the unicode category, right?

I tried investigating @erik’s glyphNameFormatter and @jens’ RFUnicodeInfo, but did not find the right thing yet. Should I be looking elsewhere?

Thanks in advance for your help!

frederik

Try this:

from glyphNameFormatter.data import unicodeCategories

print(unicodeCategories[65])

erik

@benedikt
FWIW, the glyphNameFormatter in RF has a couple of functions that might be useful. It uses the glyphNamesToUnicodeAndCategories.txt names list, which is buried somewhere in RF. This is not the full Unicode list (I think CJK is not included), but it contains a lot of good stuff.

import glyphNameFormatter.reader

name = "flyingSaucer"
value = 0x1F6F8

# unicode to name
print(glyphNameFormatter.reader.u2n(value))
# prints: flyingSaucer

# name to unicode
print(glyphNameFormatter.reader.n2u(name))
# prints: 128760

# unicode to category
print(glyphNameFormatter.reader.u2c(value))
# prints: So

# name to category
print(glyphNameFormatter.reader.n2c(name))
# prints: So
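The two-letter codes returned here (such as "So") are Unicode general category abbreviations, and the first letter gives the major class. A minimal sketch of expanding them into readable names follows. `MAJOR_CLASSES` and `major_class` are hypothetical helpers written for illustration, not part of glyphNameFormatter:

```python
# Major classes of the Unicode general category, keyed by the first letter
# of the two-letter code (hypothetical helper, not part of glyphNameFormatter).
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class(category):
    """Return the major class name for a code such as 'So' or 'Lu'."""
    if not category:
        return None
    return MAJOR_CLASSES.get(category[0])


print(major_class("So"))  # Symbol
print(major_class("Lu"))  # Letter
```

Something like this could be chained after `u2c` or `n2c` if you only care about the broad class.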

benedikt

Wow, awesome. That’s exactly what I was looking for. Thanks for the pointers!

gferreira

@frederik this returns the unicode range for a glyph, not the unicode category :)

for example, the first codepoint in your script belongs to the Basic Latin range, and the second to Cyrillic. but both glyphs belong to the category Letter.
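To make the range vs. category difference concrete, here is a small sketch that uses Python's standard `unicodedata` module rather than glyphNameFormatter (a swapped-in technique, not what the scripts in this thread use). The helper name `category_of` is made up for illustration. The codepoints 65 and 1234 from frederik's script sit in different ranges, but both should come back with a category starting with "L" (Letter):

```python
import unicodedata


def category_of(codepoint):
    """Return the two-letter Unicode general category for an integer codepoint."""
    return unicodedata.category(chr(codepoint))


for value in (65, 1234):
    # a category starting with "L" is some kind of Letter
    print(value, category_of(value))
```

Note that the answers depend on the Unicode version bundled with your Python, so very recent codepoints may be reported as unassigned.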

frederik

I guess glyphNameFormatter has the data you are looking for :)

from glyphNameFormatter import GlyphName

g = GlyphName(65)
print(g.uniRangeName)
g = GlyphName(1234)
print(g.uniRangeName)

gferreira

there’s a catch: in order to look up the category, you’ll need to convert the unicode value from integer to hex:

# load categories data from txt file
filePath = '/Users/gferreira/Desktop/Categories.txt'
with open(filePath, 'r') as f:
    rawData = f.readlines()

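# convert raw data into dict
categories = {}
for line in rawData:
    # drop the trailing newline and skip lines without the 7 expected fields
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 7:
        continue
    uni, gc, level1, level2, level3, level4, name = fields
    categories[uni] = level1, level2, level3, level4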

# get unicode for glyph
g = CurrentGlyph()
g.autoUnicodes()
print(g.name)
print(g.unicode)

if g.unicode is not None:

    # convert unicode integer to hexadecimal, zero-padded to at least 4 digits
    uni = "%04X" % g.unicode
    print(uni)

    # get category for unicode value
    if uni in categories:
        print(categories[uni])

>>> fi
>>> 64257
>>> FB01
>>> ('Letter', 'Ligature', '', '')
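
The same lookup can be tried without a font or a file on disk. This is only a sketch: the helper names `hex_key` and `parse_categories` are made up, and the sample line is hand-written to mimic the seven tab-separated columns that the script above unpacks, so the real Categories.txt may differ in detail.

```python
def hex_key(codepoint):
    """Format an integer codepoint as uppercase hex, zero-padded to 4 digits."""
    return "%04X" % codepoint


def parse_categories(lines):
    """Build {hex codepoint: (level1, level2, level3, level4)} from tab-separated lines."""
    categories = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 7:
            continue  # skip blank, comment, or malformed lines
        uni, gc, level1, level2, level3, level4, name = fields
        categories[uni] = (level1, level2, level3, level4)
    return categories


# hand-written sample in the assumed column layout (not copied from the real file)
sample = ["FB01\tLl\tLetter\tLigature\t\t\tLATIN SMALL LIGATURE FI\n"]
categories = parse_categories(sample)

key = hex_key(64257)
print(key)                  # FB01
print(categories.get(key))  # ('Letter', 'Ligature', '', '')
```

Using `.get()` instead of an `in` check returns `None` for codepoints that are missing from the data, which is handy when looping over a whole font.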

gferreira

hello @benedikt

to get the unicode for a glyph you can use RGlyph.autoUnicodes(), which tries to set the unicode value from the glyph name.

to get unicode categories I would try using the data provided by the Unicode Consortium. I’ve found these:

you can write a script to load data from Categories.txt, and then search for the unicode value to get its category.

hope this helps! let us know if it works…

SOLVED Unicode Category of Glyph