Unicode Category of Glyph



  • Hey there,

    I am looking for the best/easiest way to get the unicode and the unicode category of a glyph. The unicode itself is part of the RGlyph object, but not the unicode category, right?

    I tried investigating @erik’s glyphNameFormatter and @jens’s RFUnicodeInfo, but have not found the right thing yet. Should I be looking elsewhere?

    Thanks in advance for your help!


  • admin

    hello @benedikt

    to get the unicode for a glyph you can use RGlyph.autoUnicodes.

    to get unicode categories I would try using the data provided by the Unicode Consortium. I’ve found these:

    you can write a script to load data from Categories.txt, and then search for the unicode value to get its category.

    hope this helps! let us know if it works…


  • admin

    there’s a catch: in order to look up the category, you’ll need to convert the unicode value from an integer to a hexadecimal string:

    # load categories data from txt file
    filePath = '/Users/gferreira/Desktop/Categories.txt'
    with open(filePath, 'r') as f:
        rawData = f.readlines()
    
    # convert raw data into dict
    categories = {}
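    for line in rawData:
        # strip the trailing newline before splitting on tabs
        fields = line.rstrip('\n').split('\t')
        # skip blank or malformed lines (7 tab-separated fields expected)
        if len(fields) != 7:
            continue
        uni, gc, level1, level2, level3, level4, name = fields
        categories[uni] = level1, level2, level3, level4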
    
    # get unicode for glyph
    g = CurrentGlyph()
    g.autoUnicodes()
    print(g.name)
    print(g.unicode)
    
    if g.unicode is not None:
    
        # convert unicode integer to hexadecimal
        uni = "%X" % g.unicode
        uni = uni.zfill(4)
        print(uni)
    
        # get category for unicode value
        if uni in categories:
            print(categories[uni])
    
    >>> fi
    >>> 64257
    >>> FB01
    >>> ('Letter', 'Ligature', '', '')
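
The integer-to-hex step can also be folded into a single format call. Below is a minimal sketch, assuming the same seven tab-separated columns as the script above. The sample row and the helper names `parse_categories` and `lookup` are made up for illustration; they are not part of RoboFont or of the Unicode data files.

```python
# One illustrative row in the assumed layout:
# code, general category, four category levels, character name.
SAMPLE = "FB01\tLl\tLetter\tLigature\t\t\tLATIN SMALL LIGATURE FI\n"


def parse_categories(text):
    """Build {hex code string: (level1, level2, level3, level4)}."""
    categories = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        uni, gc, level1, level2, level3, level4, name = fields
        categories[uni] = (level1, level2, level3, level4)
    return categories


def lookup(categories, codepoint):
    """Look up an integer code point; keys are uppercase hex, zero-padded to 4 digits."""
    return categories.get(format(codepoint, "04X"))


categories = parse_categories(SAMPLE)
result = lookup(categories, 64257)  # 64257 == 0xFB01
print(result)
```

Using `.get()` means a code point missing from the data returns `None` instead of raising a `KeyError`.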
    

  • admin

    I guess glyphNameFormatter has the data you are looking for 🙂

    from glyphNameFormatter import GlyphName
    
    g = GlyphName(65)
    print(g.uniRangeName)
    g = GlyphName(1234)
    print(g.uniRangeName)
    

  • admin

    @frederik this returns the unicode range for a glyph, not the unicode category 🙂

    for example, the first codepoint in your script belongs to the Basic Latin range, and the second to Cyrillic. but both glyphs belong to the category Letter.
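
The difference can be checked with Python's standard `unicodedata` module, which reports the two-letter general category of a character. This is an alternative to glyphNameFormatter, not something it provides. A small sketch using the two code points from the script above:

```python
import unicodedata

results = {}
for codepoint in (65, 1234):
    char = chr(codepoint)
    # category() returns the general category abbreviation, e.g. 'Lu'
    results[codepoint] = unicodedata.category(char)
    print(f"U+{codepoint:04X}", unicodedata.name(char), results[codepoint])
```

Both code points should report a category starting with `L` (Letter), even though they sit in different Unicode ranges.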


  • admin

    try:

    from glyphNameFormatter.data import unicodeCategories
    
    print(unicodeCategories[65])
    


  • Wow, awesome. That’s exactly what I was looking for. Thanks for the pointers!



  • @benedikt
    FWIW, the glyphNameFormatter in RF has a couple of functions that might be useful. It uses the glyphNamesToUnicodeAndCategories.txt names list, which is buried somewhere in RF. This is not the full Unicode list (I think CJK is not included), but it contains a lot of good stuff.

    import glyphNameFormatter.reader
    name = "flyingSaucer"
    value = 0x1F6F8
    
    # unicode to name
    print(glyphNameFormatter.reader.u2n(value))
    > "flyingSaucer"
    
    # name to unicode
    print(glyphNameFormatter.reader.n2u(name))
    > 128760
    
    # unicode to category
    print(glyphNameFormatter.reader.u2c(value))
    > "So"
    
    # name to category
    print(glyphNameFormatter.reader.n2c(name))
    > "So"
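
The two-letter codes shown above appear to be Unicode general category abbreviations: the first letter is the major class and the second the subclass, so `So` reads as "Symbol, other". Below is a hypothetical helper (not part of glyphNameFormatter) that expands the major class:

```python
# Major classes of the Unicode general category, keyed by first letter.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def major_class(category):
    """Return the major class name for a code such as 'So' or 'Lu'."""
    return MAJOR_CLASSES.get(category[:1], "Unknown")


print(major_class("So"))
print(major_class("Lu"))
```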
    
