Making iText work with Indic scripts - The Tech Tales: Chronicles of my tech journey

Why does iText not work properly for Indic scripts?

There are a number of threads floating around asking why iText does not render Indian languages properly. The reason is that iText does not handle ligature substitution.

What is Ligature Substitution?

In Indian languages like Bangla and Hindi, two or more characters sometimes merge to form a single Glyph.

Bangla Example:

ক + ্ + ষ = ক্ষ
ক + ্ + ষ + ্ + ম = ক্ষ্ম
ল + ্ + ল = ল্ল

Hindi Example:

क + ् + ष = क्ष
क + ् + ष + ् + म = क्ष्म
ल + ् + ल = ल्ल

This means that whenever one of these character sequences occurs in the text, it has to be replaced with a single glyph.

If you see only boxes above, your browser cannot display these scripts. Upgrade to one that can handle Unicode.
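
To make the point concrete, here is a minimal sketch (the class name is hypothetical and this is not iText code) that lists the code points stored in the string for the Bangla conjunct ক্ষ. The text carries three separate code points even though a correct renderer draws one glyph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of iText.
class ConjunctCodePoints {

    /** Lists the Unicode code points of the given text in U+XXXX notation. */
    static List<String> codePoints(String text) {
        List<String> result = new ArrayList<>();
        // codePoints() walks the string one Unicode code point at a time
        text.codePoints().forEach(cp -> result.add(String.format("U+%04X", cp)));
        return result;
    }

    public static void main(String[] args) {
        // KA + HASANTA + SSA, which should be drawn as the single conjunct
        System.out.println(codePoints("\u0995\u09CD\u09B7"));
    }
}
```

Running `main` prints `[U+0995, U+09CD, U+09B7]`, which is exactly the sequence that has to be mapped to a single glyph.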

Where do we get the information about which Glyphs are to be substituted?

This information is available in the OpenType font file (note that OpenType fonts can have the extension .ttf, which is also used for TrueType fonts). An OpenType font has a table called the Glyph Substitution Table (GSUB). It's pretty cryptic and obfuscated, and reading it is basically a wild goose chase. But at the end of it, you get a list of the glyph sequences that should each be replaced by a single glyph. The specification can be found here: http://www.microsoft.com/typography/otspec/gsub.htm
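
As a first step of that chase, the GSUB table has to be located through the table directory at the start of the font file. The sketch below is my own illustration, not the iText reader: the class and method names are hypothetical, and it assumes a single-font file (not a .ttc collection). It relies only on the documented layout: a 12-byte header followed by 16-byte table records, each holding a tag, checksum, offset and length.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not iText code: finds a table in an OpenType table directory.
class OpenTypeTableFinder {

    /** Returns the byte offset of the table with the given 4-character tag, or -1 if absent. */
    static long findTableOffset(byte[] font, String tag) {
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, as OpenType requires
        buf.position(4);                        // skip sfntVersion
        int numTables = buf.getShort() & 0xFFFF;
        buf.position(12);                       // skip searchRange, entrySelector, rangeShift
        byte[] tagBytes = new byte[4];
        for (int i = 0; i < numTables; i++) {
            buf.get(tagBytes);                  // table tag, e.g. "GSUB"
            buf.getInt();                       // checksum (ignored in this sketch)
            long offset = buf.getInt() & 0xFFFFFFFFL;
            buf.getInt();                       // length (ignored in this sketch)
            if (tag.equals(new String(tagBytes, StandardCharsets.US_ASCII))) {
                return offset;
            }
        }
        return -1;
    }
}
```

With the font loaded into a byte array (for example via `Files.readAllBytes`), `findTableOffset(bytes, "GSUB")` gives the position at which the script, feature and lookup lists of the substitution table begin.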

Inner workings of iText

The best part about iText is that it is open source.
This is the Github location: https://github.com/itext/itextpdf
At the heart of converting text to PDF is the TrueTypeFont class. This parses the actual FontFile and reads various information like the Character to Glyph mappings (cmap), the Glyph metrics, etc. Then, we have the convertToBytes() method in the FontDetails class, which actually converts each character into the Glyph code and writes it to PDF.

Integration of the GlyphSubstitutionTable data with iText

The GlyphSubstitutionTableReader class parses the FontFile and gleans the Glyph substitution information, and returns a Map, where the key is the String of composite characters and value is the Glyph object.
Then, in the FontDetails::convertToBytes() method, tokenise the input String based on the composite characters.
Replace the composite characters by their respective Glyphs.
For characters that do not need substitution, proceed normally and replace them with their corresponding Glyph.
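
The steps above can be sketched roughly as follows. This is a simplified illustration and not the actual patch: the class name, the two maps and all glyph ids are made up, and plain integers stand in for the Glyph objects. It tries the longest composite sequence first, so that ক্ষ্ম wins over its prefix ক্ষ, and falls back to the normal character-to-glyph mapping otherwise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and glyph ids are hypothetical, not taken from iText or a real font.
class LigatureSubstitutionSketch {

    static List<Integer> toGlyphIds(String text,
                                    Map<String, Integer> ligatures,
                                    Map<Character, Integer> cmap) {
        // Length of the longest composite sequence we might have to match
        int maxLen = 1;
        for (String key : ligatures.keySet()) {
            maxLen = Math.max(maxLen, key.length());
        }

        List<Integer> glyphs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            boolean substituted = false;
            // Try the longest candidate first so a longer conjunct beats its own prefix
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                Integer ligature = ligatures.get(text.substring(i, i + len));
                if (ligature != null) {
                    glyphs.add(ligature);
                    i += len;
                    substituted = true;
                    break;
                }
            }
            if (!substituted) {
                // No substitution applies: use the ordinary mapping (glyph 0 is .notdef)
                glyphs.add(cmap.getOrDefault(text.charAt(i), 0));
                i++;
            }
        }
        return glyphs;
    }
}
```

A real implementation works with whatever the GSUB lookups of the particular font provide, so the map contents differ from font to font.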

Test Harness

The following is the test harness for testing out my fix.

import java.io.FileOutputStream;
import java.io.IOException;
import org.junit.Test;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfWriter;
/**
 * Inspired by http://itextpdf.com/examples/iia.php?id=158
 *
 * @author paawak
 */
public class BanglaPdfGenerationTest {
    /**
     * The unicode of this is given below:
     *
     * \u0986\u09ae\u09bf \u0995\u09cb\u09a8 \u09aa\u09a5\u09c7
     * \u0995\u09cd\u09b7\u09c0\u09b0\u09c7\u09b0 \u09b2\u0995\u09cd\u09b7\u09cd\u09ae\u09c0 \u09b7\u09a8\u09cd\u09a1
     * \u09aa\u09c1\u09a4\u09c1\u09b2 \u09b0\u09c1\u09aa\u09cb
     * \u0997\u0999\u09cd\u0997\u09be \u098b\u09b7\u09bf
     *
     */
    private static final String BANGLA_TEXT = "আমি কোন পথে ক্ষীরের লক্ষ্মী ষন্ড পুতুল রুপো গঙ্গা ঋষি";
    public void createPdf(String filename) throws DocumentException, IOException {
        // step 1
        Document document = new Document();
        // step 2
        PdfWriter.getInstance(document, new FileOutputStream(filename));
        // step 3
        document.open();
        // step 4
        Paragraph paragraph = new Paragraph();
        paragraph.add(new Phrase(BANGLA_TEXT, new Font(BaseFont.createFont("/usr/share/fonts/lohit-bengali/Lohit-Bengali.ttf", BaseFont.IDENTITY_H, true))));
        document.add(paragraph);
        // step 5
        document.close();
    }
    @Test
    public void testGenerate() throws IOException, DocumentException {
        String fileName = System.getProperty("user.home") + "/a.pdf";
        new BanglaPdfGenerationTest().createPdf(fileName);
    }
}

Before Fix

After Fix

Source

The changes were made against itextpdf-5.4.0-SNAPSHOT, revision 5638. Please note that the jar below will not work in most cases, as it is only half-baked.

The sources are here
The patch is here
The jar file is here

Next Steps

If you look closely, you will notice that the i-kar, e-kar and o-kar are still not displayed in their proper positions. I am convinced that this is because we need to read the positioning data from the GPOS, the Glyph Positioning Table. That is my next task. Stay tuned!

Update: Why is the latest iText still not working?

My code is commented out in the latest iText, as it seems to be interfering with some of their core functionalities.

How do I make it work?

Download the iText source from Github:
https://github.com/itext/itextpdf
After getting the source, uncomment the readGsubTable() line in TrueTypeFontUnicode.java so that the method looks like this:

@Override
void process(byte ttfAfm[], boolean preload) throws DocumentException, IOException {
    super.process(ttfAfm, preload);
    // the line below must be uncommented for Indic scripts to work
    readGsubTable();
}

Building it with Maven should be pretty straightforward. Cheers!
