There are a number of threads floating around asking why iText does not render Indian languages properly. The reason is that iText does not handle ligature substitution.
In Indian languages like Bangla and Hindi, two or more characters sometimes merge to form a single Glyph.
ক + ্ + ষ = ক্ষ
ক + ্ + ষ + ্ + ম = ক্ষ্ম
ল + ্ + ল = ল্ল
क + ् + ष = क्ष
क + ् + ष + ् + म = क्ष्म
ल + ् + ल = ल्ल
This essentially means that whenever we get these composite characters, we need to replace them with a single glyph.
If you see only boxes above, upgrade your browser to one that can handle Unicode.
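To make the idea concrete, here is a minimal sketch of the substitution step in plain Java. It is not iText code. The list of sequences is hard-coded and hypothetical (a real one would come from the font), the class and method names are made up for illustration, and it assumes all characters are in the Basic Multilingual Plane. It walks the text and greedily picks the longest known sequence at each position, so that each resulting segment can be drawn as one glyph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LigatureSegmenterSketch {

    // Hypothetical list of sequences that the font replaces with a single glyph.
    // In a real font this information comes from the font file itself.
    static final Set<String> LIGATURES = new HashSet<>(Arrays.asList(
            "\u0995\u09CD\u09B7",             // ka + hasanta + ssa
            "\u0995\u09CD\u09B7\u09CD\u09AE", // ka + hasanta + ssa + hasanta + ma
            "\u09B2\u09CD\u09B2"));           // la + hasanta + la

    // Length in chars of the longest sequence listed above.
    static final int MAX_LENGTH = 5;

    /** Splits the text into segments, each of which maps to exactly one glyph. */
    static List<String> segment(String text) {
        List<String> segments = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 1; // default: a lone character keeps its own glyph
            // Try the longest candidate first so the 5-char ligature wins over the 3-char one.
            for (int len = Math.min(MAX_LENGTH, text.length() - i); len > 1; len--) {
                if (LIGATURES.contains(text.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            segments.add(text.substring(i, i + matched));
            i += matched;
        }
        return segments;
    }

    public static void main(String[] args) {
        // Two ligatures followed by a lone ka.
        String text = "\u0995\u09CD\u09B7\u09CD\u09AE\u09B2\u09CD\u09B2\u0995";
        System.out.println(segment(text).size() + " glyphs for " + text.length() + " characters");
    }
}
```

Trying the longest match first matters: otherwise the five-character sequence would be cut short by its own three-character prefix.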
This information is available in the OpenType font file (note that OpenType fonts can have the extension .ttf, which is also used for TrueType fonts). An OpenType font has a table called the Glyph Substitution Table (GSUB). It's pretty cryptic and obfuscated, and decoding it is basically a wild goose chase. But once you are through, you get a list of the glyph sequences that should each be replaced by a single glyph. The specification can be found here: http://www.microsoft.com/typography/otspec/gsub.htm
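As a first step in that chase, this is roughly how you can check whether a font has a GSUB table at all and where it starts. The sketch below is plain Java, not iText code, and the class and method names are made up. It assumes a single font file (not a .ttc collection) and only reads the table directory at the start of the file: a 12-byte header followed by 16-byte records holding a tag, a checksum, an offset and a length. Parsing the GSUB contents themselves is a much bigger job and is not attempted here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class GsubLocatorSketch {

    /** Returns {offset, length} of the table with the given 4-letter tag, or null if absent. */
    static long[] findTable(ByteBuffer font, String tag) {
        ByteBuffer buf = font.duplicate();       // big-endian, like the font format
        buf.position(4);                         // skip sfntVersion
        int numTables = buf.getShort() & 0xFFFF; // unsigned 16-bit table count
        buf.position(12);                        // skip searchRange, entrySelector, rangeShift
        byte[] tagBytes = new byte[4];
        for (int i = 0; i < numTables; i++) {
            buf.get(tagBytes);                   // table tag, e.g. "cmap", "GSUB"
            buf.getInt();                        // checksum, ignored in this sketch
            long offset = buf.getInt() & 0xFFFFFFFFL;
            long length = buf.getInt() & 0xFFFFFFFFL;
            if (tag.equals(new String(tagBytes, StandardCharsets.US_ASCII))) {
                return new long[] {offset, length};
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to a font file as the first argument.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        long[] gsub = findTable(ByteBuffer.wrap(data), "GSUB");
        System.out.println(gsub == null
                ? "No GSUB table found"
                : "GSUB at offset " + gsub[0] + ", length " + gsub[1]);
    }
}
```

If this reports no GSUB table, the font carries no substitution data and no amount of code on the PDF side will produce the ligatures.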
The best part about iText is that it's open source.
This is the Github location: https://github.com/itext/itextpdf
At the heart of converting text to PDF is the TrueTypeFont class. It parses the actual font file and reads information such as the character-to-glyph mappings (cmap), the glyph metrics, etc. Then there is the convertToBytes() method in the FontDetails class, which actually converts each character into its glyph code and writes it to the PDF.
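To show why a purely per-character conversion cannot produce ligatures, here is a simplified sketch of that kind of lookup. It is not the actual iText implementation, which is more involved. The class name, the method name and the cmap entries are all hypothetical. Each character is looked up in the cmap on its own and written as a two-byte glyph code, which is how glyph codes are laid out with an Identity-H encoding.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

class GlyphEncodingSketch {

    /** Looks up every code point in the cmap and emits its glyph id as two big-endian bytes. */
    static byte[] toGlyphBytes(String text, Map<Integer, Integer> cmap) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            Integer glyph = cmap.get(codePoint);
            int glyphId = (glyph == null) ? 0 : glyph; // glyph 0 is the "missing glyph"
            out.write(glyphId >> 8);   // high byte
            out.write(glyphId & 0xFF); // low byte
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up glyph ids, for illustration only.
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x0995, 0x0123); // ka
        cmap.put(0x09CD, 0x0124); // hasanta
        cmap.put(0x09B7, 0x0125); // ssa
        byte[] bytes = toGlyphBytes("\u0995\u09CD\u09B7", cmap);
        // Three characters in, three separate glyph codes out: no ligature is formed.
        System.out.println(bytes.length / 2 + " glyph codes written");
    }
}
```

Since every character is handled in isolation, the three characters come out as three glyphs. A substitution pass has to run before this step for the ligature glyph to be used.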
The following is the test harness for testing out my fix.
import java.io.FileOutputStream;
import java.io.IOException;

import org.junit.Test;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfWriter;

/**
 * Inspired from http://itextpdf.com/examples/iia.php?id=158
 *
 * @author paawak
 */
public class BanglaPdfGenerationTest {

    /**
     * The unicode of this is given below:
     *
     * u0986u09aeu09bf u0995u09cbu09a8 u09aau09a5u09c7
     * u0995u09cdu09b7u09c0u09b0u09c7u09b0 u09b7u09a8u09cdu09a1
     * u09aau09c1u09a4u09c1u09b2 u09b0u09c1u09aau09cb
     * u0997u0999u09cdu0997u09be u098bu09b7u09bf
     *
     */
    private static final String BANGLA_TEXT = "আমি কোন পথে ক্ষীরের লক্ষ্মী ষন্ড পুতুল রুপো গঙ্গা ঋষি";

    public void createPdf(String filename) throws DocumentException, IOException {
        // step 1
        Document document = new Document();
        // step 2
        PdfWriter.getInstance(document, new FileOutputStream(filename));
        // step 3
        document.open();
        // step 4
        Paragraph paragraph = new Paragraph();
        paragraph.add(new Phrase(BANGLA_TEXT, new Font(BaseFont.createFont("/usr/share/fonts/lohit-bengali/Lohit-Bengali.ttf", BaseFont.IDENTITY_H, true))));
        document.add(paragraph);
        // step 5
        document.close();
    }

    @Test
    public void testGenerate() throws IOException, DocumentException {
        String fileName = System.getProperty("user.home") + "/a.pdf";
        new BanglaPdfGenerationTest().createPdf(fileName);
    }

}
The changes were made against itextpdf-5.4.0-SNAPSHOT, revision 5638. Please note that the jar below will not work in most cases, as it is only half-baked.
As you may notice, the i-kar, e-kar and o-kar are still not displayed in their proper positions. I am convinced that this is because we need to read the positioning data from GPOS, the Glyph Positioning Table. That is my next task. Stay tuned!
My code is commented out in the latest iText, as it seems to be interfering with some of their core functionalities.
Download the iText source from Github:
https://github.com/itext/itextpdf
After getting the source, just uncomment the line below in TrueTypeFontUnicode.java:
@Override
void process(byte ttfAfm[], boolean preload) throws DocumentException, IOException {
    super.process(ttfAfm, preload);
    //the below line must be uncommented for Indic scripts to work
    readGsubTable();
}
Building it with Maven should be pretty straightforward. Cheers!