Question: iTextSharp: reading radio button and check box states from a non-form PDF


I have a PDF document (created by a 3rd party using RealObjects PDFreactor), which is not a form. I'm trying to extract information from that PDF using iTextSharp. I'm able to extract all the plain text (using SimpleTextExtractionStrategy), but some information is represented with radio buttons, and that does not come across in the extracted plain text. I'm a complete beginner with iTextSharp, so I might be overlooking something very simple. PdfReader.AcroForm returns null, and PdfReader.AcroFields.Fields has 0 keys. How can I figure out the state of radio buttons and checkboxes throughout the document? The documents all have the same structure, so I don't really need the radio buttons and checkboxes to be labelled; just having a list of items and knowing whether each one is checked would be sufficient.

I've confirmed that the radio buttons are not just images using this approach.

I made a feeble attempt at finding all the BTN objects on a page by modifying the code to extract images, but I apparently messed up because it doesn't return anything on a page that contains radio buttons:

private List<PdfObject> GetBTNFromPdfDict(PdfDictionary dict, PdfReader doc)
{
    List<PdfObject> objects = new List<PdfObject>();

    foreach (PdfName name in dict.Keys)
    {
        PdfObject obj = dict.Get(name);
        PdfDictionary tg = PdfReader.GetPdfObject(obj) as PdfDictionary;
        if (null != tg)
        {
            PdfName subtype = (PdfName)(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)));

            if (obj.IsIndirect())
            {
                if (PdfName.BTN.Equals(subtype))
                {
                    int xrefIdx = ((PRIndirectReference)obj).Number;
                    PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
                    objects.Add(pdfObj);
                }
                else if (PdfName.FORM.Equals(subtype) || PdfName.GROUP.Equals(subtype))
                {
                    objects.AddRange(GetBTNFromPdfDict(tg, doc));
                }
            }
            else objects.AddRange(GetBTNFromPdfDict(tg, doc));                    
        }
    }

    return objects;
}

I also looked at the content for the page using PdfContentReaderTool; the output didn't include any BTN, so I'm not sure what is going on. Here's the output (I replaced all the lines of text with [TEXT]). Just looking at the content stream section, it seems trivial to extract all strings (everything between parentheses followed by Tj), but I can't figure out where the radio buttons and checkboxes are. (I had to abridge the output because the question was too long.)

==============Page 2====================
- - - - - Dictionary - - - - - -
(/Type=/Page, /TrimBox=System.Collections.Generic.List`1[iTextSharp.text.pdf.PdfObject], /Contents=Stream, /Parent=Dictionary of type: /Pages, /Group=Dictionary of type: /Group, /BleedBox=System.Collections.Generic.List`1[iTextSharp.text.pdf.PdfObject], /Resources=Dictionary, /MediaBox=System.Collections.Generic.List`1[iTextSharp.text.pdf.PdfObject])
    Subdictionary /Parent = (/Count=10, /Type=/Pages, /Parent=Dictionary of type: /Pages, /Kids=System.Collections.Generic.List`1[iTextSharp.text.pdf.PdfObject])
        Subdictionary /Parent = (/Count=39, /Type=/Pages, /ITXT=2.1.6, /Kids=System.Collections.Generic.List`1[iTextSharp.text.pdf.PdfObject])
    Subdictionary /Group = (/Type=/Group, /S=/Transparency, /CS=/DeviceRGB)
    Subdictionary /Resources = (/ColorSpace=Dictionary, /ProcSet=System.Collections.Generic.List`1[iTextSharp.text.pdf.PdfObject], /Font=Dictionary)
        Subdictionary /ColorSpace = (/CS=/DeviceRGB)
        Subdictionary /Font = (/F1=Dictionary of type: /Font, /F3=Dictionary of type: /Font, /F2=Dictionary of type: /Font, /F4=Dictionary of type: /Font)
            Subdictionary /F1 = (/Type=/Font, /BaseFont=/Helvetica, /Subtype=/Type1, /Encoding=/WinAnsiEncoding)
            Subdictionary /F3 = (/Type=/Font, /BaseFont=/Times-Bold, /Subtype=/Type1, /Encoding=/WinAnsiEncoding)
            Subdictionary /F2 = (/Type=/Font, /BaseFont=/Times-Roman, /Subtype=/Type1, /Encoding=/WinAnsiEncoding)
            Subdictionary /F4 = (/Type=/Font, /BaseFont=/Times-Italic, /Subtype=/Type1, /Encoding=/WinAnsiEncoding)
- - - - - XObject Summary - - - - - -
No XObjects
- - - - - Content Stream - - - - - -
q
BT
36 805.89 Td
ET
Q
q
0 841.89 m
0 0 l
595.29 0 l
595.29 841.89 l
h
W
n
1 w
2 J
0 j
10 M
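
To illustrate the "trivial" text part mentioned above, here is a minimal sketch of pulling the strings that precede a Tj operator out of a content stream that has already been decoded to text. It uses only the .NET regex classes, not iTextSharp; the class and method names (TjStringSketch, ExtractTjStrings) and the sample stream are made up for illustration. It assumes simple literal strings only: TJ arrays, hex strings, octal escapes, and unescaped nested parentheses are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TjStringSketch
{
    // Matches a literal string "( ... )" followed by the Tj operator.
    // Backslash-escaped characters such as \( and \) may appear inside the string.
    private static readonly Regex TjPattern =
        new Regex(@"\((?<text>(?:\\.|[^\\()])*)\)\s*Tj\b", RegexOptions.Singleline);

    // Hypothetical helper: returns the string operands of every Tj found.
    public static List<string> ExtractTjStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in TjPattern.Matches(contentStream))
        {
            string raw = m.Groups["text"].Value;
            // Undo only the simple escapes \( \) and \\ ; other escapes are left as-is.
            result.Add(Regex.Replace(raw, @"\\([()\\])", "$1"));
        }
        return result;
    }

    public static void Main()
    {
        // Invented sample, shaped like the content stream shown above.
        string sample = "BT\n36 805.89 Td\n(Option A) Tj\n(Yes \\(default\\)) Tj\nET";
        foreach (string s in ExtractTjStrings(sample))
        {
            Console.WriteLine(s);
        }
    }
}
```

This only covers text. It says nothing about the radio buttons and checkboxes, which is the part I'm still stuck on.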
asked Sep 13, 2013 in Java Interview Questions by anonymous
edited Sep 12, 2013