Working with data on a PDF

I have been working on a proposal regarding child marriage in Nepal. My task was to identify the prevalence of child marriage among people who are already married. In plain English, I had to determine what percentage of the married population was married before the age of 19. We chose 19 because there was census data available and the laws of Nepal state that you can marry by age 18 with parental consent or age 20 without parental consent. I also found a factoid stating that UNICEF chooses age 19 as well.

##Finding the Data##
I knew that the 2011 census national report had information regarding age at first marriage among the married population (Table 19). This report is published as a PDF file and available on the Central Bureau of Statistics website. Specifically, this information is available in Table 19, it’s published in English so I should be able to copy and paste it into Excel. Not so fast.

What I saw on the page:
Table 19 Screenshot of PDF

What I saw when I pasted the data into Excel:
Table 19 Pasted in Excel (Screenshot)
The totals came through fine, separated by a space, but the gender breakout (rows 2 and 3) showed up as blocks.

##Identifying the Problem##
I could clearly tell that there were characters there, but the computer couldn’t read them. I saved the pasted file to a text file and saved it as a .csv. I then used the text import wizard in Excel to see if the data was encoded in UTF-8. Note that I’m working with Excel 2010 that was purchased in the US which encodes documents in ANSI. (I had to Google this to find out.) I tried the text import wizard and chose UTF-8 to see if it would import and I still got the blocks.

The next step was to identify the document properties of the PDF (File>Properties>Fonts Tab) to see the encoding types of the file. There it was “Calibri: Encoding:Custom”
PDF Font Properties

I don’t know how to determine custom encoding types. I’m sure machines could figure it out, but I didn’t have time to figure it out. I had two choices:
1) Type the entire data table
2) See if I copy and paste would work

##The Copy and Paste Solution##
The computer knows that there is a character there, but rendering it in a human readable format is another issue. I copied all of Table 19 into my favorite text editor Notepad++ so I could work on the raw text. I copied the first block that should be the letter ‘M’ for the word male. I then did a ctrl+h to do a find and replace. I pasted the copied block into the find box and typed the letter ‘M’ into the replace-with box. I replaced all and it worked. Now I had the letter M converted. I did this for the rest of the word male and the entire word female. Things were coming around. The next step was to identify the numbers. I found the number 1 on the PDF page, found the corresponding encoded block in Notepad++ and did a replace all. I did this for all numbers 1-9 and the comma that separated them. After cleaning up the extra page numbers and table headings I was able to structure the data in a way that was more easy to manipulate and get it into Excel for analysis.

##Paying It Forward##
I believe in open source software and access to useable data. Now that I had done this work, I decided it would be appropriate to share it with others and sent it to Pavitra and Sujata at OpenNepal.net. I have used their VDC and district datasets for other projects and find their service incredibly valuable.


Contact me if you'd like to talk about this post.

 