I recently attended a datathon (a hackathon focused on data science) at PES University, Bangalore. There, my team was given the task of extracting data from the National Achievement Survey (NAS) 2017, conducted by NCERT.
The NAS collects data about CBSE schools across the states and districts of India, covering student achievement and overall reports. The data is available as PDF files.
Our task was to extract the data from the PDFs and tabulate it.
pdftotext is a Linux command-line utility that converts PDF files to text. With the -layout option, the original layout of the data is mostly preserved.
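As a rough sketch of how one such call could be wrapped in Python (this is not necessarily what the script in the repo does), the snippet below builds the pdftotext command with -layout and runs it for a single file. It assumes pdftotext is installed and on the PATH; the function names are hypothetical and chosen only for illustration.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for one pdftotext call with -layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path):
    """Convert one PDF to a .txt file beside it and return the new path."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path
```

Passing the command as a list rather than a single shell string means file names containing spaces, brackets or "&" (all of which appear in the NAS file names) need no extra quoting.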
I wrote a Python script (pdf_convert.py) that converts the PDF files to text files one after another.
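The sequential conversion could look roughly like the sketch below. It is an illustration, not the contents of pdf_convert.py: the names find_pdfs and convert_all are made up, and the conversion step is passed in as a function so that any single-file converter (for example, a wrapper around pdftotext) can be plugged in.

```python
from pathlib import Path


def find_pdfs(root):
    """Return every PDF under root, sorted so the run order is repeatable."""
    return sorted(Path(root).rglob("*.pdf"))


def convert_all(root, convert):
    """Call convert(pdf_path) for each PDF in turn and collect the results."""
    results = []
    for pdf_path in find_pdfs(root):
        # One file at a time; a failure in convert() stops the run at that file.
        results.append(convert(pdf_path))
    return results
```

Sorting the paths keeps the order stable between runs, which makes it easier to resume or compare outputs.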
Next, I wrote a script to convert the text files to CSV data, so that each text file becomes one record (row) in the CSV file.
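To illustrate the "one text file, one row" idea, here is a minimal sketch. The real report cards have their own layout, which is not shown here, so the parsing step is purely an assumption: extract_fields is a hypothetical helper that picks up simple "label: value" lines, and it would need to be replaced with logic that matches the actual pdftotext output.

```python
import csv
from pathlib import Path


def extract_fields(text):
    """Collect 'label: value' pairs from one text file (assumed format)."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() and value.strip():
            fields[label.strip()] = value.strip()
    return fields


def text_files_to_csv(txt_dir, csv_path):
    """Write one CSV row per .txt file found directly inside txt_dir."""
    rows = []
    for txt_path in sorted(Path(txt_dir).glob("*.txt")):
        row = {"source_file": txt_path.name}
        row.update(extract_fields(txt_path.read_text(encoding="utf-8")))
        rows.append(row)
    # Use the union of all labels as the header, in first-seen order.
    header = []
    for row in rows:
        for key in row:
            if key not in header:
                header.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # restval fills in an empty cell when a file lacks a given label.
        writer = csv.DictWriter(handle, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the source file name as the first column makes it possible to trace any row back to the PDF it came from.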
Here is a snapshot of the directory structure of the PDF files that we received.
.
├── Andaman & Nicobar Islands
│ ├── Andaman
│ │ ├── Andamans Class - 3 (EVS) Report Card.pdf
│ │ ├── Andamans Class - 3 (Language) Report Card.pdf
│ │ ├── Andamans Class - 3 (Mathematics) Report Card.pdf
│ │ ├── Andamans Class - 5 (EVS) Report Card.pdf
│ │ ├── Andamans Class - 5 (Language) Report Card.pdf
│ │ ├── Andamans Class - 5 (Mathematics) Report Card.pdf
│ │ ├── Andamans Class - 8 (Language) Report Card.pdf
│ │ ├── Andamans Class - 8 (Mathematics) Report Card.pdf
│ │ ├── Andamans Class - 8 (Science) Report Card.pdf
│ │ ├── Andamans Class - 8 (SST) Report Card.pdf
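The folder and file names in this snapshot already carry useful columns for the table. The sketch below pulls them out of a relative path, assuming the first folder level is the state and the second is the district, and that every file follows the "<name> Class - N (Subject) Report Card.pdf" pattern seen above. The function name describe_report is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Matches e.g. "Andamans Class - 8 (SST) Report Card.pdf".
REPORT_NAME = re.compile(
    r"Class - (?P<grade>\d+) \((?P<subject>[^)]+)\) Report Card\.pdf$"
)


def describe_report(relative_path):
    """Split 'State/District/<file>' into state, district, class and subject."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 3:
        return None  # not nested as state/district/file
    match = REPORT_NAME.search(parts[-1])
    if match is None:
        return None  # file name does not follow the assumed pattern
    return {
        "state": parts[-3],
        "district": parts[-2],
        "class": int(match.group("grade")),
        "subject": match.group("subject"),
    }
```

Paths that do not fit the assumed layout return None instead of raising, so odd files can be logged and checked by hand.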
You can find my code in my GitHub repo: https://github.com/divyaksh-shukla/NAS_pdf_extract
For now, the code is only set up to run on the PDF files in a single folder, but I plan to eventually make it modular so that everybody can use it.