NAS (National Achievement Survey) data extraction

May 08, 2018

I recently attended a datathon (a hackathon focused on data science) at PES University, Bangalore. There, my team was given the task of extracting data from the National Achievement Survey 2017, conducted by NCERT.
NAS collects data about CBSE schools across the states and districts of India, covering student achievement and overall reports. This data is published as PDF files.
We were asked to extract the data from the PDFs and tabulate it.

pdftotext is a Linux utility that converts PDF to text. With the -layout option, the original layout of the data is mostly preserved.
I wrote a Python script (pdf_convert.py) to convert the PDFs to text files one after another.
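
To give an idea of what this step looks like, here is a minimal sketch of a sequential conversion loop. It is not the actual pdf_convert.py from my repo: the function names build_command and convert_folder and the output folder argument are made up for illustration, and it assumes pdftotext is installed and on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: Path, out_dir: Path) -> List[str]:
    """Build the pdftotext call for one PDF, keeping the page layout."""
    # Hypothetical naming scheme: same file name, .txt extension.
    txt_path = out_dir / (pdf_path.stem + ".txt")
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(pdf_dir: Path, out_dir: Path) -> List[Path]:
    """Convert every PDF directly inside pdf_dir, one after another."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        cmd = build_command(pdf_path, out_dir)
        # check=True raises an error if pdftotext fails on a file.
        subprocess.run(cmd, check=True)
        written.append(Path(cmd[-1]))
    return written
```

Calling convert_folder(Path("Andaman"), Path("text")) would write one .txt file per report card into the text folder.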

Next, I wrote a script to convert the text files to CSV, so that each text file became one record (row) in the CSV file.
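
The one-row-per-file idea can be sketched roughly as below. This is only an illustration, not the script from my repo: the column names, the regular expression and the function names are assumptions, and it only fills in what can be read from the file name (district, class, subject). Values taken from the body of each report would be added as extra columns, depending on the report layout.

```python
import csv
import re
from pathlib import Path

# Assumed file name shape: "<district> Class - <n> (<subject>) Report Card"
FILENAME_PATTERN = re.compile(
    r"^(?P<district>.+?) Class - (?P<grade>\d+) \((?P<subject>[^)]+)\) Report Card$"
)
# Hypothetical column names for the output CSV.
FIELDS = ["source_file", "district", "grade", "subject"]


def row_from_file(txt_path):
    """Build one CSV record from a text file's name."""
    row = {"source_file": txt_path.name, "district": "", "grade": "", "subject": ""}
    match = FILENAME_PATTERN.match(txt_path.stem)
    if match:
        row.update(match.groupdict())
    return row


def text_files_to_csv(txt_dir, csv_path):
    """Write one row per .txt file in txt_dir and return the row count."""
    rows = [row_from_file(p) for p in sorted(Path(txt_dir).glob("*.txt"))]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Files whose names do not match the assumed pattern still get a row, with the district, grade and subject columns left empty.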

Here is a snapshot of the directory structure of the PDF files that we received.
.
├── Andaman & Nicobar Islands
│ ├── Andaman
│ │ ├── Andamans Class - 3 (EVS) Report Card.pdf
│ │ ├── Andamans Class - 3 (Language) Report Card.pdf
│ │ ├── Andamans Class - 3 (Mathematics) Report Card.pdf
│ │ ├── Andamans Class - 5 (EVS) Report Card.pdf
│ │ ├── Andamans Class - 5 (Language) Report Card.pdf
│ │ ├── Andamans Class - 5 (Mathematics) Report Card.pdf
│ │ ├── Andamans Class - 8 (Language) Report Card.pdf
│ │ ├── Andamans Class - 8 (Mathematics) Report Card.pdf
│ │ ├── Andamans Class - 8 (Science) Report Card.pdf
│ │ ├── Andamans Class - 8 (SST) Report Card.pdf

You can find my code in my GitHub repo: https://github.com/divyaksh-shukla/NAS_pdf_extract

For now, this code only works on the PDF files in a single folder, but I eventually plan to make it modular so that anybody can use it.


Divyaksh Shukla Blog
