
NAS (National Achievement Survey) data extraction

I recently attended a datathon (a hackathon focused on data science) at PES University, Bangalore. There, my team was given the task of extracting data from the National Achievement Survey (NAS) 2017, conducted by NCERT.
NAS collects data on student achievement from CBSE schools across the states and districts of India and reports the results per district. These reports are available as PDF files.
Our task was to extract the data from the PDFs and tabulate it.

pdftotext is a Linux command-line utility that converts PDF files to plain text. With the -layout option, the original layout of the data is mostly preserved.
I wrote a Python script (pdf_convert.py) that converts the PDF files to text files one after another.
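
The conversion step can be sketched roughly as follows. This is not the actual pdf_convert.py from the repo, just a minimal illustration that assumes pdftotext is installed and on the PATH. The function names (build_command, convert_folder) and the runner parameter are made up for this example.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path):
    """Return the pdftotext call that keeps the page layout (-layout)."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(pdf_dir, txt_dir, runner=subprocess.run):
    """Convert every PDF in pdf_dir, one after another, into txt_dir.

    runner is a hypothetical hook so the loop can be tried out
    without pdftotext; by default it is subprocess.run.
    """
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Same file name as the PDF, with a .txt extension.
        txt_path = txt_dir / (pdf_path.stem + ".txt")
        # check=True raises an error if pdftotext fails on a file.
        runner(build_command(pdf_path, txt_path), check=True)
        converted.append(txt_path)
    return converted
```

Calling convert_folder("Andaman", "Andaman_txt") would then write one text file per report card. Like my script, this handles a single folder at a time.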

Next, I wrote a script to convert the text files to CSV. Each text file became one record (row) in the CSV file.
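
Here is a rough idea of how one text file can become one CSV row. The real report cards have their own layout, so the parsing rule here is only an assumption: a line is treated as a label and a value when the two are separated by at least two spaces, which is how columns often come out with -layout. The function names and the sample labels are made up for illustration.

```python
import csv
import re


def parse_report(text):
    """Turn the text of one report into a dict of label -> value.

    Assumed layout: "Label    value" with 2+ spaces in between.
    Lines that do not split into exactly two parts are skipped.
    """
    record = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) == 2:
            label, value = parts
            record[label] = value
    return record


def records_to_csv(records, out_file):
    """Write one CSV row per record; the header is the union of all labels."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in labels that are missing from a given report.
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
```

To use it, read each text file, pass its contents to parse_report, collect the dicts in a list, and hand that list to records_to_csv together with a file opened with newline="".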

Here is a snapshot of the directory structure of the PDF files that we got.
.
├── Andaman & Nicobar Islands
│   ├── Andaman
│   │   ├── Andamans Class - 3 (EVS)  Report Card.pdf
│   │   ├── Andamans Class - 3 (Language)  Report Card.pdf
│   │   ├── Andamans Class - 3 (Mathematics)  Report Card.pdf
│   │   ├── Andamans Class - 5 (EVS)  Report Card.pdf
│   │   ├── Andamans Class - 5 (Language)  Report Card.pdf
│   │   ├── Andamans Class - 5 (Mathematics)  Report Card.pdf
│   │   ├── Andamans Class - 8 (Language)  Report Card.pdf
│   │   ├── Andamans Class - 8 (Mathematics)  Report Card.pdf
│   │   ├── Andamans Class - 8 (Science)  Report Card.pdf
│   │   ├── Andamans Class - 8 (SST)  Report Card.pdf
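
The folder and file names above already carry the state, district, class and subject, so they can be picked up for the CSV as well. Below is a small sketch that assumes every file follows the "<district> Class - <n> (<subject>)  Report Card.pdf" pattern seen in this listing. The regular expression and the describe_report function are my own illustration and are not taken from the repo.

```python
import re
from pathlib import Path

# Pattern inferred from the file names in the listing above (an assumption).
REPORT_NAME = re.compile(
    r"^(?P<district>.+?) Class - (?P<grade>\d+) "
    r"\((?P<subject>[^)]+)\)\s+Report Card\.pdf$"
)


def describe_report(pdf_path):
    """Return state, district, class and subject for one report card path.

    Returns None when the file name does not follow the assumed pattern.
    """
    pdf_path = Path(pdf_path)
    match = REPORT_NAME.match(pdf_path.name)
    if match is None:
        return None
    return {
        "state": pdf_path.parent.parent.name,     # e.g. the top-level folder
        "district_folder": pdf_path.parent.name,  # folder holding the PDFs
        "district": match["district"],            # name used in the file name
        "class": int(match["grade"]),
        "subject": match["subject"],
    }
```

For the last file in the listing this gives the state "Andaman & Nicobar Islands", class 8 and subject "SST".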

You can find my code in my GitHub repo: https://github.com/divyaksh-shukla/NAS_pdf_extract

For now, this code only works on the PDF files in a single folder, but I eventually plan to make it modular so that anybody can use it.
