Extracting Text and Formatting From PDF Files in Pure Go
Extracting textual content from PDF files can be very useful for a number of reasons. Maybe your company has to parse through years of historical financial data stored in PDF files in a cold backup. Manually parsing these PDF documents would take ages and countless man hours to get it done and even then, the chances of making a mistake while jolting down values is quite high.
A better alternative is to use a library like UniPDF, which has the extractor package that can get the same job done in minutes instead of days, while maintaining high precision. The package not only has the ability to parse through textual content inside PDFs, but can also detect tables and extract them into CSV files. This is efficient if your files have a lot of tables and numerical values that you would rather handle in CSV files.
UniPDF is built using pure Go, what this means is that you can create and compile agents for multiple platforms to perform text extraction on a number of platforms. Go is quite flexible in this sense.
Financial statements are not the only use case of the extractor package. Integrating UniPDF into your hiring process will allow you to weed out inappropriate CVs and scan for important information that you may be looking for in a potential employee. You can also use UniPDF to find and redact personally identifiable information and other sensitive information such as credit card details or detect violent language.
This way, UniPDF can act as the initial screening system for your organisation and save time while hiring new people. You can go one step further and use UniOffice with UniPDF to also cover text search in word documents, excel sheets, and presentation files (docx, xlsx, pptx). This way, you will be able to cover a wide range of file formats and make your screening process very robust.
CV Screening Tool
Let’s try out the second use case highlighted above, through the UniDoc playground. The playground is a brilliant tool that allows you to use all of the functionalities of our libraries without having to install anything on your system. You just need to visit the playground and can try out the library within seconds.
In the following playground, we are in the process of hiring a front-end developer and want to give each application a rating depending upon how closely they meet our requirements. At UniDoc, we believe in hiring the best, which is why we’ll be deciding between Tony Stark and Bruce Wayne.
The rating is out of 5, where 5 means that the applicant is a perfect fit for the job. We have also set some hard conditions here such as we are only looking for candidates in the United States. So, if their address does not exist in the US then they are not included in the results.
Here’s the console output from the example:
-------------------------------------
File: Resume-Tony_Stark.pdf
Candidate is Suitable, with score: 5/5
-------------------------------------
File: Resume_Bruce_Wayne.pdf
Candidate is Suitable, with score: 5/5
Now this was a very basic example of how you can use the extractor package to get the text and perform some simple comparison. If you want to go one step further, you can take a look at the example below.
In this advanced version of the candidate screening tool, we are not just determining whether the filters exist but also finding their locations in the PDF file. We achieve this by using the getting the text marks from the PDF and creating the appropriate search algorithm.
Here’s the console output produced by the example:
-------------------------------------
Prescreening results:
-------------------------------------
Resume-Tony_Stark.pdf
Detected Title: Tony Stark
Email Address: [email protected]
Candidate's Score: 4 / 5
States: State: CA
Skills: JavaScript, CSS, C#, PHP,
-------------------------------------
Prescreening results:
-------------------------------------
Resume-Bruce_Wayne.pdf
Detected Title: Bruce Wayne
Email Address: [email protected]
Candidate's Score: 5 / 5
States: State: MA
Skills: JavaScript, JS, JS, jQuery, Angular, AngularJS, Python, C++, C#, ASP.NET, TDD, Agile, Git,
The result of this playground example can be viewed by opening the playground or by viewing the image below. Bounding boxes have been used to highlight the relevant information. Different colours indicate different kinds of information, where green indicates front-end skills, and yellow is used to highlight the state.
Regex has been used to find the email address from the text and for the name, we assume that it has the largest font size in the resume.
Conclusion
The tasks performed in the playground example shown above would have taken a person a few minutes to complete. But, our code gets the job done in seconds. Now, if the number of resumes were in the hundreds then the time would’ve increased exponentially for a person to perform this task. However, our code would’ve simply taken a few more seconds.
The UniPDF library is built using GoLang, it’s fast and is easy to learn. We are constantly improving our libraries to meet and if you want to add a new feature then open an issue on UniPDF’s repository. You can also check out tons of examples that will help you get started using UniPDF. We are also building a knowledge base and you can check out the KB article for this topic here.