UiPath, one of the big providers of robotic process automation (RPA) software, positions itself in an interesting way. Unlike the other players on the market, it provides a free and fully featured community edition of its product for anybody to test and develop with. The tool automates any application and includes web scraping and screen scraping capabilities for both desktop and web applications. The platform also has a lively community forum featuring jobs, automation contests and knowledge-sharing between UiPath users: www.forum.uipath.com.
The PDF data extraction and automation tool offers several activities and methods to navigate, identify and use PDF data freely, whether in native text format or as scanned images. The full-featured IDE has a graphical interface with straightforward drag-and-drop functionality and a built-in library of predefined ‘Activities’.
To start things off, you need all the actions and dependencies required for working with PDF files. You can install the ‘UiPath PDF Activities’ package from the Package Manager. A simple search for ‘PDF’ inside the Package Manager will get you there.
1. Extract larger pieces of text or entire documents
These three techniques can be used to extract larger pieces of text or entire documents.
Read PDF Text activity
For this action, the PDF file doesn’t need to be open. You simply select the file and the action outputs a text variable with the contents of the file. You can save the result as a text file or show it in a message box, and you can use other string operations to modify or extract information from the generated text. Look for the range parameter: it defines what to actually read. It can be set to ‘All pages’, a specific page, or a range of pages.
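The string operations mentioned above happen after the action has run, on the text variable it produces. Here is a minimal Python sketch of that post-processing step. It assumes the extracted text contains a line such as "Total: 1,234.50"; the sample text, the label and the function name extract_total are made up for illustration and are not part of UiPath.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical sample of what a text-extraction step might output.
SAMPLE_TEXT = """Invoice No: 1042
Date: 2017-03-01
Subtotal: 1,100.00
Total: 1,234.50
"""


def extract_total(text: str, label: str = "Total") -> Optional[Decimal]:
    """Return the number that follows `label:` at the start of a line, or None."""
    # ^ and $ match per line (MULTILINE), so "Subtotal:" is not mistaken for "Total:".
    pattern = rf"^{re.escape(label)}\s*:\s*([\d,]+(?:\.\d+)?)\s*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        return None
    # Drop thousands separators before converting to a decimal number.
    return Decimal(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(extract_total(SAMPLE_TEXT))
```

Anchoring the pattern to the start of a line is a deliberate choice: without it, a label like "Subtotal" would also match a search for "Total".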
There’s a specific action for reading images inside PDF files called ‘Read PDF with OCR’. It uses optical character recognition to scan the images inside the PDF and output all the text as a variable. Unlike its non-OCR sibling, it requires an OCR engine. You can find the available ones and add them by searching for ‘OCR’ in the ‘Activities’ pane. The engine itself contains OCR parameters which are common throughout the app (‘allowed characters’, ‘denied characters’, ‘language’, ‘scale’ and so on), but different engines may have different parameters. If you need to go deeper into how they work, there’s an advanced ‘UI interactions’ video tutorial available.
If background operation is important to you, note that both the ‘Read PDF Text’ and the ‘Read PDF with OCR’ actions are self-contained; they don’t need other applications open, so they can run in the background. However, the PDF file needs to be open when performing OCR, as it only works with on-screen images. This means the user must open the PDF file and then launch the UiPath PDF extraction robot when doing OCR.
The Screen Scraper Wizard
The second method for grabbing larger and smaller blocks of text is the screen scraper wizard found in the ‘Main’ toolbar. The wizard is useful for comparing and choosing a scraping method, and it also generates the actions for you. Hover the mouse over the text elements that you need to scrape, and UiPath will identify these elements inside the selection you just made and show a preview window of them.
The technology behind UiPath screen scraping senses the UI controls like a human instead of blindly using fixed screen coordinates. It extracts text from running Windows apps, even if they are hidden or covered by another app.
UiPath generally detects the best method for your situation, but you can change the scraping method and the preview will adapt accordingly.
2. Extract specific elements
For PDFs in the most common format, native text (where the elements are directly accessible to UiPath), there are a few options for getting the data:
Get Text action
This action is also available in the integrated ‘Recorder’. Simply point to the element of your choice and UiPath will generate the ‘Get text’ action and its output variable, displaying it in a message box.
If you want to extract the total value from a series of similar PDF files instead of just a single one, you’ll need to tweak the Selector a bit. The ‘Get text’ Action – like most UI interactions – uses a Selector to identify the correct element and get its value.
You can do it automatically with the help of the ‘Attach to Live Element’ feature. Simply point to another similar element that should also match the current Selector and UiPath will try to fix the Selector for you.
In case it doesn’t turn out the way you want, you can also modify it manually. Before you do, it is advisable to get familiar with UiPath Selectors and learn how to edit and debug them. Selectors play a central role in UI automation and knowing your way around them will help in many other ways.
To edit it manually, open the Selector again, only this time in the ‘UiExplorer’ feature to have a better view. After editing the key UI elements, you simply copy the new Selector and paste it over the old one. Now it works for both files.
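To see why an edited Selector can work for both files, it helps to know that UiPath Selectors accept wildcards in attribute values: `*` stands for any run of characters and `?` for a single character. The Python sketch below imitates that kind of matching on a window title using the standard fnmatch module. The file names and window titles are hypothetical, and this is only an analogy, not how UiPath evaluates Selectors internally.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical window titles for two similar invoices opened in a PDF viewer.
TITLES = [
    "invoice_2017_01.pdf - Adobe Acrobat Reader DC",
    "invoice_2017_02.pdf - Adobe Acrobat Reader DC",
]

# Title attribute as it might be recorded for the first file: it matches only that file.
RECORDED_TITLE = "invoice_2017_01.pdf - Adobe Acrobat Reader DC"

# The same attribute after editing: the part that varies is replaced by a wildcard.
GENERALIZED_TITLE = "invoice_*.pdf - Adobe Acrobat Reader DC"


def matching_titles(pattern: str, titles: List[str]) -> List[str]:
    """Return the titles that match a wildcard pattern (`*` and `?`)."""
    return [title for title in titles if fnmatchcase(title, pattern)]


if __name__ == "__main__":
    print(matching_titles(RECORDED_TITLE, TITLES))     # only the first title
    print(matching_titles(GENERALIZED_TITLE, TITLES))  # both titles
```

The recorded value is too specific because it contains the file name; replacing only the varying part keeps the rest of the match strict.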
There is another method you can use to achieve the same result. In order to extract a fluctuating value from a series of PDF files you can also explore the ‘Anchor Base’ Activity. It is pretty flexible and allows you to use various actions inside it, like replacing the ‘Find Element’ action with the ‘Find Image’ action. Also you don’t have to deal with Selectors as much anymore. And since PDF files look the same on all systems, you can use ‘Find Image’ without its usual drawbacks. But don’t forget to set the zoom of the document to its actual size before indicating the image to make sure you get a reliable result. This method also handles structural changes to the document, as long as the image and data are present and in the same relationship.
Note that these last two methods require the PDF document to be open, and the data you are trying to interact with must be visible; otherwise the automation will most probably fail. Make sure you take that into account when building the final automation.
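The idea behind ‘Anchor Base’ (locate a stable label first, then take the value that sits next to it) can be sketched outside UiPath as well. The Python example below is a simplified illustration under stated assumptions: it works on a made-up list of words with page coordinates, whereas the real activity works on UI elements or images. The Word class, the value_right_of function and the sample layouts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # vertical position of the word's line


def value_right_of(anchor_text: str, words: List[Word],
                   y_tolerance: float = 5.0) -> Optional[str]:
    """Return the nearest word to the right of the anchor on the same line."""
    anchor = next((w for w in words if w.text == anchor_text), None)
    if anchor is None:
        return None
    # Keep only words to the right of the anchor and roughly on the same line.
    candidates = [
        w for w in words
        if w is not anchor
        and w.x > anchor.x
        and abs(w.y - anchor.y) <= y_tolerance
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.x - anchor.x).text


# Two hypothetical layouts: the label moves, but the value stays to its right.
LAYOUT_A = [Word("Subtotal:", 300, 500), Word("90.00", 380, 500),
            Word("Total:", 300, 520), Word("100.00", 380, 520)]
LAYOUT_B = [Word("Total:", 50, 700), Word("250.75", 130, 701),
            Word("Page", 50, 780), Word("1", 90, 780)]

if __name__ == "__main__":
    print(value_right_of("Total:", LAYOUT_A))
    print(value_right_of("Total:", LAYOUT_B))
```

Because the lookup depends only on the relationship between label and value, it tolerates the label moving on the page, which mirrors the point made above about structural changes to the document.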
Automate any process with UiPath Studio
The video below explains how to extract data from a single PDF file. The approach works for general text, for whole PDF documents including images, and for specific text from a PDF file.
If you want to accomplish batch extraction from multiple files, you can do so in the UiPath Studio workflow designer, where you model an automated process by assembling its steps into a visual flow-chart diagram. One activity can read one PDF at a time, but a workflow can read 1000 .pdf files in a few minutes. There are some new features for the Studio, like a start screen that allows you to begin from best-practice templates, making it easier to create automations.
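The batch pattern described here (one single-file step wrapped in a loop over many files) looks roughly like this in Python. This is a sketch, not a UiPath workflow: the function name extract_folder is made up, and the single-file reader is passed in as a parameter because the actual PDF reading step is outside the scope of the example.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple


def extract_folder(
    folder: Path,
    read_pdf_text: Callable[[Path], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run a single-file reader over every .pdf in `folder`.

    Returns (texts, errors), both keyed by file name.
    """
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    # sorted() gives a stable processing order; the "*.pdf" pattern is
    # case-sensitive on some platforms.
    for pdf_path in sorted(folder.glob("*.pdf")):
        try:
            texts[pdf_path.name] = read_pdf_text(pdf_path)
        except Exception as exc:
            # One unreadable file should not stop the whole batch.
            errors[pdf_path.name] = str(exc)
    return texts, errors
```

Collecting failures separately instead of stopping at the first one is a design choice that suits unattended runs over large folders, where a single corrupt file should not block the rest.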
To sum up, the activities and methods above should allow you to handle most PDF extractions you’ll be faced with. There are a couple more activities, like ‘Find Relative Element’ and ‘Scrape Relative’, which you can discover on your own. UiPath is an advanced tool for easy PDF data extraction and automation.
If you’re dealing a lot with scanned documents, you may want to have a look at UiPath’s ‘Image-Based Automation’ video tutorials.