Develop Methods for Automated Data Extraction into a Machine-Readable Database for Subsequent Data Query and Reporting
The Geological Survey of Queensland (GSQ) within the Queensland Department of Natural Resources, Mines and Energy (DNRME) has a vast library of industry reports, collected predominantly as part of the legislative requirements for exploration and production companies in the State. The library contains approximately 100,000 reports, each with multiple supporting documents and appendices. These reports contain a wealth of knowledge and are the only source for most of this information. To extract this information for use by both industry and government, an individual previously had to read each report, identify relevant components, and manually transcribe them into a database, which was extremely time-consuming and introduced significant potential for error. This project aimed to automate the digitisation of the existing documents, i.e. the process of extracting text, tables, and pictures into a searchable form.
The project partners were the Queensland Department of Natural Resources, Mines and Energy (DNRME) and Queensland University of Technology (QUT).
Ultimately, the GSQ aimed to make the wealth of data currently sitting inside non-confidential static reports open and accessible to stakeholders. Extraction of the various forms of data, including text, tables and pictures, was performed in stages using a suite of data extraction, information retrieval and text mining tools and methods. As data was extracted and made ‘searchable’, it was integrated into the GSQ Open Data Portal (Horizon).
This project had two major outputs:
- The Converted Reports
  - More than 60,000 documents were converted to Word or text files, but an average of 15% of documents for each year failed to convert with the selected methods.
  - Formats converted included PDF, TIFF and JPG files as well as image thumbnails.
  - Some of the converted documents were from as far back as the 1920s.
- The Interactive Information Extraction (IIE) System
  - IIE indexes the converted reports and enables DNRME users to search for information intelligently and intuitively.
  - The system has two components: back-end and front-end.
    - The back-end includes conversion of the original documents, data storage, and indexing of the extracted text.
    - The front-end includes a dashboard to search the converted documents, and visualisation of the search results as Text, Image, Table or Concept map.
  - The IIE system code was provided to DNRME along with three user manuals (a user guide, an installation guide and a technical guide for developers) to support its functionality and usage.
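The indexing step in the back-end can be illustrated with a minimal sketch of an inverted index, which maps each word to the documents that contain it. This is not the IIE implementation, whose internals are not described here; the function names, the report ids and the sample text are all hypothetical, and a production system would more likely rely on an established search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical extracted text, keyed by made-up report ids.
documents = {
    "CR0001": "Well completion report for a coal seam gas well",
    "CR0002": "Annual exploration report: gold and copper sampling",
    "CR0003": "Coal exploration drilling, seam thickness table",
}
index = build_index(documents)
print(search(index, "coal seam"))  # ['CR0001', 'CR0003']
```

A query is answered by intersecting the document sets of its words, so the cost depends on how many documents contain those words rather than on the size of the whole collection.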
The industry reports contain a wealth of knowledge and are the only source for most of this information. It was previously estimated that 80% of a typical project timeframe was devoted to the extraction, quality assurance and quality control of data from the reports. The IIE system saves users considerable time by automating this process. Access to this converted and searchable document collection has also enabled a range of new insights and forecasting using data mining and statistical modelling.
This first stage of work provides a foundation for scaling the data extraction methods. Areas of future work could include:
- Decreasing the number of documents that failed the conversion process, potentially using machine learning-based solutions.
- Evaluating the quality of conversions with an automated process.
- Indexing images within a document. Currently only images stored as standalone documents (JPGs) are indexed and searchable; images embedded within a document are not considered. Embedded images pose many challenges and may require machine learning approaches focussed on the extraction of image captions.
- Developing a table caption identification model.
- Enhancing the user search functionality.
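As one possible starting point for automated quality evaluation, the sketch below scores converted text by the fraction of tokens that look like plain words or numbers, on the assumption that failed or poor conversions tend to produce many garbled tokens. This heuristic is an illustration only, not a method used in the project; the function names and the 0.7 threshold are hypothetical and would need tuning against manually checked conversions.

```python
import re

# A token counts as "word-like" if it is 1-20 letters or a number,
# optionally followed by one punctuation mark (assumed heuristic).
WORDLIKE = re.compile(r"(?:[A-Za-z]{1,20}|\d+(?:\.\d+)?)[.,;:]?")


def quality_score(text):
    """Return the fraction of whitespace-separated tokens that are word-like."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = sum(1 for token in tokens if WORDLIKE.fullmatch(token))
    return good / len(tokens)


def needs_review(text, threshold=0.7):
    """Flag a conversion whose score falls below the (hypothetical) threshold."""
    return quality_score(text) < threshold


clean = "The well was drilled to a depth of 1250 metres."
garbled = "Th3 w#ll w@s dr1ll3d ~~ |||"
print(quality_score(clean), needs_review(clean))
print(quality_score(garbled), needs_review(garbled))
```

Documents flagged this way could be routed to manual checking or to an alternative conversion method.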
To extend the IIE System, the QUT team has offered masters coursework student projects that apply data mining to the converted document collection, both to gain insights from this vast collection and to improve the accuracy of data extraction. The student projects in semester 2, 2020 were:
- Topic modelling with spatial-temporal pattern mining.
- Grouping documents with a certain style/format or from a company.
- Filling a digital well card automatically or semi-automatically.
Project reports detailing useful insights from this vast collection of data were provided to GSQ.
To learn more, contact FrontierSI at email@example.com