Investigate and develop state-of-the-art Machine Learning / Deep Learning-based automated information extraction tools to increase the discoverability of industry reports held within the Geological Survey of Queensland archive.
The Challenge
The Geological Survey of Queensland (GSQ), within the Queensland Department of Resources, holds a vast archive of industry reports, collected predominantly under the legislative requirements for mineral and petroleum exploration and production companies in the State. These reports contain a wealth of information, including exploration rationales, detailed discussions of analytical results, and data on the methods and results of the activities they describe.
Historically, the extraction and transcription of information from these reports into searchable databases, for both industry and government use, was a manual and error-prone process, with up to 80% of a typical project timeframe devoted to the extraction, quality assurance and quality control of data from reports.
Building on an earlier collaboration between the Queensland Government, FrontierSI and QUT to automate the digitisation of the existing documents, an approach was sought to further automate the extraction of data in a publishable form.
Partners
This project was a partnership between FrontierSI, Queensland Department of Resources, Queensland University of Technology and Australian Spatial Analytics.
The Solution
The Industry Report Value Extraction project investigated and developed automated information extraction tools, methodologies and capabilities, powered by state-of-the-art Machine Learning / Deep Learning-based language methods, models and algorithms being developed at QUT, to increase the amount of searchable data available to stakeholders.
To achieve this, the project set out to investigate the options available for extracting data and contextual concepts from reports, test the approaches and, where successful, determine the feasibility and scalability of automating data extraction and apply the chosen methodologies to the datasets to produce structured data output for publication.
The project brought together the extensive Machine Learning (ML) and Deep Learning (DL) research experience and capabilities of the QUT Centre for Data Science with the growing data analysis and annotation capabilities of Australian Spatial Analytics, a Queensland-based registered not-for-profit social enterprise dedicated to training and employing remarkable young data analysts with autism.
The outcomes fell into three distinct components, each with a different set of methodologies, tools, outcomes and outputs:
- The development and evaluation of training data and deep learning models to perform key-value extractions from standardised reports.
- The decomposition, classification and tagging of non-standardised reports to make the reports discoverable and useful.
- Identification of high-value data tables from reports and extraction of the most relevant information.
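To make the first component more concrete, the following is a minimal sketch of the kind of structured output that key-value extraction from a standardised report aims for. It is not the project's method: the project used deep learning models, whereas this sketch swaps in a simple rule-based baseline using regular expressions. The field labels, the sample text and the function name `extract_key_values` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical field labels; the real report templates are not described in the source.
FIELDS = ["Well Name", "Operator", "Total Depth"]


def extract_key_values(text, fields=FIELDS):
    """Return {field: value} for each 'Field: value' line found in text."""
    result = {}
    for field in fields:
        # Match the label at the start of a line, a colon, then the rest of that line.
        pattern = re.compile(
            rf"^\s*{re.escape(field)}\s*:\s*(.+?)\s*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        if match:
            result[field] = match.group(1)
    return result


# Made-up sample text standing in for a standardised report header.
sample = """Well Name: Example-1
Operator:  Example Petroleum Pty Ltd
Total Depth: 1234 m
"""

record = extract_key_values(sample)
```

A rule-based approach like this only works when labels and layout are consistent, which is one reason a learned model may be preferable for reports whose formatting varies.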
Impact
GSQ achieved significant value from the project. The outcomes of the deep learning models will support the creation of publishable products, such as a curated and quality-assured directory of summary petroleum reports. The outcomes of the report classification and tagging allow the GSQ team to better understand the complexities of data extraction from unstructured datasets, while providing GSQ’s data scientists with the tools and methods to further develop and publish products. The project also identified opportunities to use large language models to improve value extraction processes, which GSQ’s data scientists are now investigating.
Contact
To learn more, contact FrontierSI at contact@frontiersi.com.au.