Rule-Based Automatic Generation of Electronic Theses and Dissertations Metadata

At universities, postgraduate students are required to submit theses and dissertations upon completing their studies. These are uploaded electronically to the institutional repository, and their metadata is generated and entered manually by authorised staff. This manual process has resulted in missing metadata, which makes it difficult for lecturers and the general public (students, postgraduates, etc.) to find certain metadata elements for some records in the repository. Theses and dissertations are a rich and unique source of information, so they deserve careful attention when they are uploaded. To address this problem, this paper proposes and applies automatic metadata generation to identify the metadata missing from records in the electronic theses and dissertations (ETDs) section. Missing metadata elements were identified by harvesting metadata from the UNZA repository using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a widely adopted approach to metadata harvesting [8]. This involved retrieving the metadata with an OAI-PMH URL validator, downloading the records with a Git Bash script, and analysing the data in an Excel spreadsheet. To identify where the missing metadata elements appear in the manuscripts, the Directorate of Research and Graduate Studies (DRGS) guidelines were reviewed and 60 ETDs were randomly sampled from the UNZA repository. To determine the appropriate extraction method, the acknowledgements pages were first extracted from the 4149 PDF files, then converted to text, and finally loaded into a pandas DataFrame. Rule-based matching with the spaCy library was then used in a Python script to extract the contributor (advisor) metadata.
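The rule-based extraction of advisor names described above can be illustrated with a minimal sketch. It is not the authors' script: it swaps spaCy's matcher for a plain regular-expression rule from the Python standard library, and the title list, the cue words, the 80-character context window, and the sample names are all assumptions made for illustration.

```python
import re

# Hypothetical rule: an academic or courtesy title followed by one to three
# name tokens (a capitalised word or an initial such as "J.").
TITLE = r"(?:Professor|Prof|Dr|Mrs|Mr|Ms)\.?"
NAME_TOKEN = r"(?:[A-Z]\.|[A-Z][a-zA-Z'-]+)"
NAME = rf"{NAME_TOKEN}(?:\s+{NAME_TOKEN}){{0,2}}"
PERSON = re.compile(rf"\b(?P<title>{TITLE})\s+(?P<name>{NAME})")

# Assumed cue words that mark a person as a supervisor or advisor.
CUES = re.compile(r"\b(?:supervisors?|advisors?)\b", re.IGNORECASE)


def extract_supervisors(acknowledgement: str, window: int = 80) -> list[str]:
    """Return titled names that appear shortly after a supervisor cue word."""
    found = []
    for match in PERSON.finditer(acknowledgement):
        # Look only at the text just before the name for a cue word.
        context = acknowledgement[max(0, match.start() - window):match.start()]
        if CUES.search(context):
            # Normalise internal whitespace but keep the title.
            found.append(" ".join(match.group(0).split()))
    return found


# Made-up acknowledgement text, not taken from the UNZA repository.
sample = "I am grateful to my supervisor, Dr. Jane Banda, for her guidance."
print(extract_supervisors(sample))
```

Because the title is part of the rule, it stays attached to the extracted name, and text with no supervisor cue yields an empty list, so such a record can simply be skipped.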
The Excel analysis showed that only eleven of the fifteen standard Dublin Core elements were exported from OpenRefine. In addition, metadata elements such as contributor, source, coverage and rights were frequently missing. Analysis of the DRGS guidelines and of the 60 randomly sampled records from 12 schools showed that these metadata elements are found in the Approval and Acknowledgements sections of the manuscript, most often in the Acknowledgements. When extracting supervisor details from the acknowledgements, the software library omitted the title (salutation) preceding each name. This is because spaCy's pre-set language model identifies the name itself within a sentence, not the title attached to it. In addition, records with no supervisor details were skipped automatically by the script. In conclusion, automatic extraction of metadata from the manuscript is more effective than the manual process. This conclusion is based on an evaluation using natural language processing metrics such as BLEU scores, which compare machine-generated results against human-generated results.
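The completeness check against the fifteen Dublin Core elements can be sketched as follows. This is an illustration under stated assumptions, not the Excel and OpenRefine workflow used in the study: it parses an OAI-PMH `ListRecords` response in the `oai_dc` format with the Python standard library and counts, per element, how many records lack a non-empty value. The function name `missing_counts` and the two sample records are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def missing_counts(xml_text: str) -> tuple[int, Counter]:
    """Return the record count and, per element, how many records lack it."""
    root = ET.fromstring(xml_text.strip())
    records = root.findall(f".//{{{OAI_NS}}}record")
    missing = Counter()
    for record in records:
        # Collect Dublin Core elements that carry a non-empty value.
        present = {
            el.tag.split("}", 1)[1]
            for el in record.iter()
            if el.tag.startswith(f"{{{DC_NS}}}") and (el.text or "").strip()
        }
        for name in DC_ELEMENTS:
            if name not in present:
                missing[name] += 1
    return len(records), missing


# Made-up response with two records; real data would come from the repository.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Sample thesis A</dc:title>
        <dc:creator>Author A</dc:creator>
        <dc:contributor>Supervisor A</dc:contributor>
      </oai_dc:dc>
    </metadata></record>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Sample thesis B</dc:title>
        <dc:creator>Author B</dc:creator>
        <dc:contributor></dc:contributor>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>
"""

total, missing = missing_counts(SAMPLE)
for name in DC_ELEMENTS:
    print(f"{name}: missing in {missing[name]} of {total} records")
```

An element that is present but empty is counted as missing, which matches the practical concern of records whose fields were never filled in.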
Capstone Research Project Report
University of Zambia
Lusaka, Zambia
Nyambe, Frazer, Nkole Mulenga, Richard Mufuzi, and Geoffrey Ngoma. 2022. “Rule-Based Automatic Generation of Electronic Theses and Dissertations Metadata.” Lusaka, Zambia: University of Zambia.