Transform a vocabulary into RVA editor spreadsheet format from any other spreadsheet/CSV format
1. Introduction
To support our partners in making their vocabularies available and browsable via the RVA portal, we provide this guide and an accompanying ingestion template, which together outline a transformation and ingestion process. Section 3 walks through a completed example of this process. If you have any additional questions about the transformation or ingestion of your vocabulary, please contact [email protected][email protected].orgedu.au.
2. Getting started: Questions about your vocabulary
File formats
In what format is the vocabulary currently being maintained/stored?
Examples of formats in which a vocabulary may be stored: Spreadsheet, CSV, PDF, text, HTML, RDF, database tables, etc.
Has ARDC already developed a transformation and ingestion process for that format? Current processes can be found here.
Has your organisation developed a process to transform the current format to RDF? If not, we will work with you to develop a process.
Note: This guide covers vocabularies whose semantic model can be adequately expressed within the constraints of a spreadsheet or comma-separated values (CSV) format. For information about the transformation and ingestion of vocabularies that are maintained/stored in other formats, please consult our other transformation and ingestion guides.
Concept definitions
How are the vocabulary concepts currently being described?
What are the elements used to describe metadata about the concepts?
What do these elements mean?
How do the current elements used to describe metadata about the concepts map to the vocabulary ingestion template?
In order for your vocabulary to be ingested into Research Vocabularies Australia, the information provided in the original format needs to be translated into the ingestion template provided below.
File: Vocabulary ingestion template [spreadsheet-csv].xlsx
The template allows ARDC partners to indicate what information about the vocabulary should be captured within the following elements:
And the following tag:
Language | @lang
Info: This is not a complete list of all elements which can be captured for your vocabulary in the ARDC Vocabulary Service. If your organisation captures extra information that does not fall under the listed elements or tag, we can work with you to create a solution for including that information in your transformation. Please contact [email protected][email protected].orgedu.au if you have any questions about your transformation process.
Hierarchical structure
For the vocabulary to be ingested properly, the hierarchy (the narrower and broader relationships between concepts) must be recorded in a machine-readable way. This may require some reorganisation of the concepts before they are entered into the ingestion template.
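As an illustration of what "machine-readable" can mean here, the following minimal sketch gives every concept an explicit pointer to its broader concept. It assumes a vocabulary whose codes are hierarchical, so that a parent's code is a prefix of its children's codes; that assumption, the function name `broader_by_prefix` and the sample codes are illustrative only and are not part of the ingestion template.

```python
def broader_by_prefix(notations):
    """Map each code to the longest other code that is a proper prefix of it."""
    codes = set(notations)
    broader = {}
    for code in notations:
        parent = None
        # Try progressively shorter prefixes until one matches a known code.
        for length in range(len(code) - 1, 0, -1):
            if code[:length] in codes:
                parent = code[:length]
                break
        broader[code] = parent  # None marks a top-level concept
    return broader


# Illustrative codes only; substitute the codes from your own vocabulary.
codes = ["03", "0301", "030101", "0302"]
for code, parent in broader_by_prefix(codes).items():
    print(code, "->", parent)
```

Vocabularies whose hierarchy is expressed by indentation or column position rather than by code prefixes would need a different rule, but the goal is the same: every narrower concept names its broader concept explicitly.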
Additional preprocessing considerations
What preprocessing needs to be done?
Are there any additional requirements of the vocabulary owners or other stakeholders that might impact the transformation or ingestion of the vocabulary into the ARDC Vocabulary Service?
Have all non-ingestible (non-ASCII) symbols been removed?
Is the vocabulary multilingual (does it include content in multiple languages)? If so, please provide ARDC with a list of languages used in the vocabulary prior to ingestion.
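The non-ASCII check above can be partly automated. The sketch below is one possible approach, not an ARDC tool: it reads CSV text and reports the row, column and character of every non-ASCII symbol so that a person can decide how to handle each one. The function name `find_non_ascii` and the sample rows are hypothetical.

```python
import csv
import io


def find_non_ascii(csv_text):
    """Return (row, column, character) for every non-ASCII character found."""
    hits = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column_number, cell in enumerate(row, start=1):
            for character in cell:
                if ord(character) > 127:  # outside the 7-bit ASCII range
                    hits.append((row_number, column_number, character))
    return hits


# Hypothetical sample data; the second data row contains an accented letter.
sample = "concept,notation\nAnalytical Chemistry,0301\nCafé studies,9999\n"
print(find_non_ascii(sample))
```

Row and column numbers are 1-based so that they line up with what a spreadsheet application displays.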
3. ANZSRC-FOR: An example transformation
The process of vocabulary transformation and ingestion has been performed on the ANZSRC-FOR vocabulary, and the artefacts from that process are provided here as an example for future use.
Info: This is just one example of the transformation of a vocabulary, and is meant to be used as a learning tool. The steps taken in order to transform your vocabulary may vary from those outlined below. Please contact [email protected][email protected].orgedu.au if you have any questions about your transformation process.
In what format is the vocabulary currently being maintained/stored?
The vocabulary was initially provided as an Excel spreadsheet:
Embedded Google Drive File docid 1kXz1696Lqd50Lab_x_Ex-cHtDNuwNN4FOAUH0wEZUYI
How are the vocabulary concepts currently being described?
The ANZSRC-FOR vocabulary spreadsheet includes the title of the vocabulary, some column headings to explain how to read the spreadsheet (shown below in grey cells), names of vocabulary concepts, and codes that correspond to the concepts. This structure and code scheme is explained in detail by the Australian Bureau of Statistics here.
Preprocessing of the vocabulary
Examination of the vocabulary in its original spreadsheet format reveals that the given column headings (shown below in grey cells) do not provide all of the information we need to transform the spreadsheet. There are multiple types of information recorded in individual columns:
In order for ANZSRC-FOR to be ingested into the ARDC Vocabulary Service, the content provided in the original spreadsheet needs to be entered into the ingestion template provided by ARDC. The template allows us to indicate what original vocabulary content should be captured within the following elements:
URI | <uri>
Concept | <concept>
Notation | <notation>
Info: This is not a complete list of all elements which can be captured for your vocabulary in the ARDC Vocabulary Service. If your organisation captures extra information that does not fall under the listed elements or tag, we can work with you to create a solution for including that information in your transformation. Please contact [email protected][email protected].orgedu.au if you have any questions about your transformation process.
In the case of ANZSRC-FOR, the elements used are unique identifier, concept and notation. Because the original ANZSRC-FOR spreadsheet doesn’t include content such as concept definitions or alternate labels for concepts (and in fact, these pieces of information don’t exist for this particular vocabulary), those columns are left blank in the completed ANZSRC-FOR ingestion template example.
The ingestion template allows ARDC partners to capture, in one document, both the hierarchical structure of their vocabulary and the metadata about its concepts.
Concept metadata
A number of steps were performed in order to properly record metadata about the ANZSRC-FOR concepts in the ingestion template.
1. The preferred labels were pulled from columns B, C and D of the original spreadsheet and pasted into the column titled “concept” in the template. The codes corresponding to the preferred labels (pulled from columns A, B and C of the original spreadsheet) were pasted into the column titled “notation”, ensuring that each code was pasted into the same row as its label.
2. Because ANZSRC-FOR is a monolingual vocabulary (English language), no language tags are necessary.
3. Unique identifiers for each concept were created from the preferred labels: all punctuation and any text within parentheses were deleted, and the spreadsheet functions =lower and =substitute were used to make the labels lowercase and to insert hyphens between the words.
For example:
- Preferred label of Analytical Chemistry becomes unique identifier analytical-chemistry
- Preferred label of Automotive Combustion and Fuel Engineering (incl. Alternative/Renewable Fuels) becomes unique identifier automotive-combustion-and-fuel-engineering
4. The unique identifiers were then used to create a URI for each concept (using the predefined URI structure), and each URI was inserted into the URI column of the spreadsheet in the same row as its label.
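The steps above were carried out with spreadsheet functions; for readers who prefer a script, the following Python sketch performs a comparable transformation. It is an illustration under stated assumptions, not the process ARDC used: it assumes each source row holds exactly one code with its label in the next column to the right, `BASE_URI` is a hypothetical placeholder for the predefined URI structure (which this guide does not spell out), and the sample rows are illustrative and should be checked against the original spreadsheet.

```python
import re

# Hypothetical placeholder; substitute the URI structure agreed for your vocabulary.
BASE_URI = "http://example.org/vocab/"


def flatten_rows(rows):
    """Collect (label, code) pairs from rows whose code sits one cell left of its label."""
    pairs = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        for index in range(len(cells) - 1):
            if cells[index]:
                # First non-empty cell is the code; the label is in the next column.
                pairs.append((cells[index + 1], cells[index]))
                break
    return pairs


def make_identifier(label):
    """Lowercase a label, drop parenthesised text and punctuation, hyphenate words."""
    without_parentheses = re.sub(r"\([^)]*\)", "", label)
    without_punctuation = re.sub(r"[^\w\s-]", "", without_parentheses)
    return "-".join(without_punctuation.lower().split())


# Illustrative rows in the assumed staggered layout (columns A to D).
source_rows = [
    ("03", "Chemical Sciences", "", ""),
    ("", "0301", "Analytical Chemistry", ""),
    ("", "", "030101", "Analytical Spectrometry"),
]

# One template row per concept, with code and label kept together.
template_rows = [
    {"uri": BASE_URI + make_identifier(label), "concept": label, "notation": code}
    for label, code in flatten_rows(source_rows)
]

for row in template_rows:
    print(row["uri"], row["concept"], row["notation"], sep=" | ")
```

Applied to the two labels quoted in step 3, `make_identifier` yields `analytical-chemistry` and `automotive-combustion-and-fuel-engineering`. Identifiers generated this way are only unique if the preferred labels are, so it is worth checking for duplicates before building URIs.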
Completion of ANZSRC-FOR example
The completed vocabulary ingestion template for the ANZSRC-FOR is available in spreadsheet format and CSV format.
Embedded Google Drive File docid 1O94js62ailsBEsnlwLl-Gl3Nz2ehszxu58pI7HpU3UA