To help our partners make their vocabularies available and browsable via the RVA portal, we provide this guide and an accompanying ingestion template, which together outline a transformation and ingestion process. Section 3 walks through a completed example of this process. If you have any additional questions about the transformation or ingestion of your vocabulary, please contact [email protected].
In what format is the vocab currently being maintained/stored?
Examples of formats in which a vocabulary may be stored: Spreadsheet, CSV, PDF, text, HTML, RDF, database tables, etc.
Has ARDC already developed a transformation and ingestion process for that format? Current processes can be found here.
Has your organisation developed a process to transform the current format to RDF? If not, we will work with you to develop a process.
Note: This guide covers vocabularies whose semantic model can be adequately expressed within the constraints of a spreadsheet or comma-separated values (CSV) format. For information about the transformation and ingestion of vocabularies that are maintained or stored in other formats, please consult our other transformation and ingestion guides.
How are the vocabulary concepts currently being described?
What are the elements used to describe metadata about the concepts?
What do these elements mean?
How do the current elements used to describe metadata about the concepts map to the vocabulary ingestion template?
In order for your vocabulary to be ingested into Research Vocabularies Australia, the information provided in the original format needs to be translated into the ingestion template provided below.
The template allows ARDC partners to indicate what information about the vocabulary should be captured within the following elements:
|
And the following tag:
Language | @lang |
---|---|
This is not a complete list of all elements which can be captured for your vocabulary in the ARDC Vocabulary Service. If your organisation captures extra information that does not fall under the listed elements or tag, we can work with you to create a solution for including that information in your transformation. Please contact [email protected] if you have any questions about your transformation process.
In order for the vocabulary to be ingested properly, the hierarchy (narrower and broader nature of the concepts) must be notated in a machine-readable way. This may require some reorganization of the concepts for insertion into the ingestion template.
What preprocessing needs to be done?
Are there any additional requirements of the vocab owners or other stakeholders that might impact the transformation or ingestion of the vocabulary into the ARDC Vocabulary Service?
Have all non-ingestible (non-ASCII) symbols been removed?
Is the vocabulary multilingual (does it include content in multiple languages)? If so, please provide ARDC with a list of languages used in the vocabulary prior to ingestion.
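The non-ASCII check above can be automated before a CSV export is submitted. The following is a minimal sketch, not an ARDC tool: the function name `find_non_ascii` and the sample rows (including the made-up label "Café studies") are assumptions for illustration only.

```python
import csv
import io


def find_non_ascii(csv_text):
    """Return (row, column, cell) for every cell that contains a non-ASCII character.

    Row and column numbers are 1-based so they match what a spreadsheet shows.
    """
    hits = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column_number, cell in enumerate(row, start=1):
            if not cell.isascii():
                hits.append((row_number, column_number, cell))
    return hits


# Hypothetical sample data; the second data row contains an accented character.
sample = "concept,notation\nPure Mathematics,0101\nCafé studies,9999\n"
for row_number, column_number, cell in find_non_ascii(sample):
    print(f"row {row_number}, column {column_number}: {cell}")
```

Reporting the location of each offending cell, rather than silently stripping characters, leaves the decision about how to replace them with the vocabulary owner.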
The process of vocabulary transformation and ingestion has been performed on the ANZSRC-FOR vocabulary, and the artefacts from that process are provided here as an example for future use.
This is just one example of the transformation of a vocabulary, and is meant to be used as a learning tool. The steps taken in order to transform your vocabulary may vary from those outlined below. Please contact [email protected] if you have any questions about your transformation process.
The vocabulary was initially provided as an Excel spreadsheet:
The ANZSRC-FOR vocabulary spreadsheet includes the title of the vocabulary, some column headings to explain how to read the spreadsheet (shown below in grey cells), names of vocabulary concepts, and codes that correspond to the concepts. This structure and code scheme is explained in detail by the Australian Bureau of Statistics here.
Examination of the vocabulary in its original spreadsheet format reveals that the given column headings (shown below in grey cells) do not provide all of the information we need to transform the spreadsheet. There are multiple types of information recorded in individual columns:
In order for ANZSRC-FOR to be ingested into the ARDC Vocabulary Service, the content provided in the original spreadsheet needs to be entered into the ingestion template provided by ARDC. The template allows us to indicate what original vocabulary content should be captured within the following elements:
URI | <uri> |
---|---|
Concept | <concept> |
Notation | <notation> |
This is not a complete list of all elements which can be captured for your vocabulary in the ARDC Vocabulary Service. If your organisation captures extra information that does not fall under the listed elements or tag, we can work with you to create a solution for including that information in your transformation. Please contact [email protected] if you have any questions about your transformation process.
In the case of ANZSRC-FOR, the elements used are unique identifier, concept and notation. Because the original ANZSRC-FOR spreadsheet doesn’t include content such as concept definitions or alternate labels for concepts (and in fact, these pieces of information don’t exist for this particular vocabulary), those columns are left blank in the completed ANZSRC-FOR ingestion template example.
The ingestion template allows for ARDC partners to capture information about the hierarchical structure of their vocabulary and metadata about the concepts in one document.
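As an illustration of what machine-readable hierarchy can look like, the sketch below derives each concept's broader concept from its code. It assumes a notation scheme in which every level appends two digits to its parent's code (two-digit, four-digit and six-digit codes, as in ANZSRC-FOR); the function name `broader_notation` is hypothetical, and how the ingestion template itself records hierarchy is not shown here.

```python
def broader_notation(notation):
    """Return the code of the broader (parent) concept, or None at the top level.

    Assumption: each level of the hierarchy adds exactly two digits to the
    parent's code, so dropping the last two digits yields the parent.
    """
    if len(notation) <= 2:
        return None
    return notation[:-2]


# Illustrative codes at the three levels of the hierarchy.
for code in ["01", "0101", "010101"]:
    print(code, "->", broader_notation(code))
```

A vocabulary whose codes do not encode the hierarchy would need the broader and narrower relationships recorded explicitly for each concept instead.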
A number of steps were performed in order to properly record metadata about the ANZSRC-FOR concepts in the ingestion template.
1. The preferred labels were pulled from columns B, C and D of the original spreadsheet and pasted into the column titled “concept” in the template. The codes corresponding to the preferred labels (pulled from columns A, B and C of the original spreadsheet) were pasted into the column titled “notation”, ensuring that each code was pasted into the same row as its label.
For example:
|
becomes:
|
2. Because ANZSRC-FOR is a monolingual vocabulary (English language), no language tags are necessary.
3. Unique identifiers for each concept were created from the preferred labels. All punctuation and any text within parentheses was deleted, and the =lower and =substitute spreadsheet functions were used to make the labels all lowercase and to insert hyphens between the words.
For example:
4. The unique identifiers were used to create a URI for each concept (using the predefined URI structure), and each URI was inserted into the URI column of the spreadsheet, in the same row as its label.
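The steps above were carried out by hand in a spreadsheet, but the same logic can be scripted. The following is a minimal sketch under stated assumptions: the base URI `http://example.org/vocab/` is a placeholder rather than the predefined URI structure, the function names are hypothetical, and the sample rows are illustrative. It assumes each source row holds a code in one column and its label in the next column to the right, as in columns A to D described in step 1.

```python
import re

# Placeholder only; substitute the URI structure agreed for your vocabulary.
BASE_URI = "http://example.org/vocab/"


def flatten_row(cells):
    """Return (notation, label): the first non-empty cell and the cell after it."""
    for index, cell in enumerate(cells[:-1]):
        if cell.strip():
            return cell.strip(), cells[index + 1].strip()
    return None


def make_identifier(label):
    """Build an identifier: drop parenthesised text and punctuation, lowercase, hyphenate."""
    text = re.sub(r"\([^)]*\)", "", label)    # delete any text within parentheses
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s-]", "", text)  # delete punctuation
    words = [word for word in re.split(r"[\s-]+", text) if word]
    return "-".join(words)


def to_template_row(cells):
    """Return a dict with the uri, concept and notation values for one source row."""
    notation, label = flatten_row(cells)
    return {
        "uri": BASE_URI + make_identifier(label),
        "concept": label,
        "notation": notation,
    }


# Illustrative rows laid out like columns A to D of the source spreadsheet.
source_rows = [
    ["01", "MATHEMATICAL SCIENCES", "", ""],
    ["", "0101", "Pure Mathematics", ""],
    ["", "", "010101", "Algebra and Number Theory"],
]
for cells in source_rows:
    print(to_template_row(cells))
```

Identifiers built only from labels will collide if two concepts share a label, so it is worth checking the generated URIs for duplicates before ingestion.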
The completed vocabulary ingestion template for the ANZSRC-FOR is available in spreadsheet format and CSV format.