L2/01-143
ISO/IEC JTC 1/SC 2/WG 2 N 2340
The aim of this project is to
develop methods and standards for data entry and encoding of dialect
transcripts. Though independent, the project is associated with other Swedish
dialect documentation projects.
The archives of Swedish dialect
research institutions contain large collections of recordings and transcripts. Most
of these transcripts are written in a unique phonetic alphabet, the Swedish
dialect alphabet, specially designed in 1878 by Professor J. A. Lundell of
Uppsala. Only a limited number of scholars are able to read these transcripts,
which contain a large amount of information of both linguistic and ethnological
interest. To make these dialect collections accessible to the general public,
various kinds of software tools are needed for the encoding, conversion,
search, and display of these texts.
The project is divided into three
subprojects: Encoding tools, Character codes, and Text conversion.
1. Encoding tools
This subproject involves the
creation of detailed standards for dialect recording transcript formats and the
development of software and other tools for entering dialect texts.
• Making an XML schema for
dialect transcripts, containing tags for various types of metadata.
• Collecting information about
existing software tools: fonts, keyboard modifiers, OCR software, etc.
• Developing new software
where existing software is insufficient.
• Creating manuals and other
types of training material for the people who carry out the actual work of
entering, importing, and cataloguing dialect transcripts.
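As an illustration of the first item above, the following is a minimal sketch of a transcript record with metadata tags, written and read back with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`transcript`, `metadata`, `text`, `u`, `speaker`) and the sample values are assumptions made for this example only; they are not the project's schema.

```python
import xml.etree.ElementTree as ET


def build_transcript(metadata, utterances):
    """Build a transcript element from a metadata dict and (speaker, text) pairs.

    All tag names here are hypothetical placeholders, not the project's schema.
    """
    root = ET.Element("transcript")
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        # One child element per metadata field, e.g. <parish>...</parish>.
        ET.SubElement(meta, key).text = value
    body = ET.SubElement(root, "text")
    for speaker, content in utterances:
        utterance = ET.SubElement(body, "u", {"speaker": speaker})
        utterance.text = content
    return root


def read_metadata(root):
    """Return the metadata fields of a transcript element as a dict."""
    return {child.tag: child.text for child in root.find("metadata")}


# Invented sample values, for illustration only.
doc = build_transcript(
    {"parish": "Example parish", "recorded": "1935", "transcriber": "N.N."},
    [("informant", "placeholder utterance")],
)
xml_text = ET.tostring(doc, encoding="unicode")
restored = ET.fromstring(xml_text)
```

Keeping the metadata in its own element, separate from the transcribed text, would let cataloguing tools read it without interpreting the phonetic characters.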
2. Character codes
This subproject involves
documenting the phonetic values of all the characters of the Swedish
dialect alphabet and including these characters in the ISO 10646/Unicode
character code standard.
• Investigating actual character
usage in different dialects and various types of dialect texts. This phase
should include cataloguing the usage of similar dialect alphabets in Denmark
and Norway.
• Making translation tables
between the Scandinavian dialect alphabets and the International Phonetic
Alphabet (IPA).
• Making a proposal for including
the characters of the dialect alphabets in ISO 10646/Unicode, the international
character code standard.
• Making Unicode-encoded fonts
for free distribution, containing all the Scandinavian dialect characters.
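A translation table of the kind listed above could be sketched as a simple character mapping. In this sketch the dialect characters are stood in for by Unicode Private Use Area code points, and the IPA values assigned to them are invented; both are assumptions for illustration and say nothing about the real correspondences, which the subproject sets out to document.

```python
# Hypothetical table: Private Use Area code points stand in for dialect
# characters, and the IPA values on the right are invented examples.
DIALECT_TO_IPA = {
    "\uE000": "ʂ",
    "\uE001": "ɖ",
    "\uE002": "ɑː",  # one dialect character may map to several IPA characters
}


def to_ipa(text, table=DIALECT_TO_IPA):
    """Replace every character found in the table; leave all others as they are."""
    return text.translate(str.maketrans(table))


def unmapped(text, table=DIALECT_TO_IPA):
    """List Private Use Area characters in the text that the table does not cover."""
    return sorted(
        {ch for ch in text if "\uE000" <= ch <= "\uF8FF" and ch not in table}
    )
```

A check such as `unmapped` is useful while the table is still being compiled, since it shows which characters found in real transcripts have no entry yet.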
3. Text conversion
This subproject involves the
development of sets of rules for conversion between various orthographic
systems, e.g. conversion from the Swedish dialect alphabet into IPA or
'phonetic spelling', i.e. ordinary alphabetic characters indicating the pronunciation.
There will probably be a large number of rule sets, due to phonetic
differences between the dialects and the demand for different levels of
accuracy.
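One way to organise such rule sets is as ordered lists of rewrite rules, selected by dialect and target system. The sketch below assumes this layout; the dialect name, the rules themselves, and the use of ordinary letters as input are invented for illustration and do not describe any actual dialect or the project's rule format.

```python
import re

# Hypothetical rule sets keyed by (dialect, target system). Each rule is a
# (regular expression, replacement) pair, and rules are applied in order.
RULE_SETS = {
    ("example-dialect", "ipa"): [
        (r"rs", "ʂ"),
        (r"k(?=[ei])", "ɕ"),  # context-sensitive: only before e or i
    ],
    ("example-dialect", "phonetic-spelling"): [
        (r"rs", "sch"),
        (r"k(?=[ei])", "tj"),
    ],
}


def convert(text, dialect, target):
    """Apply the rule set for the given dialect and target, one rule at a time."""
    for pattern, replacement in RULE_SETS[(dialect, target)]:
        text = re.sub(pattern, replacement, text)
    return text
```

Because each rule sees the output of the previous one, the order of the rules is part of the rule set. A coarser or finer level of accuracy would then be a separate entry in `RULE_SETS`, not a change to `convert`.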
Figure 1. At the top, a
transcript written with the Swedish dialect alphabet; in the middle, the same
text automatically converted to IPA characters; at the bottom, the same text
automatically converted to Swedish 'phonetic spelling'.
(c) 2001 Benny Brodda and Lars
Törnqvist
Updated 2001-03-10