
Making application-neutral text databases for field linguists

This page is under development.


Making application-neutral text databases

This page presents procedures for producing application-neutral text databases using Shoebox (Toolbox) and LaTeX. The method aims to support field linguists who wish to publish text materials for academic and vernacular purposes, and to help them build text databases that are easy to modify and open to further development. This page may also be useful if you are interested in:

I have been applying the methodology presented on this page since 2000 in my ongoing project to provide vernacular materials to the Ata-speaking community in New Britain, Papua New Guinea. The method was also used to publish the following annotated texts of the Motuna language, spoken in Bougainville, Papua New Guinea.

Onishi, Masayuki, Dora Leslie and Therese Minitong Kemelfield. (2003) Motuna Texts (A1-006) (ENDANGERED LANGUAGES OF THE PACIFIC RIM. (A) (2) No.12039246. Grant-in-Aid for Scientific Research on Priority Areas, Ministry of Education, Culture, Sports, Science and Technology)

Orientation

Basic concept for building text databases for previously undescribed or less described languages
The basic concept of the methodology presented here is that a linguistic database for a previously undescribed or less described language should be easy to convert or modify for use on any platform. The best way to meet this expectation is to build text databases that are as simple as possible and to avoid dependence on the application-specific features of database programs. I regard the Linux environment as the best currently available system for this purpose, since the system, which is based on the robust Unix, leaves its source code open for development (see GNU Project and Free Software Foundation).

Shoebox (Toolbox) as a text editor
I regard Shoebox (Toolbox) as a useful text editor for field linguists to organise text material, but my method does not use the application's 'interlinearising' or its automatic generation of a lexicon, because I would like text databases to be independent of any particular application. Although I do not use those two useful features, I value the program's concepts of marker, record and filter for organising and presenting text databases, and for that reason I prefer Shoebox (Toolbox) to other text editors.

LaTeX
LaTeX is used to typeset records exported from the text database in Shoebox (Toolbox). The gb4e package is used to produce annotated texts.
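To give a rough idea of the target format, the following Python sketch renders one annotated line as a gb4e example. It is only an illustration under stated assumptions: the function name to_gb4e is hypothetical, the choice of the morpheme line and the gloss line as the two aligned lines is my assumption, and the author's actual export is done with cct files, not with Python. The gb4e commands used (the exe environment, \ex, \gll and \glt) are standard, and the output needs \usepackage{gb4e} in the preamble.

```python
def to_gb4e(morphemes: str, gloss: str, translation: str) -> str:
    """Render one interlinear example as a gb4e `exe` environment.

    Assumption: the inputs contain no LaTeX special characters
    (such as &, %, # or _); this sketch does no escaping.
    """
    lines = [
        r"\begin{exe}",
        r"\ex",
        rf"\gll {morphemes}\\",   # first aligned line (morpheme segmentation)
        rf"{gloss}\\",            # second aligned line (word-by-word gloss)
        rf"\glt `{translation}'", # free translation
        r"\end{exe}",
    ]
    return "\n".join(lines)


print(to_gb4e(
    "taatei manina la itema vile uala-sou lexe Tuko",
    "before true and man one name-3sg.GEN COMP name(m)",
    "Once upon a time, there is a man whose name is Tuko.",
))
```

gb4e aligns the two \gll lines word by word, so both lines should contain the same number of space-separated words, as they do in this example.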

Procedures
The procedures for organising a text database for publication are as follows:

Markers

Content and function markers
I use two types of markers, which for the purposes of this presentation I call 'content markers' and 'function markers', as described below:

In the following sections, I show with examples how content and function markers are placed in each record. As mentioned above, I follow the basic terminology used in Shoebox (Toolbox), such as marker, record and field. In the examples, content markers, function markers and fields are shown in green, blue and maroon (= red) respectively, and my comments are in black within round brackets. You can also refer to the descriptions of all markers in the tables at the bottom of this page.

Records

Organising text databases
One text database usually consists of more than one record. If, for example, a text can be divided into ten parts according to your criteria (intonation units, for instance), the database will comprise ten records. Each record begins with the content marker \ref, whose field holds a reference number. The initial record is numbered 000, the second 001, and so on. You can export as many records as you want from Shoebox (Toolbox) with a 'consistent change table' (cct file) and typeset them with the gb4e package in LaTeX.
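Because the database is plain text, the record structure described above can be read by any program. The following Python sketch splits a database into records at each \ref marker. It is an illustration only: the function name parse_records is hypothetical, and the handling of continuation lines (joining them to the previous field) is my assumption, not a description of how Shoebox (Toolbox) itself reads files.

```python
def parse_records(text):
    """Split standard-format text into records of (marker, field) pairs."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.startswith("\\"):
            # Continuation line: append it to the previous field, if any.
            if records:
                marker, field = records[-1][-1]
                records[-1][-1] = (marker, f"{field} {line}".strip())
            continue
        marker, _, field = line.partition(" ")
        if marker == r"\ref":
            records.append([])  # every \ref marker opens a new record
        if records:  # fields before the first \ref are skipped
            records[-1].append((marker, field.strip()))
    return records


sample = r"""
\ref 000
\if Obtained on 5/Mar/2000. location: Milikina.
\dt 25/Jun/2000

\ref 001
\vl taatei manina la itema vile ualasou lexe Tuko
\dt 26/Feb/2000
"""

records = parse_records(sample)
for record in records:
    # Print each reference number with the markers found in that record.
    print(record[0][1], [marker for marker, _ in record])
```

Keeping each record as an ordered list of pairs, not as a dictionary, preserves repeated markers such as a second \vl line in a record with additional material.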

Structure of initial records

Initial records need the function markers \ltxea (LaTeX entire A) and \ltxeb (LaTeX entire B), as shown below. I usually put background information on the text material in the initial record, and the annotated text begins from the second record.
\ref 000
\ltxea (This marker (LaTeX entire A) adds the preamble, which you need to write in a cct file, when exporting entire records. The marker is only necessary in initial records.)
\ltxeb (This marker (LaTeX entire B) adds command lines such as '\begin{document}' when exporting entire records. The marker is only necessary in initial records.)
\if Obtained on 5/Mar/2000. location: Milikina.
\dt 25/Jun/2000

Structure of records consisting of main texts

Records which contain text material always need the function markers \ltxca (LaTeX current A) and \ltxcb (LaTeX current B) so that current records can be exported and typeset with LaTeX. The following is the typical structure of a record which consists only of main text.
\ref 001
\ltxca (This marker will produce the preamble, which you need to write in a cct file, when exporting current records.)
\vl taatei manina la itema vile ualasou lexe Tuko
\mp taatei manina la itema vile uala-sou lexe Tuko
\gs before true and man one name-3sg.GEN COMP name(m)
\ft Once upon a time, there is a man whose name is Tuko.
\pt bipo turu na wanpela man neim bilong en olsem Tuko.
\gm non-prefixed predicate.
\dc New participant = Tuko.
\sm Tuko = male name.
\dt 26/Feb/2000
\ltxcb (This marker will produce command lines which are necessary to typeset current records.)

Structure of records consisting of main texts and additional materials

When you transcribe texts with your informants, you often elicit examples to learn more about what is going on in the texts. You may want to put this material in the relevant places in your text database. The example below shows how to do this. The fields of the markers shown in yellow are additional material obtained by elicitation. Note that the function marker \gmb (grammar B) should follow the content marker \gm when you put additional material after the main text, as shown below.


\ref 007
\ltxca
\vl io anu muxolu la mupuipuiiu tavi
\mp io anu mu-xolu la mu-pui.pui-o+iu tavi
\gs but 3sgm 3sgm.PERF-stay and 3sgmA.PERF-RED.throw-3sgmO+3sgm.NFBEN spear
\ft and then he stayed and was threwing spear for himself
\pt na em stap na isapim ol spia em sapim supia bilong em
\gm NFBEN:
\gmb
\elt
\vl mupuixeni tavi
\mp mu-pui-o+xeni tavi
\gs 3sgmA.PERF-throw-3sgmO-1sg.NFBEN spear
\ft He threw the spear for me
\pt em sapim bilong mi spia
\gm NFBEN: FBEN suf inflects as: -meni, -iu, -ie, -menge, -mengi, -iqa
\dt 24/Feb/2000
\ltxcb
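The role of \gmb as a boundary between main text and additional material can be sketched in Python as follows. This is an illustration under my own assumptions: the function name split_at_gmb is hypothetical, a record is assumed to be a list of (marker, field) pairs, and treating \dt, \ltxcb and \ltxec as closing markers that belong to the record as a whole is my reading of the example above. The record used is an abbreviated copy of record 007.

```python
# Markers assumed to close the record as a whole (an assumption of this sketch).
CLOSING = {r"\dt", r"\ltxcb", r"\ltxec"}


def split_at_gmb(fields):
    """Return (main, additional, closing) lists of (marker, field) pairs."""
    markers = [marker for marker, _ in fields]
    if r"\gmb" not in markers:
        # No additional material: the whole record counts as main text.
        return list(fields), [], []
    cut = markers.index(r"\gmb")
    rest = fields[cut + 1:]
    additional = [pair for pair in rest if pair[0] not in CLOSING]
    closing = [pair for pair in rest if pair[0] in CLOSING]
    return fields[:cut], additional, closing


record_007 = [
    (r"\ref", "007"),
    (r"\ltxca", ""),
    (r"\vl", "io anu muxolu la mupuipuiiu tavi"),
    (r"\gm", "NFBEN:"),
    (r"\gmb", ""),
    (r"\elt", ""),
    (r"\vl", "mupuixeni tavi"),
    (r"\ft", "He threw the spear for me"),
    (r"\dt", "24/Feb/2000"),
    (r"\ltxcb", ""),
]

main, additional, closing = split_at_gmb(record_007)
print(len(main), len(additional), len(closing))
```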


Structure of final records

Final records need the function marker \ltxec (LaTeX entire C), which should be placed at the bottom of the record as shown below.

\ref 038 end
\ltxca
\vl anu anesi mo mukalu
\mp anu ane+si mo mu-kalu
\gs 3sgm DIST+only and then 3sgm.S.PERF-finish
\ft That is the story and it finished.
\gm
\dt 26/Feb/2000
\ltxcb
\ltxec (This marker will produce command lines such as \end{document} when exporting entire records. The marker is only necessary in final records.)
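The function-marker requirements for initial, text and final records can be checked mechanically before export. The Python sketch below is one possible check, not part of the author's workflow: the function name check_function_markers is hypothetical, each record is given simply as its sequence of marker names, and I assume that the first record is the initial record (background information only) and that the last record is the final record.

```python
def check_function_markers(records):
    """Return a list of problems found in the function markers.

    `records` is a list of records, each given as a list of marker names.
    """
    problems = []
    for i, markers in enumerate(records):
        is_initial = i == 0
        is_final = i == len(records) - 1 and not is_initial
        # Initial records need \ltxea and \ltxeb; text records need \ltxca and \ltxcb.
        required = [r"\ltxea", r"\ltxeb"] if is_initial else [r"\ltxca", r"\ltxcb"]
        for marker in required:
            if marker not in markers:
                problems.append(f"record {i}: missing {marker}")
        # The final record must end with \ltxec.
        if is_final and (not markers or markers[-1] != r"\ltxec"):
            problems.append(f"record {i}: \\ltxec must be the last marker")
    return problems


database = [
    [r"\ref", r"\ltxea", r"\ltxeb", r"\if", r"\dt"],
    [r"\ref", r"\ltxca", r"\vl", r"\mp", r"\gs", r"\ft", r"\dt", r"\ltxcb"],
    [r"\ref", r"\ltxca", r"\vl", r"\gm", r"\dt", r"\ltxcb", r"\ltxec"],
]
print(check_function_markers(database))
```

An empty list means that every record carries the markers its position requires.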

Shoebox (Toolbox) markers

Shoebox (Toolbox) content markers
Marker  Field name               Following marker  Description
\ref    reference                \ltxca            Reference number
\if     information              \dt               This marker can be put in initial records to provide background information on texts.
\vl     vernacular               \mp               vernacular lines
\mp     morpheme                 \gs               morpheme segmentation lines
\gs     gloss                    \ft               gloss lines
\ft     free translation         \pt               free translation lines
\pt     Tok Pisin translation    \gm               national language translation lines
\gm     grammar                  \dt or \gmb       grammar notes
\sm     semantics                -                 semantic notes
\dc     discourse                -                 discourse notes
\ph     phonetics and phonology  -                 phonetics and phonology notes
\psa    phrase structures        \psb              phrase structures
\dt     date                     \ltxcb            date stamp

Shoebox (Toolbox) function markers
Marker  Field name          Placement           Description
\ltxea  LaTeX entire A      \ltxeb              adding preamble to exported entire records. This marker is only necessary in initial records.
\ltxeb  LaTeX entire B      \if                 adding LaTeX command lines such as \begin{document} to exported entire records. This marker is only necessary in initial records.
\ltxec  LaTeX entire C      none                adding command lines such as \end{document} to exported entire records. This marker should be placed at the end of final records.
\ltxca  LaTeX current A     \vl                 adding preamble to exported current records
\ltxcb  LaTeX current B     none                adding LaTeX command lines such as \end{document} to the exported current records
\gmb    grammar b           \dt or \al or \elt  -
\psb    phrase structure b  \psa                adding LaTeX command lines for phrase structures.
\al     analysis            \gmb                You can use this marker to add a section heading 'analysis' when you add materials such as alternative analyses etc. after main texts.
\elt    elicitation         \gmb                You can use this marker to add a section heading 'elicitation' when you add materials such as clauses from elicitation etc. after main texts.

cct files

Exporting files from Shoebox (Toolbox) to LaTeX (gb4e format)

