Phonetic Corpus of Alzahra University: An Introduction |
Paper code: 1234-ICIL2024-FULL |
Authors |
Mandana Nourbakhsh *1, Nafiseh Tadayyon Chahartagh2, Neda Haghighi Saber3 1Alzahra University 2Department of Linguistics, Faculty of Literature, Alzahra University, Tehran, Iran 3Alzahra University (Tehran) |
Abstract |
A spoken corpus is an audio collection that captures the speech characteristics of a language; building one involves recording, transcription, encoding, data management, and analysis (Adolphs & Knight, 2010, p. 40). Persian has several spoken corpora, including FARSDAT (Bijankhan, Sheikhzadegan, & Roohani, 1994), the Sahaand Accented Speech (SAS) database (Pilehvar & Sedaaghi, 2008), the Sharif Farsi Audio Visual Database (SFAVD) (Naraghi & Jamzad, 2013), and the Hamedan-Bamberg Corpus of Contemporary Spoken Persian (HamBam) (Haig & Rasekh-Mahand, 2022). The Phonetic Corpus of Alzahra University (PCAU) incorporates features that make it unique among spoken corpora for Persian, e.g., layered labeling and segmentation and the suitability of the data for acoustic analysis, among other purposes. PCAU comprises three types of data: text reading, single-sentence reading, and narration. Every audio file is accompanied by a TextGrid labeled manually in Praat© with 3 to 9 layers, including phone, phoneme, syllable, word, sentence, emotion, prosody, information, and Persian transcription. The information layer details the speaker(s)' language, dialect, accent, style, gender, age, education, height, weight, recording location, data type, repetition, speed, and purpose. All audio files are in .WAV format, which is widely supported by speech analysis software. Two phoneticians reviewed all labeling. Funding for this project (grant no. 2539/3/953d) was provided by the Vice-chancellor of Research and Technology at Alzahra University. |
Keywords |
Phonetic corpus, PCAU, speech corpus, spoken corpus, Persian |
Status: Accepted for oral presentation |