The 11th International Conference on Iranian Linguistics

Phonetic Corpus of Alzahra University: An Introduction

Paper code: 1234-ICIL2024-FULL

Authors

ماندانا نوربخش *¹، نفیسه تدین چهارطاق²، ندا حقیقی صابر³

¹Alzahra University

²Department of Linguistics, Faculty of Literature, Alzahra University, Tehran, Iran

³Alzahra University (Tehran)

Abstract

A spoken corpus is an audio collection that captures the speech characteristics of a language; building one involves recording, transcription, encoding, data management, and analysis (Adolphs & Knight, 2010, p. 40). Persian has several spoken corpora, including FARSDAT (Bijankhan, Sheikhzadegan, & Roohani, 1994), the Sahaand Accented Speech (SAS) database (Pilehvar & Sedaaghi, 2008), the Sharif Farsi Audio Visual Database (SFAVD) (Naraghi & Jamzad, 2013), and the Hamedan-Bamberg Corpus of Contemporary Spoken Persian (HamBam) (Haig & Rasekh-Mahand, 2022). The Phonetic Corpus of Alzahra University (PCAU) incorporates features that make it unique among the spoken corpora for Persian, e.g., layered labeling and segmentation and the suitability of the data for acoustic analysis, among other purposes. PCAU comprises three types of data: text reading, single-sentence reading, and narration. Every audio file is accompanied by a TextGrid labeled manually in Praat© in 3 to 9 layers, drawn from phone, phoneme, syllable, word, sentence, emotion, prosody, information, and Persian transcription. The information layer details the speaker(s)’ language, dialect, accent, style, gender, age, education, height, weight, recording location, data type, repetition, speed, and purpose. All audio files are in .WAV format, which is widely supported by speech analysis software. Two phoneticians reviewed all labeling. Funding for this project (grant no. 2539/3/953d) was provided by the Vice-chancellor of Research and Technology at Alzahra University.
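As an illustration of the layered labeling described above, the following minimal Python sketch lists the tiers in a Praat long-format TextGrid and counts the labeled units in each. It is not part of PCAU or its tooling: the sample file content, the tier names, the labels, and the function name `tier_summary` are all hypothetical, and the sketch assumes the standard long (non-abbreviated) TextGrid text format.

```python
import re

# Hypothetical two-tier TextGrid in Praat's long text format.
# Tier names and labels are invented for illustration only.
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.4
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0
        xmax = 0.4
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.1
            text = "m"
        intervals [2]:
            xmin = 0.1
            xmax = 0.25
            text = "a"
        intervals [3]:
            xmin = 0.25
            xmax = 0.4
            text = "n"
    item [2]:
        class = "IntervalTier"
        name = "word"
        xmin = 0
        xmax = 0.4
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 0.4
            text = "man"
'''


def tier_summary(textgrid_text):
    """Return a list of (tier name, number of intervals or points)."""
    # Split at each numbered 'item [n]:' header; the unnumbered 'item []:'
    # line does not match, and the file header before the first tier is dropped.
    blocks = re.split(r'item \[\d+\]:', textgrid_text)[1:]
    tiers = []
    for block in blocks:
        # The first 'name = "..."' in a block is the tier's own name.
        name = re.search(r'name = "([^"]*)"', block).group(1)
        # Interval tiers declare 'intervals: size = N', point tiers 'points: size = N'.
        size = re.search(r'(?:intervals|points): size = (\d+)', block).group(1)
        tiers.append((name, int(size)))
    return tiers


if __name__ == "__main__":
    for name, count in tier_summary(SAMPLE_TEXTGRID):
        print(f"{name}: {count}")
```

A regex-based reader like this is only suitable for a quick inventory of layers; a dedicated TextGrid parser would be the safer choice for labels that contain quotation marks or text resembling the format's own keywords.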

Keywords

Phonetic corpus, PCAU, speech corpus, spoken corpus, Persian

Status: Accepted for oral presentation