• Title/Summary/Keyword: cross-speaker

Masked cross self-attentive encoding based speaker embedding for speaker verification (화자 검증을 위한 마스킹된 교차 자기주의 인코딩 기반 화자 임베딩)

  • Seo, Soonshin;Kim, Ji-Hwan
    • The Journal of the Acoustical Society of Korea / v.39 no.5 / pp.497-504 / 2020
  • Constructing speaker embeddings for speaker verification is an important issue. In general, a self-attention mechanism has been applied for speaker embedding encoding. Previous studies focused on training the self-attention in a high-level layer, such as the last pooling layer; in this case, the effect of low-level layers is not well represented in the speaker embedding. In this study, we propose Masked Cross Self-Attentive Encoding (MCSAE) using ResNet, which trains the features of both high-level and low-level layers. Based on multi-layer aggregation, the output features of each residual layer are used for the MCSAE. In the MCSAE, the interdependence of the input features is trained by a cross self-attention module, and a random masking regularization module is applied to prevent overfitting. The MCSAE enhances the weight of frames representing speaker information, and the output features are then concatenated and encoded into the speaker embedding, yielding a more informative embedding. The experimental results showed an equal error rate of 2.63 % on the VoxCeleb1 evaluation dataset, improving on previous self-attentive encoding and state-of-the-art methods.
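
The abstract describes self-attentive encoding of frame-level features with random masking only at a high level; the following is a minimal NumPy sketch of that general idea (attention-weighted pooling over frames plus random masking of frame scores), not the paper's MCSAE architecture. The layer count, dimensions, and masking rate are illustrative assumptions.

```python
# Minimal sketch of self-attentive pooling with random masking regularization.
# NOT the paper's MCSAE: projection shapes, masking rate, and the two-layer
# aggregation below are illustrative assumptions only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_pool(H, W, v, mask_rate=0.1, train=True, rng=None):
    """H: (T, D) frame-level features from one layer; W: (D, Da), v: (Da,)."""
    rng = rng or np.random.default_rng(0)
    scores = np.tanh(H @ W) @ v                    # per-frame attention scores
    if train:                                      # random masking (assumed form)
        drop = rng.random(scores.shape) < mask_rate
        scores = np.where(drop, -np.inf, scores)   # masked frames get zero weight
    alpha = softmax(scores)                        # attention weights over frames
    return alpha @ H                               # weighted-sum embedding, shape (D,)

# Toy multi-layer aggregation: pool two layers' features and concatenate.
T, D, Da = 200, 64, 32
rng = np.random.default_rng(1)
layers = [rng.standard_normal((T, D)) for _ in range(2)]
W, v = 0.1 * rng.standard_normal((D, Da)), 0.1 * rng.standard_normal(Da)
embedding = np.concatenate([self_attentive_pool(H, W, v, rng=rng) for H in layers])
print(embedding.shape)   # (128,)
```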

Combination of Classifiers Decisions for Multilingual Speaker Identification

  • Nagaraja, B.G.;Jayanna, H.S.
    • Journal of Information Processing Systems / v.13 no.4 / pp.928-940 / 2017
  • State-of-the-art speaker recognition systems may work well for the English language; however, if the same system is used to recognize speakers of different languages, it may yield poor performance. In this work, the decisions of a Gaussian mixture model-universal background model (GMM-UBM) and a learning vector quantization (LVQ) classifier are combined to improve the recognition performance of a multilingual speaker identification system. The difference between these classifiers lies in their modeling techniques: the former is based on a probabilistic approach and the latter on the fine-tuning of neurons. Since the approaches differ, each modeling technique identifies different sets of speakers for the same database, so the decisions of the classifiers can be combined to improve performance. In this study, multitaper mel-frequency cepstral coefficients (MFCCs) are used as features, and monolingual and cross-lingual speaker identification experiments are conducted on NIST-2003 and our own database. The experimental results show that the combined system improves performance by nearly 10% compared with the individual classifiers.
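
The abstract does not specify the combination rule, so the sketch below only illustrates one generic way to fuse the outputs of two speaker-identification classifiers (z-normalize each classifier's per-speaker scores, sum, and pick the best-scoring speaker); the score values and speaker labels are hypothetical.

```python
# Generic score-level fusion of two speaker-identification classifiers.
# The fusion rule (z-normalize, then sum) is a common choice, not necessarily
# the combination used in the paper; the scores below are hypothetical.
import numpy as np

def znorm(scores):
    """Normalize per-speaker scores so the two classifiers become comparable."""
    return (scores - scores.mean()) / (scores.std() + 1e-9)

def fuse_and_identify(gmm_ubm_scores, lvq_scores, speakers):
    """Each score array is 1-D, one value per enrolled speaker (higher = better)."""
    fused = znorm(gmm_ubm_scores) + znorm(lvq_scores)
    return speakers[int(np.argmax(fused))]

speakers = np.array(["spk01", "spk02", "spk03"])
gmm_ubm = np.array([-41.2, -39.8, -40.5])   # hypothetical log-likelihoods
lvq     = np.array([ 0.62,  0.55,  0.71])   # hypothetical similarity scores
print(fuse_and_identify(gmm_ubm, lvq, speakers))
```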

A method of cross-talk cancellation for sound reproduction of a 5.1 channel speaker system (5.1 채널 스피커 시스템 음향재생을 위한 크로스토크 제거방법)

  • Lee, Soo-Jeong;Cho, Gab-Ken;Kim, Soon-Hyob
    • Journal of the Institute of Electronics Engineers of Korea SP / v.42 no.4 s.304 / pp.159-166 / 2005
  • This thesis deals with a method to deliver more realistic sound by cancelling the cross-talk that is inherent to the 5.1 channel speaker system. First, the cross-talk cancellation method that eliminates cross-talk on the paths from the left speaker to the right ear and from the right speaker to the left ear is explained. Then the application and playback method using this cross-talk cancellation is introduced. The acoustical model for cross-talk cancellation is the free-field model, which minimizes distortion of the sound and has been widely studied. I used Bark-scale sound quality compensation based on psychoacoustics, and for the surround channels, band-limited sound quality compensation is performed in the frequency domain.
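
As a rough illustration of the underlying principle (not the filters designed in the paper): in a two-speaker, two-ear free-field model, the 2x2 matrix of acoustic transfer functions is inverted per frequency bin so that each ear receives only its intended channel. The sketch below uses placeholder transfer functions and an assumed Tikhonov regularization term.

```python
# Generic frequency-domain sketch of 2x2 cross-talk cancellation.
# H[f] maps the two speaker signals to the two ear signals; the canceller
# C[f] is a regularized inverse so that H[f] @ C[f] is approximately the
# identity. The transfer matrices below are placeholders, not measured data.
import numpy as np

def crosstalk_canceller(H, beta=1e-4):
    """H: (F, 2, 2) complex transfer matrices per frequency bin.
    Returns C: (F, 2, 2) regularized inverse filters."""
    Hh = np.conj(np.swapaxes(H, 1, 2))              # Hermitian transpose per bin
    I = np.eye(2)
    # Tikhonov-regularized inversion, a common choice to limit filter gain
    return np.linalg.solve(Hh @ H + beta * I, Hh)

F = 512
rng = np.random.default_rng(0)
H = rng.standard_normal((F, 2, 2)) + 1j * rng.standard_normal((F, 2, 2))
C = crosstalk_canceller(H)
# Check: H @ C should be approximately the identity in each bin (exactly as beta -> 0)
err = np.abs(H @ C - np.eye(2)).max()
print(f"max deviation from identity: {err:.3e}")
```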

A Speaker Detection System based on Stereo Vision and Audio (스테레오 시청각 기반의 화자 검출 시스템)

  • An, Jun-Ho;Hong, Kwang-Seok
    • Journal of Internet Computing and Services / v.11 no.6 / pp.21-29 / 2010
  • In this paper, we propose a system that detects which of several users is currently speaking. The proposed speaker detection system based on stereo vision and audio mainly consists of the following: position estimation of speaker candidates using a stereo camera and microphones, detection of the current speaker, and acquisition of speaker information on a mobile device. We use Haar-like features and the AdaBoost algorithm to detect the faces of speaker candidates with the stereo camera, and the positions of the candidates are estimated by a triangulation method. Next, the Time Delay Of Arrival (TDOA) is estimated by Cross Power Spectrum Phase (CPSP) analysis to find the direction of the source with two microphones. Finally, we acquire the speaker's information, including position, voice, and face, by comparing the information from the stereo camera with that from the two microphones. Furthermore, the proposed system includes a TCP client/server connection method for the mobile service.
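
The Cross Power Spectrum Phase analysis used for TDOA estimation here is, in its generic form, the GCC-PHAT method; the sketch below assumes a two-channel input and a 16 kHz sample rate and is not the authors' implementation.

```python
# Minimal GCC-PHAT (cross power spectrum phase) sketch for estimating the
# time delay of arrival (TDOA) between two microphone signals.
# Sample rate and test signals are illustrative assumptions.
import numpy as np

def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Return the estimated delay (seconds) of y relative to x."""
    n = 2 * max(len(x), len(y))                    # zero-pad to avoid circular wrap-around
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = np.conj(X) * Y                             # cross power spectrum
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)  # PHAT weighting keeps only the phase
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy usage: y is x delayed by 20 samples.
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = np.concatenate((np.zeros(20), x[:-20]))
print(gcc_phat_tdoa(x, y, fs) * fs)   # approximately 20 samples of delay
```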

Cross-speaker anaphora in dynamic semantics

  • Yeom, Jae-Il
    • Language and Information / v.14 no.2 / pp.103-129 / 2010
  • In this paper, I show that anaphora across speakers has both dynamic and static sides. To capture them all formally, I adopt a semantics based on the assumption that variables range over individual concepts that connect epistemic alternatives. As information increases, a variable can take a different range of possible individual concepts. This is captured by the notion of a virtual individual (= vi), a set of individual concepts which are indistinguishable in an information state. The use of a pronoun involves two information states, one for the antecedent, which is always part of the common ground, and the other for the pronoun. Information increase changes the vis for variables in the common ground. A pronoun can be used felicitously if there is a unique virtual individual in the information state for the antecedent which does not split into two or more distinct virtual individuals in the information state for the pronoun. The felicity condition for cross-speaker anaphora can be satisfied in declaratives involving modality, in interrogatives, and in imperatives in a rather less demanding way, because in these cases the utterance does not necessarily require non-trivial personal information for proper use of a pronoun.

A Study of the Giving and Receiving Verbs in TOUSEISYOUSEIKATAGI (『当世書生気質』에 나타난 수수동사에 관한 고찰 - 'やる·あげる·さしあげる'와 'くれる·くださる'를 중심으로)

  • Yang, Jung Soon
    • Cross-Cultural Studies / v.19 / pp.271-293 / 2010
  • Japanese giving and receiving verbs are divided into "YARU", "MORAU", and "KURERU", and their use is influenced by the subject, the speaker's viewpoint, and meaning. The three verbs are used differently depending on who is the giver and who is the receiver. I analyze the "YARU" and "KURERU" verbs used in TOUSEISYOUSEIKATAGI, focusing on politeness, gender, and the meanings that arise when the verbs are combined with 'TE'. As an expression of politeness, 'Yaru' is used to give to a person of lower social status or to an animal or plant, while 'Ageru' is nowadays used to give to an equal or to a person of lower social status. In TOUSEISYOUSEIKATAGI, however, 'Ageru', treated as an elegant form, remained an expression of respect, and 'Yaru' is used when the receiver is of lower or equal social status. 'Kureru' is used when the receiver is of lower or equal social status, and 'kudasaru' is used when a person of higher social status gives the speaker something. Women speakers use 'oyarinasai', 'oyariyo', 'ageru', and 'okureru', while men speakers use 'yaru' and 'kureru'; speech patterns peculiar to men are 'kuretamae' and 'kurenka'. When the verbs are joined to "TE", they acquire abstract meanings as well as denoting the movement of things, expressing a modality for the action of the preceding verb. This modality carries the following meanings: good will, goodness, benefit, kindness, hope, expectation, disadvantage, injury, ill will, and sarcasm. In addition, 'TE YARU' expresses the speaker's strong will, and 'TE KURERU' expresses the speaker's request.

The Expression of Ending Sentence in Family Conversations in the Virtual Language - Focusing on Politeness and Sentence-final Particle with Instructional Media - (가상세계 속에 보인 일본어의 가족 간의 문말 표현에 대해 - 교수매체로서의 문말의 정중체와 종조사 사용에 대해)

  • Yang, Jung-Soon
    • Cross-Cultural Studies / v.39 / pp.433-460 / 2015
  • This paper analyzes politeness and sentence-final expressions in family conversations in the virtual language of cartoon characters. In the historical genre, younger speakers tend to attach sentence-final particles to the polite form, while older speakers tend to attach them to the plain form; in other fiction genres, both younger and older speakers attach sentence-final particles to the plain form. The use of terms of respect is determined by circumstances and charactonym. Comparing the translated conversations with the originals, the translated works show differing treatments. When such material is used as instructional media for studying Japanese, Japanese instructors give supplementary explanations to students. In the virtual language of cartoons, 'WA' and 'KASIRA', which female speakers usually use, are also used by male speakers, and 'ZO' and 'ZE', which male speakers usually use, are also used by female speakers. In the translations, 'KANA' and 'KASIRA' are rendered as 'KA?', 'WA', 'ZO', and 'ZE' as 'A(EO)?', and 'WAYO' and 'ZEYO' as 'AYO(EOYO)'. When sentence-final particles from the virtual language of cartoons are used in teaching, supplementary explanations and further examination are needed.

Changes in Features of Korean Vowels with Age and Sex of Speakers and Their Recognition (한국어 단모음의 성별, 연령별 특징변화 및 인식)

  • 이용주;김경태;차균현
    • Journal of the Korean Institute of Telematics and Electronics / v.25 no.12 / pp.1503-1512 / 1988
  • As a basic analysis for handling within- and cross-speaker variability in phoneme-based speech recognition, changes in the pitch and formant frequencies of 8 Korean vowels with the age and sex of the speaker have been investigated by analyzing a large number of samples. The conclusions obtained are as follows: 1) Changes in pitch frequency with age and sex are hard to distinguish for children, and the difference before and after the voice change is approximately 0.2 octaves for females and 0.9 octaves for males. 2) While most vowel formants change considerably with the speaker's age, the change becomes smaller as the speaker gets older. 3) While there is an indirect correlation between pitch and formants as age changes, a direct correlation is hard to observe. 4) When the recognition experiment using pitch and formants covers speakers of various ages and sexes, pitch also works as an efficient recognition parameter.
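
Formant measurements of the kind analyzed here are commonly obtained from LPC pole angles; the sketch below is a generic illustration on a synthetic vowel-like signal. The LPC order is matched to the toy signal (real vowel analysis typically uses a higher order), and none of the settings are the paper's.

```python
# Minimal LPC-based formant-estimation sketch of the kind used to measure
# vowel formant frequencies. The autocorrelation method and parameters are
# generic illustrations, not the analysis settings used in the paper.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_formants(frame, fs, order=4):
    """Estimate formant frequencies (Hz) of one vowel frame from LPC pole angles."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])   # Yule-Walker / autocorrelation LPC
    roots = np.roots(np.concatenate(([1.0], -a)))   # A(z) = 1 - a1 z^-1 - ... - ap z^-p
    roots = roots[np.imag(roots) > 0]               # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs[freqs > 90])               # drop near-DC roots

# Toy usage: synthesize a vowel-like signal with resonances near 700 Hz and 1200 Hz.
fs = 8000
rng = np.random.default_rng(0)
sig = rng.standard_normal(int(0.2 * fs))            # white-noise excitation
for f0, bw in [(700, 80), (1200, 100)]:             # two damped resonators
    r_pole = np.exp(-np.pi * bw / fs)
    sig = lfilter([1.0], [1, -2 * r_pole * np.cos(2 * np.pi * f0 / fs), r_pole ** 2], sig)
print(lpc_formants(sig, fs))                         # roughly [700, 1200]
```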

A Study on the Transaural Filter Implementation for 5.1 Channel Speaker System (5.1채널 스피커 시스템에서 트랜스오럴 필터 구현에 관한 연구)

  • 최갑근;방승범;김순협;정완섭
    • The Journal of the Acoustical Society of Korea / v.21 no.3 / pp.245-255 / 2002
  • This thesis deals with a method to deliver more realistic sound by cancelling the cross-talk that is inherent to the 5.1 channel speaker system. The acoustical model for cross-talk cancellation is the free-field model, which minimizes distortion of the sound. I used Bark-scale sound quality compensation based on psychoacoustics, and for the surround channels, band-limited sound quality compensation is performed in the frequency domain. I also performed a sound quality assessment test on the traditional 2-channel stereo and the 5.1 channel system, in a test chamber that satisfies the ITU-R specifications. The IACC (Inter-Aural Cross-Correlation) is used to determine the preferences of amateur listeners and golden-ear experts in assessing the transaural filter. With the proposed method, I obtained a separation of more than 38 dB with the Dolby standard speaker array, and the subjective test on diffuseness with the experts showed a 0.4-point increase over before.
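
The IACC used in the listening test is conventionally defined as the maximum of the normalized cross-correlation between the two ear signals over lags of about +/-1 ms; below is a minimal sketch of that textbook definition on synthetic signals, not the paper's measurement chain.

```python
# Minimal sketch of the IACC (Inter-Aural Cross-Correlation) as an objective
# cue for spatial impression: the maximum of the normalized cross-correlation
# between left- and right-ear signals over lags of +/- 1 ms.
# The signals below are synthetic placeholders, not binaural measurements.
import numpy as np

def iacc(left, right, fs, max_lag_ms=1.0):
    max_lag = int(fs * max_lag_ms / 1000)
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + 1e-12
    corr = [np.sum(left[max(0, -k):len(left) - max(0, k)] *
                   right[max(0, k):len(right) - max(0, -k)]) / norm
            for k in range(-max_lag, max_lag + 1)]
    return max(np.abs(corr))

fs = 48000
rng = np.random.default_rng(0)
diffuse = rng.standard_normal(fs)                    # uncorrelated ear signal
common = rng.standard_normal(fs)
print(f"identical ears:    {iacc(common, common, fs):.2f}")   # close to 1.0
print(f"uncorrelated ears: {iacc(diffuse, common, fs):.2f}")  # much lower
```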

Investigation on Vibration Characteristics of Micro Speaker Diaphragms for Various Shape Designs (마이크로 스피커 진동판의 형상설계에 따른 진동특성 고찰)

  • Kim, Kyeong Min;Kim, Seong Keol;Park, Keun
    • Journal of the Korean Society for Precision Engineering / v.30 no.8 / pp.790-796 / 2013
  • Micro-speaker diaphragms play an important role in generating a desired audio response. The diaphragm is generally a circular membrane, and the cross section is a double dome, with an inner dome and an outer dome. To improve the sound quality of the speaker, a number of corrugations may be included in the outer dome region. In this study, the role of these corrugations is investigated using two kinds of finite element method (FEM) calculations. Structural FEM modeling was carried out to investigate the change in stiffness of the diaphragm when the corrugations were included. Modal FEM modeling was then carried out to compare the natural frequencies and the resulting vibrational modes of the plain and corrugated diaphragms. The effects of the corrugations on the vibration characteristics of the diaphragm are discussed.
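
As a back-of-the-envelope companion to the modal FEM results, the classical clamped circular plate formula shows how a geometry-induced change in effective bending stiffness shifts natural frequencies. The material values, dimensions, and eigenvalue constant below are generic textbook numbers, not properties of the actual diaphragm or the paper's FEM model.

```python
# Rough estimate of how a change in bending stiffness shifts a diaphragm's
# natural frequency, using the clamped circular plate formula
#   f = (lambda^2 / (2*pi*a^2)) * sqrt(D / (rho*h)),  D = E*h^3 / (12*(1-nu^2)).
# The fundamental eigenvalue lambda^2 ~= 10.22 and the film properties are
# textbook values assumed for illustration only.
import math

def clamped_plate_f1(a, h, E, rho, nu, lam2=10.22):
    D = E * h**3 / (12 * (1 - nu**2))          # bending stiffness [N*m]
    return lam2 / (2 * math.pi * a**2) * math.sqrt(D / (rho * h))

a, h = 5e-3, 20e-6                              # 10 mm diameter, 20 um thickness (assumed)
E, rho, nu = 4.0e9, 1350.0, 0.37                # generic PET-like film properties (assumed)
f1 = clamped_plate_f1(a, h, E, rho, nu)
print(f"fundamental mode ~ {f1:.0f} Hz")
# A stiffness change from geometry (e.g. corrugations) scales f like sqrt(D):
print(f"with 2x effective stiffness ~ {f1 * math.sqrt(2):.0f} Hz")
```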