|
Statement-based Semantic Annotation of Media Resources Paper (pdf) W. Weiss, W. Halb (JRS) T. Bürger (LFUI), R. Villa, P. Swamy (UG) 4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria) Currently the media production domain lacks efficient ways to organize and search for media assets. Ontology-based applications have been identified as a viable solution to this problem, but they are sometimes too complex for non-experienced users. We present a fast and easy-to-use approach to create semantic annotations and relationships of media resources. The approach is implemented in the SALERO Intelligent Media Annotation & Search system. It combines the simplicity of free text tagging with the power of semantic technologies, thereby striking a compromise on the complexity of full semantic annotations. We present the implementation of the approach in the system and an evaluation of different user interface techniques for creating annotations. |
|
CorpVis: An Online Emotional Speech Corpora Visualisation Interface Paper (pdf) C. Cullen, B. Vaughan, J. McAuley, E. McArthy (DIT) 4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria) Our research in emotional speech analysis has led to the construction of several dedicated high quality, online corpora of natural emotional speech assets. The requirements for querying, retrieval and organization of assets based on both their metadata descriptors and their analysis data led to the construction of a suitable interface for data visualization and corpus management. The CorpVis interface is intended to assist collaborative work between several speech research groups working with us in this area, allowing online collaboration and distribution of assets to be performed. This paper details the current CorpVis interface into our corpora, and the work performed to achieve this. |
|
Multimedia Ontology Life Cycle Management with the SALERO Semantic Workbench Paper (pdf) T. Bürger (LFUI) Workshop on Semantic Multimedia Database Technologies (SeMuDaTe2009) at 4th International Conference on Semantic and Digital Media Technologies (SAMT 2009) (December 2009, Graz, Austria) Ontologies are gaining increased importance in the area of multimedia retrieval or management as they try to overcome the commonly known drawbacks of existing multimedia metadata standards for the descriptions of the semantics of multimedia content. In order to build and use ontologies, users have to receive appropriate support. This paper presents the SALERO Semantic Workbench, which offers a set of services to engineer and manage ontologies throughout their life cycle, i.e., from their (semi-) automatic creation through their storage and use in annotation and search. |
|
|
Vocate: Auditory Interfaces for the LOK8 Project Paper (pdf) J. McGee, C. Cullen (DIT) 9th Annual Conference on Information Technology and Telecommunication (October 2009, Dublin, Ireland) The auditory modality has a number of unique advantages over other modalities, such as a fast neural processing rate and focus-independence. As part of the LOK8 project’s aim to develop location-based services, the Vocate module will be seeking to exploit these advantages to augment the overall usability of the LOK8 interface and also to deliver scalable content in scenarios where the user may be in transit or requires focus-independence. This paper discusses these advantages and outlines three possible approaches that the Vocate module may take within the LOK8 project: speech interfaces, auditory user interfaces, and sonification. |
|
|
Concept, Content and the Convict Paper (pdf) M. Tuomola, T. Korpilahti, J. Pesonen, A. Singh (TAIK), R. Villa, P. Swamy, Y. Feng, J. Jose (UG) ACM International Conference on Multimedia (October 2009, Beijing, China) This paper describes the concepts behind and implementation of the multimedia art work Alan01 / AlanOnline, which wakes up Alan Turing, criminally convicted in 1952, as a piece of code within the art work, thus fulfilling Turing's own vision of preserving human consciousness in a computer. The work's context is described within the development of associative storytelling structures built up by interactive user feedback via an image and video retrieval system. The input to the retrieval system is generated by Alan01 / AlanOnline via their respective sketch interfaces, the output of the retrieval system being fed back to Alan01 / AlanOnline for further processing and presentation to the user within the context of the overall artistic experience. In addition to presenting the productions and image retrieval system, this paper also presents the user reception of the installation and online production and some of the issues and observations made during the development of the systems. |
|
|
SALERO Intelligent Media Annotation & Search Paper (pdf) W. Weiss, W. Halb (JRS) T. Bürger (LFUI) R. Villa, P. Swamy (UG) International Conference on Semantic Systems (I-Semantics 2009) (September 2009, Graz, Austria) Currently the media production domain lacks efficient ways to organize and search for media assets. Ontology-based applications have been identified as a viable solution to this problem, but they are sometimes too complex for non-experienced users. We present the SALERO Intelligent Media Annotation & Search system, which provides an integrated view onto results retrieved from different search engines. Furthermore, it offers a powerful, yet user-friendly Web-based environment to organize and search for media assets. |
|
|
GTM-URL Contribution to the INTERSPEECH 2009 Emotion Challenge Paper (pdf) S. Planet, I. Iriondo, J.C. Socoró, C. Monzo, J. Adell (URL) 10th Annual Conference of the International Speech Communication Association (INTERSPEECH 2009) (September 2009, Brighton, United Kingdom) This paper describes our participation in the INTERSPEECH 2009 Emotion Challenge. Starting from our previous experience in the use of automatic classification for the validation of an expressive corpus, we have tackled the difficult task of emotion recognition from speech with real-life data. Our main contribution to this work is related to the classifier sub-challenge, for which we tested several classification strategies. On the whole, the results were slightly worse than or similar to the baseline, but we found some configurations that could be considered in future implementations. |
|
Vocate: Auditory Interfaces for Location-based Services Paper (pdf) J. McGee, C. Cullen (DIT) Physicality Workshop (pdf) at 23rd Conference on Computer Human Interaction (HCI 2009) (September 2009, Cambridge, United Kingdom) This paper discusses work being carried out by the Vocate module of the LOK8 project. The LOK8 project seeks to develop location-based services within intelligent social environments, such as museums, art galleries, office buildings, and so on. It seeks to do this using a wide range of media and devices employing multiple modalities. The Vocate module is responsible for the auditory aspect of the LOK8 environment and will seek to exploit the natural strengths afforded by the auditory modality to make the LOK8 system user-friendly in multiple scenarios, including instances where the user needs to be hands-free or eyes-free, or when screen size on a mobile device might be an issue. We look at what kinds of services the Vocate module will be seeking to implement within the LOK8 environment and discuss the strengths and weaknesses of three possible approaches - sonification, auditory user interfaces, and speech interfaces. |
|
SALERO: Semantic AudiovisuaL Entertainment Reusable Objects Paper (pdf) G. Thallinger, G. Kienast (JRS) O. Mayor (UPF) C. Cullen (DIT) R. Hackett (BLITZ) J. Jose (UG) International Broadcasting Conference (IBC 2009) (September 2009, Amsterdam, The Netherlands) Broadcasters around the world are in desperate need to automate content production as much as possible. This need is twofold: on the one hand, automatic production of well-structured program parts is needed; on the other hand, production for different target devices is an issue. Over the past years, the EC project SALERO has developed a range of tools enabling automatic template-based production of animation clips with virtual presenters. In this paper we describe the workflow devised by the project to automate major parts of media production based on 3D content. This is accompanied by a description of the individual tools developed and examples from the experimental productions implemented with these tools. |
|
|
Intelligent Media Annotation & Search Poster (pdf) W. Weiss, G. Thallinger (JRS) Reasoning Web 2009 Summer School (August 2009, Brixen, Italy) Presentation of the Semantic Annotation Tool for intelligent media annotation and search. |
|
Emotional Speech Corpus Creation, Structure, Distribution and Re-Use Paper (pdf) B. Vaughan, C. Cullen (DIT) 1st Young Researchers Workshop in Speech Technology (YRWST 2009) (April 2009, Dublin, Ireland) This paper details the on-going creation of a natural emotional speech corpus, its structure, distribution, and re-use. Using Mood Induction Procedures (MIPs), high quality emotional speech assets are obtained, analysed, tagged (for acoustic features), annotated and uploaded to an online speech corpus. This method structures the corpus in a logical and coherent manner, allowing it to be utilized for more than one purpose, ensuring distribution via a URL and ease of access through a web browser. This is vital to ensuring the reusability of the corpus by third parties and third-party applications. |
|
Voice Processing and Synthesis by Performance Sampling and Spectral Models Dissertation (pdf) J. Bonada (UPF) Dissertation at the Pompeu Fabra University (2008, Barcelona, Spain) Singing voice is one of the most challenging musical instruments to model and imitate. Over several decades much research has been carried out to understand the mechanisms involved in singing voice production. In addition, from the very beginning of sound synthesis techniques, singing has been one of the main targets to imitate and synthesize, and a large number of synthesizers have been created with that aim. The goal of this thesis is to build a singing voice synthesizer capable of reproducing the voice of a given singer, both in terms of expression and timbre, sounding natural and realistic, and whose inputs would be just the score and the lyrics of a song. This is a very difficult goal, and in this dissertation we discuss the key aspects of our proposed approach and identify the open issues that still need to be tackled. 
This dissertation substantially contributes to the field of singing voice synthesis: a) it critically discusses spectral processing techniques in the context of singing voice modeling, and provides significant improvements to the current state of the art; b) it applies the proposed techniques to other application contexts such as real-time voice transformations, museum installations or video games; c) it develops the concept of synthesis based on performance sampling as a way to model the sonic space produced by a performer with an instrument, focusing on the specific case of the singing voice; d) it proposes and implements a complete framework for singing voice synthesis; e) it explores the sonic space of the singing voice and proposes a procedure to model it; f) it discusses the issues involved in the creation of the synthesizer’s database and provides tools to automate its generation; g) it performs a qualitative evaluation of the synthesis results, comparing those to the state of the art and to real singer performance; h) it implements all the research results into an optimized software application for singing voice analysis, modeling, transformation and synthesis, including tools for database creation; i) a significant part of this research has been incorporated into commercial singing voice software by Yamaha Corp. |
|
Automatic refinement of an expressive speech corpus assembling subjective perception and automatic classification Paper (available for purchase) I. Iriondo, S. Planet, J. Socoró, E. Martínez, F. Alías, C. Monzo (URL) SPECOM - Speech Communication (December 2008) This paper presents an automatic system able to enhance expressiveness in speech corpora recorded from acted or stimulated speech. The system is trained with the results of a subjective evaluation carried out on a reduced set of the original corpus. Once the system has been trained, it is able to check the complete corpus and perform an automatic pruning of the unclear utterances, i.e. with expressive styles which are different from the intended corpus. The content which most closely matches the subjective classification remains in the resulting corpus. An expressive speech corpus in Spanish, designed and recorded for speech synthesis purposes, has been used to test the presented proposal. The automatic refinement has been applied to the whole corpus and the result has been validated with a second subjective test. |
|
The Impact of 3D On the Future of Gaming A. Oliver (BLITZ) 3D Entertainment Summit (December 2008, Los Angeles, USA) Blitz Games Studios demonstrated a world first with a live demonstration of a true stereoscopic high quality interactive game running on current generation videogames consoles. Previously this had not been considered possible. SALERO tools supported the rapid development of the demonstrator game content. |
|
The SALERO Virtual Character Ontology Paper (pdf) T. Bürger (LFUI) P. Hofmair, G. Kienast (JRS) Workshop on Semantic 3D Media at SAMT 2008 - Third International Conference on Semantic and Digital Media Technologies (December 2008, Koblenz, Germany) The SALERO project observed a lack of ontologies for the description and annotation of characters in media production. In this field ontologies could be used to support media asset management, information retrieval, automated production or reuse. This paper presents the SALERO Virtual Character Ontology which can be used to describe and annotate characters in media production and game design to support aforementioned scenarios. |
|
Emotional Speech Corpora for Analysis and Media Production Paper (pdf) C. Cullen, B. Vaughan, S. Kousidis, J. McAuley (DIT) SAMT 2008 - Third International Conference on Semantic and Digital Media Technologies (December 2008, Koblenz, Germany) Research into the acoustic correlates of emotional speech as part of the SALERO project has led to the construction of high quality emotional speech corpora, which contain both IMDI metadata and acoustic analysis data for each asset. Research into semi-automated, re-usable character animation has considered the development of online workflows based on speech corpus assets that would provide a single point of origin for character animation in media production. In this paper, a brief description of the corpus design and construction is given. Further, a prototype workflow for semi-automated emotional character animation is also provided, alongside a description of current and future work. |
|
|
Adaptación del CTH-URL para la Competición Albayzin 2008 Paper (pdf, Spanish language) C. Monzo, L. Formiga, J. Adell, I. Iriondo, F. Alías, J. Socoró (URL) V Jornadas en Tecnología del Habla (JTH2008) – ALBAYZIN-08 System Evaluation Proposal (November 2008, Bilbao, Spain) In this work we describe the text-to-speech synthesis system presented to the Albayzin 2008 evaluation. The system follows the classic corpus-based unit concatenation scheme. The selection costs have been adjusted by means of a method based on genetic algorithms, and no prosody prediction has been used. Two systems, with different waveform generation algorithms, were built, and one of them was selected by means of a perceptual test. |
|
|
Procedimiento para la Medida y la Modificación del Jitter y del Shimmer Aplicado a la Síntesis del Habla expresiva Paper (pdf, Spanish language) C. Monzo, I. Iriondo, E. Martínez (URL) V Jornadas en Tecnología del Habla (JTH2008) (November 2008, Bilbao, Spain) This work presents a new procedure to measure the voice quality (VoQ) parameters jitter and shimmer. This new procedure takes into account the prosody contained in the sentence, so its effect is reduced before carrying out the measure for each parameter. In addition, in order to conduct the measure in a more reliable way, these parameters will be modified to be used in expressive speech synthesis. Finally, an evaluation is performed using a CMOS perceptual test on sentences in four expressive styles (aggressive, happy, sensual and sad) generated by a text-to-speech synthesis system using a prosody modelling module; in this way the utility of these parameters in different situations is studied. |
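As a rough illustration of the two parameters named in this abstract, the sketch below computes the widely used local jitter and local shimmer measures: the mean absolute difference between consecutive glottal cycles, relative to the mean value. It does not reproduce the prosody-compensated procedure of the paper; the function names and the sample values are hypothetical.

```python
def _relative_perturbation(values):
    """Mean absolute difference of consecutive values, divided by their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values[:-1], values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(periods):
    """Cycle-to-cycle variation of the pitch period (standard local jitter)."""
    return _relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (standard local shimmer)."""
    return _relative_perturbation(amplitudes)


# Hypothetical pitch periods (ms) and peak amplitudes for a few cycles.
periods_ms = [5.0, 5.1, 4.9, 5.0]
amplitudes = [0.80, 0.78, 0.82, 0.80]
print(f"jitter:  {local_jitter(periods_ms):.4f}")
print(f"shimmer: {local_shimmer(amplitudes):.4f}")
```

Both measures are dimensionless ratios, so a perfectly periodic signal with constant amplitude yields zero for each.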
|
|
Pitching to Partner: How to Match Industry and Research Needs Keynote Presentation M. Matthews, J. Webb (BLITZ) CGames 2008 - 13th International Conference on Computer Games (November 2008, Wolverhampton, United Kingdom) This keynote presentation opened a 2-day conference aimed at stimulating debate about and sharing advances in computer games technologies. The event also aims to support researchers to refine their ideas and find new avenues for further exploration. Mary Matthews gave an overview of the business of making games and where R&D sits within current business models – explaining that it brings both benefits and risks. There are benefits in that it can support a company to innovate but there may also be risks that servicing the project constrains the company’s resources and starves its core business. She gave pointers on structuring proposals to academic partners wishing to engage industry in research projects and used SALERO as an example of a successful collaboration. Jolyon Webb gave a technical and artistic overview of SALERO R&D and a live demonstration of procedural generation of characters. He explained the industry drivers behind the adoption of intelligent content creation techniques and showed how SALERO is enabling a ‘Work Smarter, Not Harder’ approach. |
|
|
Video Redundancy Detection In Rushes Collection Paper (pdf) R. Ren, P. Punitha, J. Jose (UG) ACM Multimedia 2008 (October 2008, Vancouver, Canada) Rushes are collections of raw video material. They contain various redundancies, such as rainbow screens, clipboard shots, white/black views, and unnecessary re-takes. This paper develops a set of solutions to remove these video redundancies as well as an effective system for video summarisation. We regard manual editing effects, e.g. clipboard shots, as differentiators in the visual language. A rushes video is therefore divided into a group of subsequences, each of which stands for a re-take instance. A graph matching algorithm is proposed to estimate the similarity between re-takes and suggest the best instance for content presentation. The experiments on the Rushes 2008 collection show that a video can be shortened to 4%-16% of the original size by redundancy detection. This significantly reduces the complexity in content selection and leads to an effective and efficient video summarisation system. |
|
Metadata Visualisation Techniques for Emotional Speech Corpora Paper (pdf) C. Cullen, B. Vaughan, S. Kousidis, J. McAuley (DIT) Second International Workshop on Adaptive Information Retrieval (AIR 2008) (October 2008, London, United Kingdom) Our research in emotional speech analysis has led to the construction of dedicated high quality, online corpora of natural emotional speech assets. Once obtained, the annotation and analysis of these assets was necessary in order to develop a database of both analysis data and metadata relating to each speech act. With annotation complete, the means by which this data may be presented to the user online for analysis, retrieval and organization is the current focus of our investigations. Building on an initial web interface developed in Ruby on Rails, we are now working towards a visually driven GUI built on Adobe Flex. This paper details our work towards this goal, defining the rationale behind development and also demonstrating work achieved to date. |
|
New Media: a narrative approach to content annotation Paper (pdf) J. McAuley, C. Cullen (DIT) Irish Media Research Network National Conference (IMRN 2008) (September 2008, Maynooth, Ireland) Recent years have seen an upsurge in the popularity of user-generated content. Sites such as Youtube and Flickr have illustrated that increasing numbers of web users are willing to publicly share their content, while equally sites such as Delicious and Blinklist demonstrate that growing numbers of users are willing to annotate each other’s content. Annotation, in this context, comes primarily under the guise of social tagging whereby users apply labels to resources in a subjective yet non-restrictive approach to subject-based indexing. |
|
Towards measuring continuous acoustic feature convergence in unconstrained spoken dialogues Paper (pdf) S. Kousidis, D. Dorran, B. Vaughan, C. Cullen (DIT) Interspeech 2008 (September 2008, Brisbane, Australia) Acoustic/prosodic feature (a/p) convergence has been known to occur both in dialogues between humans, as well as in human-computer interactions. Understanding the form and function of convergence is desirable for developing next generation conversational agents, as this will help increase speech recognition performance and naturalness of synthesized speech. Currently, the underlying mechanisms by which continuous and bi-directional convergence occurs are not well understood. In this study, a direct comparison between time-aligned frames shows significant similarity in acoustic feature variation between the two speakers. The method described (TAMA) constitutes a first step towards a quantitative analysis of a/p convergence. |
|
|
Wide-Band Harmonic Sinusoidal Modeling Paper (pdf) J. Bonada (UPF) DAFx-08 - 11th International Conference on Digital Audio Effects (September 2008, Espoo, Finland) In this paper we propose a method to estimate and transform harmonic components in wide-band conditions, out of a single period of the analyzed signal. This method allows estimating harmonic parameters with higher temporal resolution than typical Short Time Fourier Transform (STFT) based methods. We also discuss transformations and synthesis strategies in such context, focusing on the human voice. |
|
|
The Need for Formalizing Media Semantics in the Games and Entertainment Industry Paper (pdf) T. Bürger (LFUI) Journal of Universal Computer Science (August 2008, Volume 14, Issue 10) The digital media and games industry is one of the biggest IT based industries worldwide. Recent observations therein showed that current production workflows may be potentially improved as multimedia objects are mostly created from scratch due to insufficient reusability capacities of existing tools. In this paper we provide reasons for that, provide a potential solution based on semantic technologies, show the potential of ontologies, and provide scenarios for the application of semantic technologies in the digital media and games industry. |
|
|
Extending Voice-Driven Synthesis to Audio Mosaicing Paper (pdf) J. Janer, M. de Boer (UPF) 5th Sound and Music Computing Conference (August 2008, Berlin, Germany) This paper presents a system for controlling audio mosaicing with a voice signal, which can be interpreted as a further step in voice-driven sound synthesis. Compared to voice-driven instrumental synthesis, it increases the variety in the synthesized timbre. Also, it provides a more direct interface for audio mosaicing applications, where the performer voice controls rhythmic, tonal and timbre properties of the output sound. In a first step, voice signal is segmented into syllables, extracting a set of acoustic features for each segment. In the concatenative synthesis process, the voice acoustic features (target) are used to retrieve the most similar segment from the corpus of audio sources. We implemented a system working in pseudo-realtime, which analyzes voice input and sends control messages to the concatenative synthesis module. Additionally, this work raises questions to be further explored about mapping the input voice timbre space onto the audio sources timbre space. |
|
Temporal Attention Fusion For Sports Event Detection Paper (pdf) R. Ren, Y. Feng, J. Jose (UG) The 5th International Conference on Visual Information Engineering (August 2008, Xi'an, China) The employment of a psychological measurement, attention, alleviates the semantic uncertainty around video events and leads to an effective general event detection approach. This paper proposes a multi-resolution autoregressive framework to estimate a unified attention curve from multi-modality salient features at different temporal resolutions. The highlights of this work are: (1) the capability of using data at very coarse temporal resolutions, e.g. three minutes; (2) the robustness against noise caused by modality asynchronism and feature collection size; and (3) the utilisation of Markovian temporal constraints on content presentation. This approach achieved 100% goal event coverage in the football video collection of the FIFA World Cup 2002, 2006 and UEFA League 2006. |
|
Rule-Based Scene Boundary Detection for Semantic Video Segmentation Paper (pdf) Y. Feng, R. Ren, J. Jose (UG) The 5th International Conference on Visual Information Engineering (August 2008, Xi'an, China) In this paper, we present a novel method for semantic video segmentation that uses both low-level features and high-level rules on videos and manages them in a hierarchical structure of key-frame, shot and scene. Features in the color domain are calculated and utilized for detecting the key-frames and estimating the similarity between shots. By applying the predefined high-level rules, similar shots are merged and the scene boundaries are determined. Finally, a likelihood function is designed for improving the accuracy of scene boundary results. Experimental results on several Hollywood movies show that better precision and recall are achieved compared with other existing works. |
|
A User Centered Annotation Methodology for Multimedia Content Paper (pdf) T. Bürger, C. Ammendola (LFUI) 5th European Semantic Web Conference (June 2008, Tenerife, Spain) Fully automated solutions for semantic annotation of multi-media content still do not deliver satisfying results. Most manual ontology-based annotation approaches are not suitable for end users who are not experienced with navigating huge ontologies or extending the ontologies used to annotate. We thus present an annotation methodology which supports the user in the aforementioned tasks. This lowers the entry-barrier for non-experienced users to produce ontology based annotations and thus could be used in situations in which annotation should happen just-in-time during the creation of the media which is being annotated. |
|
A Benefit Estimation Model for Ontologies Paper (pdf) T. Bürger (LFUI) 5th European Semantic Web Conference (June 2008, Tenerife, Spain) Predicting the economic value of ontologies is important for their use in productive environments. The measurement of the economic value of information systems usually consists of an assessment of its costs and benefits. While methods for cost estimation for ontology engineering have already been proposed, no method to quantify the benefits of the use of ontologies exists. We thus propose a method for benefit estimation that can be applied to ontologies based on a multiple gap model for user information satisfaction analysis. Together with cost estimation methods this model can be used to predict the economic value of ontologies. |
|
|
Emotional speech corpus construction, annotation and distribution Paper (pdf) C. Cullen, B. Vaughan, S. Kousidis (DIT) Language Resources and Evaluation Conference (LREC 2008) (May 2008, Marrakech, Morocco) This paper details a process of creating an emotional speech corpus by collecting natural emotional speech assets, analysing and tagging them (for certain acoustic and linguistic features) and annotating them within an on-line database. The definition of specific metadata for use with an emotional speech corpus is crucial, in that poorly (or inaccurately) annotated assets are of little use in analysis. This problem is compounded by the lack of standardisation for speech corpora, particularly in relation to emotion content. The ISLE Metadata Initiative (IMDI) is the only cohesive attempt at corpus metadata standardisation performed thus far. Although not a comprehensive (or universally adopted) standard, IMDI represents the only current standard for speech corpus metadata available. The adoption of the IMDI standard allows the corpus to be re-used and expanded, in a clear and structured manner, ensuring its re-usability and usefulness as well as addressing issues of data sparsity within the field of emotional speech research. |
|
|
Towards Intelligent Assembly of Media Assets for Automated Character Animation Paper (pdf) M. Hausenblas, R. Mörzinger, P. Hofmair, W. Haas (JRS) 1st Workshop on Multimedia Annotation and Retrieval enabled by Shared Ontologies (December 2007, Genova, Italy) Creating character animations manually is an expensive and laborious task. In this work we analyse the current, manual workflow of creating character animations. We derive requirements for an automated process, and propose to utilise linked open datasets for context management, along with ontologies to assemble and reuse character animations. First experiences with the prototypical implementation of the context manager are reported. |
|
|
TRECVid 2007 - High Level Feature Extraction Experiments at JOANNEUM RESEARCH Paper (pdf) R. Mörzinger, G. Thallinger (JRS) TRECVid Evaluation Workshop (November 2007, Gaithersburg, USA) This paper describes our experiments for the high level feature extraction task in TRECVid 2007. We submitted the following five runs:
Our submission made use of support vector machines based on a variety of image and video features. The results of the experiments show that four out of five runs achieved a performance above the TRECVid median, including a run with 18 out of 20 evaluated high level features equal or above the median compared with inferred average precision. The mean inferred average precision of our baseline run is 0.056. Early fusion performed slightly better than late fusion on average, although the latter produced more scores above the TRECVid median. The experiment on concept correlation generally impaired the performance and outscored the baseline only for a few features. Heuristic low-level feature combinations displayed a rather poor performance. We assume that the good baseline is due to the effective grounding of a variety of low-level visual features and the generalization capability of the SVM framework with high-dimensional feature spaces. |
|
Why Real-World Multimedia Assets Fail to Enter the Semantic Web Paper (pdf), Presentation (pdf) T. Bürger (LFUI), M. Hausenblas (JRS) Semantic Authoring, Annotation and Knowledge Markup Workshop (October 2007, Whistler, Canada) Making multimedia assets on the one hand first-class objects on the Semantic Web, while keeping them on the other hand conforming to existing multimedia standards is a non-trivial task. Most proprietary media asset formats are binary, optimized for streaming or storage. However, the semantics carried by the media assets are not accessible directly. In addition, multimedia description standards lack the expressiveness to gain a semantic understanding of the media assets. There exists an array of requirements regarding media assets and the Semantic Web, already. Based on a critical review of these requirements we investigate how ontology languages fit into the picture. We finally analyse the usefulness of formal accounts to describe spatio-temporal aspects of multimedia assets in a practical context. |
|
|
The Need for Formalizing Media Semantics in the Games and Entertainment Industry Paper (pdf) T. Bürger (LFUI), H. Zeiner (JRS) I-MEDIA '07 - 1st International Conference on New Media Technology (September 2007, Graz, Austria) The digital media and games industry is one of the biggest IT based industries worldwide. Recent observations therein showed that current production workflows may be potentially improved as multimedia objects are mostly created from scratch due to insufficient reusability capacities of existing tools. In this paper we provide reasons for that, provide a potential solution based on semantic technologies, show the potential of ontologies, and provide scenarios for the application of semantic technologies in the digital media and games industry. |
|
|
Annotating Music Collections: How content-based similarity helps to propagate labels Paper (pdf) M. Sordo, C. Laurier, O. Celma (UPF) ISMIR 2007 - 8th International Conference on Music Information Retrieval (September 2007, Vienna, Austria) In this paper we present a way to annotate music collections by exploiting audio similarity. Similarity is used to propose labels (tags) to yet unlabeled songs, based on the content-based distance between them. The main goal of our work is to ease the process of annotating huge music collections, by using content-based similarity distances as a way to propagate labels among songs. We present two different experiments. The first one propagates labels that are related to the style of the piece, whereas the second experiment deals with mood labels. On the one hand, our approach shows that using a music collection annotated at 40% with styles, the collection can be automatically annotated up to 78% (that is, 40% already annotated and the rest, 38%, only using propagation), with a recall greater than 0.4. On the other hand, for a smaller music collection annotated at 30% with moods, the collection can be automatically annotated up to 65% (i.e. 30% plus 35% using propagation). |
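The label propagation idea described in this abstract can be illustrated with a minimal sketch. It assumes that each song is represented by a small numeric feature vector and that Euclidean distance stands in for the content-based distance; the function name `propagate_labels`, the parameters `k` and `max_distance`, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def propagate_labels(features, labels, k=1, max_distance=1.0):
    """Propose labels for unlabeled items from their nearest labeled neighbours.

    features: dict mapping item id -> list of floats (assumed descriptors)
    labels:   dict mapping item id -> label, for already annotated items
    Returns a dict mapping item id -> proposed label. Items whose nearest
    labeled neighbours are all farther than max_distance stay unlabeled.
    """
    proposed = {}
    for item, vec in features.items():
        if item in labels:
            continue  # already annotated, nothing to propagate
        # Distances from this item to every annotated item, closest first.
        neighbours = sorted(
            (math.dist(vec, features[other]), labels[other]) for other in labels
        )[:k]
        close = [label for distance, label in neighbours if distance <= max_distance]
        if close:
            # Majority vote among the close neighbours.
            proposed[item] = max(set(close), key=close.count)
    return proposed


# Toy collection: two annotated songs, three unannotated ones.
features = {
    "a": [0.0, 0.0], "b": [0.1, 0.0],
    "c": [5.0, 5.0], "d": [5.1, 5.0],
    "e": [100.0, 100.0],
}
labels = {"a": "rock", "c": "jazz"}

proposed = propagate_labels(features, labels)
coverage = (len(labels) + len(proposed)) / len(features)
```

In this toy run the collection starts 40% annotated (2 of 5 songs), and coverage after propagation counts both the original and the proposed labels, in the same way the abstract adds the propagated share to the initial share. The distance threshold is what keeps outliers such as song "e" unlabeled.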
|
|
Discriminating Expressive Speech Styles by Voice Quality Parameterization Paper (pdf) C. Monzo, F. Alías, I. Iriondo, X. Gonzalvo, S. Planet (URL) ICPhS07 - International Congress of Phonetic Sciences (August 2007, Saarbrücken, Germany) In this work, the capability of voice quality parameters to discriminate among different expressive speech styles is analyzed. To that effect, the data distribution of these parameters, directly measured from the acoustic speech signal, is used to train a Linear Discriminant Analysis that conducts an automatic classification. As a result, the most relevant voice quality patterns for discriminating expressive speech styles are obtained for a diphone and triphone Spanish speech corpus with five expressive speaking styles: neutral, happy, sad, sensual and aggressive. |
|
|
Expressive Speech Corpus Validation by Mapping Subjective Perception to Automatic Classification Based on Prosody and Voice Quality Paper (pdf) I. Iriondo, S. Planet, J.C. Socoró, F. Alías, C. Monzo, E. Martínez (URL) ICPhS07 - International Congress of Phonetic Sciences (August 2007, Saarbrücken, Germany) This paper presents the validation of the expressive content of an acted corpus produced to be used in speech synthesis, since this kind of emotional speech can be rather lacking in authenticity. The goal is to obtain an automatic classifier able to prune the bad utterances from an expressiveness point of view. The results of a previous subjective test are used for training a multistage emotional identification system based on statistical features computed from the speech prosody and voice quality. Finally, the system provides a set of utterances to be checked and definitively eliminated if appropriate. |
|
Task-Based Mood Induction Procedures for the Elicitation of Natural Emotional Responses B. Vaughan, S. Kousidis, and Ch. Cullen (DIT) CCCT 2007 - The 5th International Conference on Computing, Communications and Control Technologies (July 2007, Orlando, USA) |
|
|
Validation of an Expressive Speech Corpus by Mapping Automatic Classification to Subjective Evaluation Book chapter (from Springer) I. Iriondo, S. Planet, J.C. Socoró, F. Alías, E. Martínez (URL) IWANN 2007 - 9th International Work-Conference on Artificial Neural Networks (June 2007, San Sebastián, Spain) This paper presents the validation of the expressive content of an acted corpus produced to be used in speech synthesis. Acted speech can be rather lacking in authenticity and therefore its expressiveness needs to be validated. The goal is to obtain an automatic classifier able to prune the bad utterances (those with wrong expressiveness). Firstly, a subjective test has been conducted with almost ten percent of the corpus utterances. Secondly, objective techniques have been carried out by means of automatic identification of emotions using different algorithms applied to statistical features computed over the speech prosody. The relationship between both evaluations is achieved by an attribute selection process guided by a metric that measures the matching between the utterances misclassified by the users and by the automatic process. The experiments show that this approach can be useful to provide a subset of utterances with poor or wrong expressive content. |
|
|
Extracting User Preferences by GTM for aiGA Weight Tuning in Unit Selection Text-to-Speech Synthesis Book chapter (from Springer) Ll. Formiga, F. Alías (URL) IWANN 2007 - 9th International Work-Conference on Artificial Neural Networks (June 2007, San Sebastián, Spain) Unit-selection based Text-to-Speech synthesis systems aim to obtain high quality synthetic speech by selecting previously recorded units. These units are selected by a dynamic programming algorithm guided by a weighted cost function. Weights should be tuned by means of perception from listening users to obtain proper quality. In previous works we have proposed to subjectively tune these weights through an interactive evolutionary process, also known as Active Interactive Genetic Algorithm. A problem arises when different users, although being consistent, evolve to different weight configurations. In this proof-of-principle work, we introduce GTM as a method to extract knowledge from user specific preferences. The experiments show that GTM is able to capture user preferences, thus avoiding the need to select the best evolved weight configuration by means of a new preference test. |
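The abstract mentions that units are selected by dynamic programming over a weighted cost function. A minimal sketch of such a Viterbi-style search follows, assuming a cost made of a target term and a concatenation term with one weight each. The function name `select_units`, the cost callables, and the weights `w_target` and `w_concat` are hypothetical; the actual cost function and weight set of the system described are not given in this listing.

```python
def select_units(candidates, target_cost, concat_cost, w_target=1.0, w_concat=1.0):
    """Pick one unit per position so that the total weighted cost is minimal.

    candidates:  list of lists, the candidate units for each target position
    target_cost: callable (position, unit) -> float (assumed cost term)
    concat_cost: callable (previous_unit, unit) -> float (assumed cost term)
    Returns (best unit sequence, total cost).
    """
    # best[i][j] = (cumulative cost, index of best predecessor) for
    # candidate j at position i.
    best = [[(w_target * target_cost(0, unit), None) for unit in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            # Cheapest way to reach this unit from any candidate at i - 1.
            cost, back = min(
                (best[i - 1][p][0] + w_concat * concat_cost(prev, unit), p)
                for p, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + w_target * target_cost(i, unit), back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda n: best[-1][n][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example: units are numbers, joining cost is their absolute difference.
path, total = select_units(
    [[1, 5], [2, 9], [3, 7]],
    target_cost=lambda position, unit: 0.0,
    concat_cost=lambda previous, unit: abs(previous - unit),
)
```

Changing `w_target` and `w_concat` shifts which sequence is cheapest, which is why the tuning of such weights matters for the resulting speech quality.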
|
|
Enhancing CBIR Through Feature Optimization, Combination and Selection Paper (pdf, available to IEEE subscribers) X. Hilaire, J. Jose (UG) CBMI 2007. International Workshop on Content-Based Multimedia Indexing (June 2007, Bordeaux, France) We present a Content-Based Image Retrieval (CBIR) method based on the combination and selection of several image features. The novelty of our approach over existing methods is threefold: we provide a statistical optimization of the similarity distance for each feature; we replace certain features by a selection in a non-linear expansion of them; and we perform a linear combination of the features. We demonstrate superior capabilities of our method in certain cases over support vector machines (SVM) on a COREL image collection. |
|
|
Simulated testing of an adaptive multimedia information retrieval system Paper (pdf) F. Hopfgartner, J. Urban, R. Villa, J. Jose (UG) CBMI 2007. International Workshop on Content-Based Multimedia Indexing (June 2007, Bordeaux, France) The Semantic Gap is considered to be a bottleneck in image and video retrieval. One way to increase the communication between user and system is to take advantage of the user’s action with a system, e.g. to infer the relevance or otherwise of a video shot viewed by the user. In this paper we introduce a novel video retrieval system and propose a model of implicit information for interpreting the user’s actions with the interface. The assumptions on which this model was created are then analysed in an experiment using simulated users based on relevance judgements to compare results of explicit and implicit retrieval cycles. Our model seems to enhance retrieval results. Results are presented and discussed in the final section. |
|
|
HMM-Based Spanish Speech Synthesis Using CBR as F0 Estimator Paper (pdf) X. Gonzalvo, I. Iriondo, J.C. Socoró, F. Alías, C. Monzo (URL) NOLISP 2007 - An ISCA Tutorial and Research Workshop on NOn LInear Speech Processing (May 2007, Paris, France) Hidden Markov Models based text-to-speech (HMM-TTS) synthesis is a technique for generating speech from trained statistical models where spectrum, pitch and durations of basic speech units are modelled altogether. The aim of this work is to describe a Spanish HMM-TTS system using CBR as an F0 estimator, analysing its performance objectively and subjectively. The experiments have been conducted on a reliable labelled speech corpus, whose units have been clustered using contextual factors according to the Spanish language. The results show that the CBR-based F0 estimation is capable of improving the HMM-based baseline performance when synthesizing nondeclarative short sentences and when reduced contextual information is available. |
|
|
Objective and Subjective Evaluation of an Expressive Speech Corpus I. Iriondo, S. Planet, J.C. Socoró, F. Alías (URL) NOLISP 2007 - An ISCA Tutorial and Research Workshop on NOn LInear Speech Processing (May 2007, Paris, France) This paper presents the validation of the expressiveness of an acted oral corpus produced to be used in speech synthesis. Firstly, an objective validation has been conducted by means of automatic emotion identification techniques using statistical features extracted from the prosodic parameters of speech. Secondly, a listening test has been performed with a subset of utterances. The relationship between both objective and subjective evaluations is analyzed and the obtained conclusions can be useful to improve the following steps related to expressive speech synthesis. |
|
VAMP: Semantic Validation for MPEG-7 Profile Descriptions Technical Report (pdf) R. Troncy (Centrum voor Wiskunde en Informatica), W. Bailer, M. Hausenblas, M. Höffernig (JRS) Technical report published by Centrum voor Wiskunde en Informatica, INS - Information Systems (April 2007, Amsterdam, Netherlands) MPEG-7 can be used to create complex and comprehensive metadata descriptions of multimedia content. Since MPEG-7 is defined in terms of an XML schema, the semantics of its elements has no formal grounding. In addition, certain features can be described in multiple ways. MPEG-7 profiles are subsets of the standard that apply to specific application areas and that aim to reduce this syntactic variability, but they still lack formal semantics. We propose an approach for expressing the semantics explicitly by formalizing the constraints of various profiles using ontologies and logical rules, thus enabling interoperability and automatic use for MPEG-7 based applications. We have implemented VAMP, a full semantic validation service that detects any inconsistencies of the semantic constraints formalized. Another contribution of this paper is an analysis of how MPEG-7 is practically used. We report on experiments about the semantic validity of MPEG-7 descriptions produced by numerous tools and projects and we categorize the most common errors found. |
|
|
Prosody Modelling of Spanish for Expressive Speech Synthesis I. Iriondo, J.C. Socoró, F. Alías (URL) ICASSP'07 - International Conference on Acoustics, Speech, and Signal Processing (April 2007, Hawaii, USA) This paper presents the use of analogical learning, in particular case-based reasoning, for the automatic generation of prosody from text, which is automatically tagged with prosodic features. This is a corpus-based method for quantitative modelling of prosody to be used in a Spanish text to speech system. The main objective is the development of a method for predicting the three main prosodic parameters: the fundamental frequency (F0) contour, the segmental duration and energy. Both objective and subjective experiments have been conducted in order to evaluate the accuracy of our proposal. |
|
Content-Based Audio Search: From Fingerprinting to Semantic Audio Retrieval Dissertation (pdf) P. Cano (UPF) Dissertation at the Pompeu Fabra University (2007, Barcelona, Spain) This dissertation is about audio content-based search. Specifically, it is on exploring promising paths for bridging the semantic gap that currently prevents wide deployment of audio content-based search engines. Music search sound engines rely on metadata, mostly human generated, to manage collections of audio assets. Even though time-consuming and error-prone, human labeling is a common practice. Audio content-based methods, algorithms that automatically extract description from audio files, are generally not mature enough to provide the user friendly representation that users demand when interacting with audio content. Mostly, content-based methods provide low-level descriptions, while high-level or semantic descriptions are beyond current capabilities. |
|
Spectral Processing of the Singing Voice Dissertation (pdf) A. Loscos (UPF) Dissertation at the Pompeu Fabra University (2007, Barcelona, Spain) This dissertation is centered on the digital processing of the singing voice, more concretely on the analysis, transformation and synthesis of this type of voice in the spectral domain, with special emphasis on those techniques relevant for music applications. The digital signal processing of the singing voice has been a research topic in itself since the middle of last century, when the first synthetic singing performances were generated taking advantage of the research that was being carried out in the speech processing field. Even though both topics overlap in some areas, they present significant differences because of (a) the special characteristics of the sound source they deal with and (b) the applications that can be built around them. More concretely, while speech research concentrates mainly on recognition and synthesis, singing voice research, probably due to the consolidation of a forceful music industry, focuses on experimentation and transformation, developing countless tools that over the years have assisted and inspired most popular singers, musicians and producers. The compilation and description of the existing tools and the algorithms behind them are the starting point of this thesis. |
|
|
SALERO: Semantic Audiovisual Entertainment Reusable Objects Paper (pdf), Poster (pdf) W. Haas, G. Thallinger (JRS), P. Cano (UPF), Ch. Cullen (DIT), T. Bürger (LFUI) 1st International Conference on Semantic and Digital Media Technologies - SAMT 2006 (December 2006, Athens, Greece) The Integrated Project SALERO aims to advance the state of the art in digital media to the point where it becomes possible to create audiovisual content for cross-platform delivery using intelligent content tools, with greater quality at lower cost, to provide audiences with more engaging entertainment and information at home or on the move. SALERO will build on and extend research in media technologies, web semantics and context based image retrieval, to reverse the trend toward ever-increasing cost of creating media. |
|
Modelado y estimación de la prosodia mediante razonamiento basado en casos (Modelling and Estimation of Prosody by Means of Case-Based Reasoning) Paper (pdf, Spanish language) I. Iriondo, J.C. Socoró, L. Formiga, X. Gonzalvo, F. Alías, P. Miralles (URL) IV Jornadas en Tecnología del Habla (November 2006, Zaragoza, Spain) This paper presents the use of analogical learning, in particular case-based reasoning, for the automatic generation of prosody from text, which is automatically tagged with prosodic features. This is a corpus-based method for quantitative modeling of prosody to be used in a Spanish text to speech system. The main objective is the development of a method for predicting the three main prosodic parameters: the fundamental frequency (F0) contour, the segmental duration and energy. Both objective and subjective experiments have been conducted in order to evaluate the accuracy of our proposal. |
|
Estudio de Heurísticas para la implementación de A* en CTH basados en selección de unidades (Heuristics for Implementing the A* Algorithm for Unit Selection TTS Synthesis Systems) Paper (pdf, Spanish language) L. Formiga, F. Alías (URL) IV Jornadas en Tecnología del Habla (November 2006, Zaragoza, Spain) Unit Selection based Text to Speech Systems (USTTS) need to perform an optimal search of units in a speech corpus in order to obtain a high-quality synthesis. Until now, this search has been carried out by a Viterbi algorithm. Our work replaces the formerly used algorithm with the A* algorithm to improve computational efficiency. With that goal, previous work attempting this substitution is reviewed. Afterwards, a benchmark is defined to score its efficiency, and the results are analyzed to validate, in the last step, the theoretical argumentation. |
|
|
Generation of High Quality Audio Natural Emotional Speech Corpus using Task Based Mood Induction Paper (pdf) Ch. Cullen, B. Vaughan, S. Kousidis, Y. Wang, C. McDonnell, D. Campbell (DIT) 1st International Conference on Multidisciplinary Information Sciences and Technologies (October 2006, Mérida, Spain) Detecting emotional dimensions in speech is an area of great research interest, notably as a means of improving human computer interaction in areas such as speech synthesis. In this paper, a method of obtaining high quality emotional audio speech assets is proposed. The methods of obtaining emotional content are subject to considerable debate, with distinctions between acted and natural speech being made on the grounds of authenticity. Mood Induction Procedures (MIPs) are often employed to stimulate emotional dimensions in a controlled environment. This paper details experimental procedures based around MIP 4, using performance related tasks to engender activation and evaluation responses from the participant. Tasks are specified involving two participants, who must co-operate in order to complete a given task within the allotted time. Experiments designed in this manner also allow for the specification of high quality audio assets (notably 24-bit/192 kHz), within an acoustically controlled environment, thus providing means of reducing unwanted acoustic factors within the recorded speech signal. Once suitable assets are obtained, they will be assessed for the purposes of segregation into differing emotional dimensions. The most statistically robust method of evaluation involves the use of listening tests to determine the perceived emotional dimensions within an audio clip. In this experiment, the FeelTrace rating tool is employed within user listening tests to specify the categories of emotional dimensions for each audio clip. |
|
|
The Use of Task Based Mood-Induction Procedures to Generate High Quality Emotional Assets. Poster (pdf) B. Vaughan, Ch. Cullen, S. Kousidis, Y. Wang, C. McDonnell, D. Campbell (DIT) IT&T - Information Technology and Telecommunications Conference (October 2006, Carlow, Ireland) Detecting emotion in speech is important in advancing human-computer interaction, especially in the area of speech synthesis. This poster details experimental procedures based on Mood Induction Procedure 4, using performance related tasks to engender natural emotional responses in participants. These tasks are aided or hindered by the researcher to elicit the desired emotional response. These responses will then be recorded and their emotional content graded to form the basis of an emotional speech corpus. This corpus will then be used to develop a rule-set for basic emotional dimensions in speech. |
|
Groovator - An Implementation of Real-Time Rhythm Transformations Paper (pdf) J. Janer, J. Bonada, S. Jordà (UPF) 121st AES Convention (October 2006, San Francisco, USA) This paper describes a real-time system for rhythm manipulation of polyphonic audio signals. A rhythm analysis module extracts information on tempo and beat location. Based on this rhythm information, we apply different transformations: Tempo, Swing, Meter and Accent. This type of manipulation is generally referred to as content-based transformation. We address characteristics of the analysis and transformation algorithms. In addition, user interaction also plays an important role in this system. Tempo variations can be controlled either by tapping the rhythm with a MIDI interface or by using an external audio signal such as percussion or the voice as tempo control. We conclude by pointing out several use cases, focusing on live performance situations. |
|
Esophageal Voice Enhancement by Modeling Radiated Pulses in Frequency Domain Paper (pdf) A. Loscos, J. Bonada (UPF) 121st AES Convention (October 2006, San Francisco, USA) Although esophageal speech has proven to be the most popular voice recovery method after laryngectomy surgery, it is difficult to master and shows a poor degree of intelligibility. This article proposes a new method for esophageal voice enhancement using speech digital signal processing techniques based on modeling radiated voice pulses in the frequency domain. The analysis-transformation-synthesis technique creates a non-pathological spectrum for those utterances featured as voiced and filters those unvoiced. Healthy spectrum generation implies transforming the original timbre, modeling harmonic phase coupling from the spectral shape envelope, and deriving pitch from frame energy analysis. Resynthesized speech aims to improve intelligibility, minimize artificial artifacts, and acquire resemblance to the patient’s original pre-surgery voice. |
|
A Corpus with Teeth Presentation (pdf) D. Campbell, M. Meinardi, B. Richardson, C. McDonnell (DIT) EUROCALL Conference (September 2006, Granada, Spain) ReCALL Journal (Vol 19, No. 1, January 2007, University of Hull, United Kingdom) This paper outlines the ongoing construction of a speech corpus for use by applied linguists and advanced EFL/ESL students. The first section establishes the need for improvements in the teaching of listening skills and pronunciation practice for EFL/ESL students. It argues for the need to use authentic native-to-native speech in the teaching/learning process so as to promote social inclusion and contextualises this within the literature, based mainly on the work of Swan, Brown and McCarthy. The second part addresses features of native speech flow which cause difficulties for EFL/ESL students (Brown, Cauldwell) and establishes the need for improvements in the teaching of listening skills. Examples are given of reduced forms characteristic of relaxed native speech, and how these can be made accessible for study using the Dublin Institute of Technology’s slow-down technology, which gives students more time to study native speech features, without tonal distortion. The final section introduces a novel Speech Corpus being developed at DIT. It shows the limits of traditional corpora and outlines the general requirements of a Speech Corpus. This tool - which will satisfy the needs of teachers, learners and researchers - will link digitally recorded, natural, native-to-native speech so that each transcript segment will be linked to its associated sound file. Users will be able to locate desired speech strings, play, compare and contrast them - and slow them down for more detailed study. |
|
|
A Pitch Marks Filtering Algorithm based on Restricted Dynamic Programming Paper (pdf) F. Alías, C. Monzo, J.C. Socoró (URL) InterSpeech2006 - International Conference on Spoken Language Processing (ICSLP) (September 2006, Pittsburgh, USA) In this paper, a generic pitch marks filtering algorithm (PMFA) is introduced in order to achieve reliable and smooth pitch marks from any input pitch tracking or marking algorithm. The proposed PMFA is a simple yet effective filtering process based on restricted dynamic programming, and is very helpful for minimizing human intervention when creating large speech corpora. Moreover, this work introduces a novel pitch marking evaluation measure for directly comparing pitch marking algorithms with different location criteria. The experiments demonstrate that the proposed PMFA improves the results of the input state-of-the-art pitch tracking and marking algorithms dramatically. |
|
|
Current Perspectives on Music Technologies & Multimedia Presentation (pdf) G. Holmberg (UPF) ENGAGE 2006 (September 2006, Jakarta, Indonesia) In the near future, when the analogue radio & TV net is closed down, we will most probably have in our home some kind of digital Home Entertainment Platform/Media Center. And to an even greater extent than today, we will carry with us portable media players & storage devices. A true digital revolution will radically alter our behavior with multimedia objects, such as music & audio. We will have constant access to the Internet, with all music & media of all times and origins available. This will necessarily require, on the one hand, new and advanced methods of search & retrieval. This is the field of MIR (Music Information Retrieval) and Audio Content Analysis. On the other hand, we have the field of Audio Transformation & Synthesis: you will no longer be restricted to only download & passively press "play". You will be able to interact with media objects, such as play the song in a different key, or slower/faster; suppress vocals and sing along; and you will be able to remix & play around with music, broadcast yourself and easily create new, personalized "versions" of the media object. We believe that the boundary between professional audio & media creation technology and home entertainment is just about to dissolve, in an explosion of breath-taking technological developments & human creative power. |
|
|
Transcripción fonética de acrónimos en castellano utilizando el algoritmo C4.5 (Phonetic Transcription of Spanish Acronyms Using the C4.5 Algorithm) Paper (pdf, Spanish language) C. Monzo, F. Alías, J.A. Morán, X. Gonzalvo (URL) XXII Congreso de la SEPLN (September 2006, Zaragoza, Spain) This work presents an automatic acronym transcription system intended to increase the synthetic speech quality of text-to-speech systems when acronyms are present in the input text. The acronym transcription is conducted by using a decision tree (C4.5 algorithm). The work presents the results obtained for different algorithm configurations, validating its performance with respect to other learning systems. |
|
|
Letting the Corpus Speak Presentation (pdf) D. Campbell (DIT) IVACS - Inter Varietal Corpus Studies (June 2006, Limerick, Ireland) This presentation outlines the current state of development of DIT’s nascent speech corpus. This will allow a body of spoken material to be searched for features of informal native speech via a normalised transcription. Once located, the original sound files can be played at normal speed or slowed down in order to better study the speech act itself. That this aspect of language learning has been neglected for decades has frequently been lamented by natural language specialists such as Richard Cauldwell. |
|
Let the Corpus Speak! Presentation (pdf) D. Campbell (DIT) 40th IATEFL Annual Conference and Exhibition (April 2006, Harrogate, United Kingdom) This presentation contrasts existing corpora with the novel Speech Corpus being developed at DIT. It points up the limits of existing - written, and even spoken - corpora and outlines the general requirements of a Speech Corpus. This tool - which will satisfy the needs of teachers, learners and researchers - will link digitally recorded, natural, native-native speech acts (in WAV format) with their idealised, orthographic transcriptions. The transcriptions can be fed through a concordancer, with each transcript segment linked to its associated sound file. The segments will also be tagged for speed of delivery, which will allow users to locate the desired speech strings, play them, compare and contrast them, and - if necessary - slow them down for more detailed study. |
|
|
SALERO: Semantic Audiovisual Entertainment Reusable Objects Abstract (pdf), Poster (pdf) W. Haas, G. Thallinger (JRS) 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (November 2005, London, United Kingdom) Ever since the idea of convergence was floated, the media industry has been talking about cross-platform exploitation as a way of producing more exciting content more cost-effectively. But while technology has helped to produce better quality sounds and images, the costs continue to rise. It is virtually impossible to re-use items from previous productions (regardless of issues of copyright) in different contexts, as the majority of sounds and images only work in the context and media type for which they were originally made. |