

Broadcast Technology No.63, Winter 2016 ● © NHK STRL

This article describes the trends in speech presentation technologies used in broadcast media for the visually impaired, as well as the development of an adaptive speech-rate conversion technology, one of various user-friendly broadcasting technologies that NHK has been working on. We also describe some of the initiatives being taken to implement these technologies in a practical form.

1. Introduction

Broadcasting and almost all multimedia content includes an audio component, and in most cases, visually impaired and sighted people alike are able to access and benefit from the information contained therein. However, in some cases, it is hard to convey the information in the content to visually impaired people. Moreover, even when it can be conveyed, this may only be possible at some disadvantage to visually impaired users; e.g., their physical or psychological difficulties might prevent them from fully understanding the content, or it might take a long time for the information to be understood. The challenge for barrier-free technologies is to eliminate such disadvantages for visually impaired people.

Among those who experience difficulty obtaining information from visual media are people with a variety of perceptual impairments or characteristics other than just legal blindness or poor vision, e.g., persons with special color perception and sensitivity to changes in light level. These people should have various ways of securing the information they need*1. Thanks to the rapid progress of information and communication technology (ICT), we are entering an age when individuals will be able to obtain the information they need in the form they need. In this regard, barrier-free technologies using ICT are becoming an important means of reducing the information disparity between persons with perceptual impairments and those with normal vision.

This article describes research trends in one such barrier-free technology: speech presentation technologies intended mainly for blind users and users with poor vision. It also introduces NHK’s initiative for promoting user-friendly broadcasting technologies by describing the development and implementation of an adaptive speech-rate conversion technology that supports visually impaired users.

2. Issues affecting speech communication for visually impaired

In the history of telecommunications, communication has mainly taken the form of spoken words transmitted electronically, and the telephone has played a major role. Now, however, as exemplified by the popularity of social networking services (SNS), “silent” communication through the sending of text and images is engulfing the world. Statistics1) indicate that mobile phone services now reach almost every corner of the world2) and that a growing proportion of people are using their phones for purposes other than talking. Large amounts of non-speech information are circulating on networks that were originally intended for speech. As reasons for this trend, consider some of the advantages of converting speech into text: the information exchanged by the sender and receiver does not have to be shared in real time, and it is easier to search and scan. On the other hand, speech is sequential information; it must be apprehended in real time, and it is difficult to shorten its time scale.

Research Trends in barrier-free speech presentation technologies for visually impaired

*1 Providing an alternate means of obtaining information that would otherwise be difficult to obtain because of visual or hearing disabilities.


It is also difficult to search recorded speech for the desired information.

Visually impaired people often say that while sighted people can obtain information efficiently by scanning text, they themselves do not have an equivalent means of getting information quickly from speech or audio. Even variable-speed playback on audio players is in most cases not as fast (or as efficient) as visual scanning of text. In addition, many people find it tiring to listen to fast playback. While a reader can freely switch between scanning and careful reading, speech does not give this freedom, and there is no equivalent to visual scanning when listening. Thus, the inefficiency of acquiring and summarizing information is an issue affecting the usability of speech information regardless of whether it is for visually impaired users or not.

On another front, “Cinema Daisy”3) is a service that lets users listen to movies. Started in 2013, it provides sound recordings that contain explanations of the actions of characters and scene settings together with the original sound track. The Japan Braille Library had 133 Cinema Daisy titles as of August 2015. Users have said that the accurate explanations given by people familiar with the movies enable them to enjoy movies much more. However, when speech synthesis was used to expand the offerings of this service, there were many negative comments such as “the tone of the speech does not match the diversity in the scenes,” or “it made me lose interest in the good scenes.” This suggests that the quality of the explanations produced by speech synthesis is nowhere near the level of those given by people. Thus, in addition to efficiency, richness of expression is an important aspect for visually impaired users. We hope that in the future, varied expressions will be possible by changing the tone of speech synthesis according to the intention of the speech.

3. Barrier-free speech presentation technologies for visually impaired

The usual ways for visually impaired people to understand printed information are talking books4) and screen readers. Talking books are recordings of people reading books or magazines, and they can be borrowed at Braille libraries and some public libraries. Screen readers use speech synthesis on a personal computer (PC) to read text information in Web pages, e-mail, or other applications. Recently, screen readers have been developed for mobile terminals, so it is becoming more convenient for visually impaired people to obtain information in mobile environments as well5).

Currently, only about 10% of visually impaired people can read Braille6), so audio conversion methods are expected to become more and more important in the future.

3.1 Unified standards for talking books and ebooks for visually impaired

Talking books are audio recordings for people who have difficulty reading text, as stipulated in Article 37, Item 3 of the Copyright Act. The titles cover a wide range of genres, from magazines to textbooks. They are not just for the visually impaired and can be lent to people with learning or intellectual disabilities who have difficulty reading.

Talking books based on the Digital Accessible Information System (DAISY), which is an international standard for digital talking books, have become popular around the world7)8). Their content is organized so that the desired information can be accessed easily and intuitively by people with visual disabilities.

In recognition of the popularity of the fast playback function of eBooks, the DAISY standard has been incorporated into the EPUB3 international standard for eBooks9). It is very significant that an accessibility standard intended for a minority, those who are visually impaired, has been incorporated into a general standard for everyone. It shows promise for the improvement and expansion of the environment surrounding talking books and eBooks, without differentiating between those with and without disabilities.

3.2 Multi-media DAISY

DAISY was originally an audio-only standard, but it has now grown into an eBook standard capable of synchronized playback of text, video, and sound (Multi-media DAISY). It was developed so that, among the visually impaired in particular, children with poor vision and those with learning disabilities that make it difficult for them to read text can also enjoy reading books. As a result of the “2008 Act to promote the spread of educational books” and revisions to Article 33, Item 2 of the Copyright Act, it has become possible to produce large-type textbooks*2 and digitized multi-media DAISY textbooks10) for children and students with visual, learning, and other developmental disabilities.

Multi-media DAISY provides ways for users to gain information intuitively from the visual and audio components of content, such as flashing the text background to show clearly the correspondence between the text and what is being read out. It also utilizes the characteristics of digital content to enable properties such as the layout, size, and color of text to be changed freely in order to accommodate persons with varying degrees of disability or at different stages of learning.

3.3 Efficient speech listening technology

As mentioned in Section 2, people with visual disabilities need a technology for listening to speech efficiently, and variable-speed sound playback technology can meet that need. The DAISY devices popular in Japan support playback at 0.5 to three times normal speed, and their playback speed can be set in steps. Most of the screen readers used elsewhere support playback at 0.5 to two times normal speed. Indices of preferred playback speeds have been determined11) from a survey of screen reader use12) and by studying listening comprehension in visually impaired people. However, the requirements vary with individual differences in listening comprehension and with the purpose of use: some users say that playback at double speed would be enough if the sound quality could be improved, while others want even faster playback (up to three or even five times normal speed, for example). To handle such varied demands, methods for improving comprehensibility during high-speed playback of recorded speech and synthesized speech have been proposed.

In particular, for human speech, methods have been proposed that dynamically estimate the moment-by-moment changes in pitch and volume that affect comprehensibility during high-speed playback and ensure that each part is comprehensible13)14). For speech synthesis, a method has been proposed that identifies beforehand which sections of the input text are linguistically important and adjusts the speech synthesis to ensure comprehensibility15). The effectiveness of these methods for persons with visual disabilities is being experimentally evaluated.

One reason why there is demand for efficient listening methods is that most talking books consist of many hours of recorded speech. Magazines and novels range from approximately five to ten hours, and there are many that are more than 20 hours. Some users say that it takes more than a month to read one book, which is onerous. A feature that could reduce the time spent listening while ensuring the desired level of comprehension would be the key to making listening “in-a-moment” a counterpart to reading “at-a-glance”.
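To make the listening-time figures above concrete, the following minimal Python sketch computes how long a recording takes to hear at a given playback speed. The function name `listening_hours` is hypothetical, and the calculation assumes plain uniform speed-up with no change in pauses; it says nothing about whether the result is still comprehensible.

```python
def listening_hours(recorded_hours: float, speed: float) -> float:
    """Hours needed to hear a recording played at `speed` times normal."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return recorded_hours / speed


# Recording lengths from the text (10 h and 20 h) at the speeds
# that DAISY devices are said to support.
for speed in (1.0, 2.0, 3.0):
    print(f"{speed:.1f}x: 10 h title -> {listening_hours(10, speed):.1f} h, "
          f"20 h title -> {listening_hours(20, speed):.1f} h")
```

Even at three times normal speed, a 20-hour title still takes more than six hours to hear, which is one way to see why users ask for something closer to visual scanning.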

3.4 Developing inclusive applications for persons with and without disabilities

High-speed listening methods such as those described in Section 3.3 are also suitable for sighted persons. One example is the audio book. Audio books are recorded narrations of novels and nonfiction, lectures, and performances. They have been available in Japan since the 1980s, when they were sold on cassette, and have since gone through a number of changes in media16). With the recent spread of smartphones, audio book titles have begun to appear alongside music downloads. In April 2015, 16 major publishers in Japan joined together to form the Audio Book Association in an attempt to increase the popularity of this market. If listening to narrated books becomes a popular way of using books, we can also expect a demand for a scan-listening method similar to what is needed by visually impaired users.

This sort of service thus has the potential to be user-friendly for everyone. In the following sections, we introduce some of NHK’s initiatives to implement barrier-free technologies, including services for visually impaired users and examples of speech-rate conversion technology being used in broadcasting.

*2 Approved textbooks republished with text and figures enlarged for use by children and students with poor vision.

4. Adaptive speech-rate conversion technology

4.1 Principles of adaptive speech-rate conversion

As mentioned in Section 3.3, people with visual disabilities often use high-speed playback functions when listening to talking books and recorded audio and video. As such, it is desirable for the playback method to make the high-speed speech as easy to understand as possible.

Figure 1: Adaptive speech-rate conversion theory (high-speed playback example). The figure compares the original speech, uniform compression (high-speed playback always at the base speed), and adaptive speech-rate conversion (starting slower than the base speed, with the speed gradually increasing toward the end). By shortening gaps between speech segments, the time frame for the whole content remains the same as when using the base speed.

Our laboratory has developed an adaptive speech-rate conversion technology17). It slows down fast speech, such as in newscasts, so that it is easier to understand for the elderly and others. This technology was commercialized18) between 2002 and 2004, and built into radios and televisions. Besides producing high-quality converted speech, the technology can slow down speech without lengthening the time of the broadcast program. As it is intended for broadcast programs, it is designed to process speech sequentially once it has begun.

Besides the above application, adaptive speech-rate conversion should be able to make high-speed playback easier to comprehend. Suppose, for example, that the time frame the listener wants is half of the original program length. In this case, the overall speed must be double that of normal playback, but the adaptive speech-rate conversion actually makes the converted speech sound slower and more natural than if it were audio sped up uniformly by a factor of two.

The idea behind adaptive speech-rate conversion is shown in Figure 1. It starts slower than the baseline speed (the speed that corresponds to a uniform compression of the program duration into the target time frame) at the beginning of an utterance and increases to slightly faster than the base speed toward the end. It further reduces the overall duration to fit the target time frame by shortening intervals in which nothing is spoken, such as when the speaker is taking a breath. In addition, it can further shorten speech by clipping the endings of words where the volume drops off19).
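The time budget behind this idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of reference 17: the function names `plan_stretch` and `ramp`, the fraction of each pause that is kept (`pause_keep`), and the linear ramp with a fixed `spread` are all hypothetical choices. A "stretch" here is output duration divided by input duration, so 0.5 corresponds to double speed.

```python
def plan_stretch(utterances, pauses, speedup, pause_keep=0.2):
    """Fit speech and pauses into total_duration / speedup.

    utterances, pauses: original durations in seconds.
    pause_keep: assumed fraction of each pause that is retained.
    Returns (average stretch for speech, shortened pause durations).
    """
    total = sum(utterances) + sum(pauses)
    target = total / speedup
    kept_pauses = [p * pause_keep for p in pauses]
    speech_budget = target - sum(kept_pauses)
    if speech_budget <= 0:
        raise ValueError("target too short for the retained pauses")
    return speech_budget / sum(utterances), kept_pauses


def ramp(mean_stretch, chunks, spread=0.15):
    """Stretch per equal-length chunk of one utterance.

    Starts above the mean (slower speech) and ends below it (faster),
    while the average over the chunks stays equal to mean_stretch.
    """
    if chunks == 1:
        return [mean_stretch]
    return [mean_stretch * (1 + spread * (1 - 2 * i / (chunks - 1)))
            for i in range(chunks)]


utterances = [4.0, 6.0]   # seconds of speech
pauses = [1.0, 1.0]       # seconds of silence
stretch, kept = plan_stretch(utterances, pauses, speedup=2.0)
print(f"uniform stretch: {1 / 2.0:.2f}, adaptive speech stretch: {stretch:.2f}")
print("stretch per chunk:", [round(s, 3) for s in ramp(stretch, chunks=5)])
```

With these example numbers, the time saved on pauses lets the speech itself be compressed to 0.56 of its original length instead of 0.50, while the whole 12-second passage still fits into 6 seconds.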

4.2 Improving the performance of adaptive speech-rate conversion

Most of the content handled by adaptive speech-rate conversion is pre-recorded, so by reviewing the overall audio beforehand and optimizing the conversion, we can expect to achieve a smoother effect than the sequential processing described in Section 4.1.

As such, we have developed a technology that analyzes the overall changes in pitch and volume over time across the whole audio in order to determine the rate for each part13)14), rather than relying on information such as the start of utterances and pauses. For example, parts with higher-pitched and louder speech could be made slower than the baseline rate, while quieter parts could be made faster. Figure 2 shows a case in which the expansion rate is adaptively determined in this manner (the numbers in the figure indicate the expansion rate of each part).

This method uses prosodic characteristics of speech, such as accent and intonation. Such characteristics could also be used to distinguish differences in speech between individuals and/or different languages.
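As a rough illustration of how per-part expansion rates might be derived from pitch and volume, the Python sketch below gives higher and louder parts a larger share of the time and then rescales so that the total duration matches uniform compression. The scoring formula, the `strength` parameter, and the function name `expansion_rates` are assumptions made for this example; the actual analysis in references 13 and 14 is not described in this article.

```python
def expansion_rates(segments, speedup, strength=0.3):
    """Per-segment expansion rates (output duration / input duration).

    segments: list of (duration_s, pitch_hz, volume_db) tuples.
    Higher-pitched, louder segments get larger rates (slower playback);
    the duration-weighted result still sums to total_duration / speedup.
    """
    pitches = [p for _, p, _ in segments]
    volumes = [v for _, _, v in segments]

    def norm(x, xs):
        # Map x to the range 0..1 within this recording.
        lo, hi = min(xs), max(xs)
        return 0.5 if hi == lo else (x - lo) / (hi - lo)

    # Raw weights lie between 1 - strength and 1 + strength.
    raw = [1 + strength * (norm(p, pitches) + norm(v, volumes) - 1)
           for _, p, v in segments]
    total = sum(d for d, _, _ in segments)
    scale = total / sum(d * w for (d, _, _), w in zip(segments, raw))
    return [w * scale / speedup for w in raw]


segments = [(1.0, 220, 70), (1.0, 180, 60), (2.0, 120, 50)]
rates = expansion_rates(segments, speedup=2.0)
print([round(r, 3) for r in rates])
print("output duration:", round(sum(d * r for (d, _, _), r in zip(segments, rates)), 3))
```

The rescaling step is what keeps the time frame identical to uniform playback at the same overall speed; only the distribution of time among the parts changes.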

Figure 2: Advanced adaptive speech-rate conversion function. For a single phrase, the figure plots voice pitch level and voice volume, marks parts that are high and loud, high and quiet, and low and quiet, and compares the expansion rates of each part: uniform 1, 1, 1, 1, 1, 1; adaptive 1.3, 1.2, 1, 1.3, 1.2, 0.9. The time frame is the same but the speech is easier to understand. At n-times speed, each part is lengthened by 1/n.

5. Adaptive speech-rate conversion applications

5.1 Internet radio news service allowing the listener to select the rate

NHK began offering a new on-demand service19) on its NHK Online website in March 2004. The new service allows users to adjust the speech rate of news broadcasts (those delivered in the past 24 hours on Radio 1; see Figure 3). In addition to the “Normal” rate, users can select a “Slow” rate, which plays back the news in 120 percent of the actual broadcast time, or a “Fast” rate, which plays it back in 60 percent of the actual time. The “Fast” rate uses the adaptive speech-rate conversion discussed in Section 4.1. This Internet service is complementary to broadcasting, and it could not have been implemented within the conventional broadcasting framework.

Figure 3: Radio news service with selectable speed (http://www.nhk.or.jp/r-news/)
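The relationship between the share of broadcast time and the overall playback speed for these settings is simple arithmetic, sketched below in Python. The function name `speed_factor` is hypothetical, and treating “Normal” as 100 percent of the broadcast time is an assumption; the 120 and 60 percent figures come from the text.

```python
def speed_factor(percent_of_broadcast_time: float) -> float:
    """Overall playback speed implied by a playback-time percentage."""
    if percent_of_broadcast_time <= 0:
        raise ValueError("percentage must be positive")
    return 100.0 / percent_of_broadcast_time


for label, pct in (("Slow", 120), ("Normal", 100), ("Fast", 60)):
    print(f"{label}: {pct}% of broadcast time -> {speed_factor(pct):.2f}x overall; "
          f"a 10-minute bulletin takes {10 * pct / 100:.0f} min")
```

These are overall factors only. With adaptive conversion, the spoken parts themselves can run somewhat slower than the overall factor suggests, because part of the time saving comes from shortened pauses.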

According to the access records, when the service started, about 70% of users selected the “Normal” rate, and the rest were split nearly equally between “Slow” and “Fast”. However, the proportion of users selecting “Fast” increased over time, so that several years later, it accounted for approximately the same proportion as “Normal”. The proportion selecting “Slow” has remained approximately the same since the beginning. Because the “Fast” setting can be used to understand news content more efficiently, both visually impaired people and the general public actively use it.

5.2 Expanding its usefulness

Using this type of speech-rate conversion technology on PCs and mobile terminals would make a variety of new applications possible, and more people would have a chance to experience how effective it is. Accordingly, we created applications of the sequential processing system described in Section 4.1, which operates immediately on speech input, for Windows OS, iOS, and Android.

For Windows PCs, we prototyped an applet that operates as a plugin for the Windows Media Player (Figure 4). The applet changes the speed of video and speech and synchronizes them. It supports playback at 0.5 to four times normal speed and works on video and audio content stored on hard disk or delivered from the Internet.

NHK released a mobile terminal applet called “Gogaku Player” in 2011 (Figure 5). The applet varies the speed between 0.5 and three times normal. Users can listen to content from “Radio English Conversation” and other language programs at a speed suited to their skill level. It also supports content such as Podcasts*3 and audio books. A survey of users who had used the application for language study indicated that most selected speeds between 0.8 and 1.2 times normal. The technology can be used to play back a variety of content, and many users commented that they hoped the functionality could be extended to playback at four times normal speed or higher.

In January 2015, this system was incorporated into “Language Reader,” an ebook distributed by NHK. The latest ebook standard (EPUB3) also incorporates the DAISY standard. Thus, the popularization of ebooks for the general public should also make talking books more convenient for visually impaired users.

Figure 4: Screenshot of Windows Media Player with built-in adaptive speech-rate conversion function

*3 A mechanism by which mobile players and PCs can load programs through the Internet so they can be watched anywhere.


6. Conclusion

This article has given an overview of the trends in speech presentation technologies for the visually impaired and introduced NHK’s research and development on speech-rate conversion technology that makes it easier for people to comprehend speech in programs.

Even though there have not been any major changes in speech presentation technologies for the visually impaired in the past several years, the fact that they can now be used easily on terminals that fit in the palm of one’s hand is, in itself, a major change. This makes it possible to provide and guarantee information using advanced technologies such as the adaptive speech-rate conversion introduced here, as well as speech synthesis and speech recognition. In other words, we have reached an age in which anyone can retrieve information in a form that meets their individual needs. In our laboratory, we intend to continue our work to make broadcasting barrier-free.

The technical developments and applications discussed in Sections 4.2 and 5.2 are the products of a collaboration with NHK Engineering Systems Inc.

(Atsushi Imai, Toru Takagi†; †NHK Engineering Systems)

Figure 5: The “Gogaku Player” mobile phone applet with speech-rate conversion function

References

1) Ministry of Internal Affairs and Communications: “Voice communication usage in Japan from the perspective of communication volume,” http://www.soumu.go.jp/menu_news/s-news/01kiban03_02000271.html (in Japanese)

2) Ministry of Internal Affairs and Communications: “2012 Survey of Trends in Telecommunications Use,” http://www.soumu.go.jp/menu_news/s-news/01tsushin02_02000058.html (in Japanese)

3) Nippon Lighthouse Culture Center, http://www.iccb.jp/onseiguide/cinema_daisy/ (2015) (in Japanese)

4) Japan Braille Library Web page, http://www.nittento.or.jp/about/scene/recording.html (2015) (in Japanese)

5) For example: Android Accessibility, http://eyes-free.googlecode.com/svn/trunk/documentation/android_access/index.html (2015) (in Japanese)

6) Ministry of Health, Labour and Welfare: “Results of 2006 Survey of Conditions for Physically Disabled Children and Adults,” http://www.mhlw.go.jp/toukei/saikin/hw/shintai/06/ (in Japanese)

7) DAISY Consortium, http://www.daisy.org (2015)

8) ANSI/NISO Z39.98-2012, http://www.daisy.org/z3998/2012/z3998-2012.html

9) DAISY Consortium, http://www.daisy.org/daisy-epub-3-developments (2015) (in Japanese)

10) DAISY Research Center, http://www.dinf.ne.jp/doc/daisy/book/daisytext.html (in Japanese)

11) C. Asakawa, H. Takagi, S. Ino, T. Ifukube: “The Optimal and Maximum Listening Rates in Presenting Speech Information to the Blind,” Journal of the Human Interface Society, Vol. 7, No. 1, pp. 105-111 (2005) (in Japanese)

12) T. Watanabe: “A Study on Voice Settings of Screen Readers for Visually-Impaired PC Users,” IEICE Journal, D-I, J88-D-I(8), pp. 1257-1260 (2005) (in Japanese)

13) A. Imai, N. Tazawa, Y. Iwahana, T. Takagi, N. Seiyama, T. Tanaka, T. Ifukube: “Intelligible High-speed Playback Technology Using the Acoustic Features of Speech Prosody,” ITE Journal, Vol. 66, No. 7, pp. 214-220 (2012) (in Japanese)

14) Tazawa et al.: “A fast speech-rate conversion technology to assist efficient information acquisition for visually impaired persons,” IEICE Technical Report, WIT, Vol. 112, No. 223, pp. 57-61 (2012) (in Japanese)

15) Torihara: “‘Scan-listening’ System [‘Fast listening’ system for the visually impaired, using syntactic information],” IEICE Technical Report, 5th Conference on Social Welfare Information Engineering, WIT00-28 (2000) (in Japanese)

16) Shincho CD, http://www.shinchosha.co.jp/mediashitsu/ (in Japanese)

17) A. Imai et al.: “An Adaptive Speech-Rate Conversion Method for News Programs without Accumulating Time Delay,” IEICE Journal, A, Vol. J83-A, No. 8, pp. 935-945 (2000) (in Japanese)

18) A. Imai, T. Takagi and H. Takeishi: “Development of Radio and Television Receiver with Functions to Assist Hearing of Elderly People,” IEEE Trans. Consumer Electronics, Vol. 51, No. 1, pp. 268-272 (2005)

19) A. Imai, T. Takagi, K. Kurozumi, R. Koyama, T. Shimazu: “A New Internet Radio News Service Using Speech Rate Conversion Technology,” ITE Journal, Vol. 59, No. 2, pp. 265-270 (2005) (in Japanese)
