Post on 20-Oct-2020
Page 1 of 36
salsman-culnane-specification 6/2/17, 4:47 PM
PRONUNCIATION ASSESSMENT FOR INTELLIGIBILITY REMEDIATION
Utility Patent Specification
by James Salsman, Fort Lupton; and Lance Culnane, Westminster, both of Colorado.
April 22, 2017
BACKGROUND CITATIONS
U.S. Patent Documents:
5,679,001: Russell, et al. (1997) “Children's speech training aid.”
5,920,838: Mostow, et al. (1999) “Reading and Pronunciation Tutor.”
6,634,887: Heffernan, III, et al. (2003) “Methods and Systems for Tutoring Using a
Tutorial Model with Interactive Dialog.”
6,963,841: Handal et al. (2005) “Speech Training Method with Alternative Proper
Pronunciation Database.”
7,752,045: Eskenazi, et al. (2010) “Systems and Methods for Comparing Speech
Elements.”
8,109,765: Beattie, et al. (2012) “Intelligent Tutoring Feedback.”
8,271,281: Jayadeva, et al. (2012) “Method for Assessing Pronunciation Abilities.”
8,744,856: Ravishankar (2014) “Computer implemented system and method and
computer program product for evaluating pronunciation of phonemes in a language.”
9,520,068: Beattie, et al. (2016) “Sentence Level Analysis in a Reading Tutor.”
Other References:
Chen and Li (2016) “Computer-assisted pronunciation training: From pronunciation
scoring towards spoken language learning,” in Proceedings of the 2016 Asian-Pacific
Signal and Information Processing Association (APSIPA) Annual Summit and
Conference:
http://www.apsipa.org/proceedings_2016/HTML/paper2016/227.pdf
Cole, et al. (1999) “A platform for multilingual research in spoken dialogue systems,” in
Proceedings of the Multilingual Interoperability in Speech Technology Conference
(Leusden, Netherlands.)
http://www.cslu.ogi.edu/people/hosom/pubs/cole_MIST-platform_1999.pdf
Hawkins, J.A.; and Filipović, L. (2012) Criterial Features in L2 English: Specifying the
Reference Levels of the Common European Framework (United Kingdom: Cambridge
University Press.)
https://drive.google.com/open?
id=0B73LgocyHQnfcEVacmZRc2xEQ3VIZ0tkMHNmdjhNOXVsS1VR
Huggins-Daines, et al. (2006) “Pocketsphinx: A free, real-time continuous speech
recognition system for hand-held devices.” Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
https://www.cs.cmu.edu/~awb/papers/ICASSP2006/0100185.pdf
Kibishi, et al. (2014) “A statistical method of evaluating the pronunciation proficiency/
intelligibility of English presentations by Japanese speakers,” ReCALL (European
Association for Computer Assisted Language Learning) doi:10.1017/
S0958344014000251,
http://www.slp.ics.tut.ac.jp/Material_for_Our_Studies/Papers/shiryou_last/e2014-
Paper-01.pdf
Loukina, et al. (2015) “Pronunciation accuracy and intelligibility of non-native speech,”
in InterSpeech-2015, the Proceedings of the Sixteenth Annual Conference of the
International Speech Communication Association (Dresden, Germany: Educational
Testing Service)
http://www.oeft.com/su/pdf/interspeech2015b.pdf
Panayotov, V., et al. (2015) "LIBRISPEECH: an ASR Corpus Based on Public Domain
Audio Books," Proceedings of the IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP 2015),
http://www.danielpovey.com/files/2015_icassp_librispeech.pdf
Proceedings of the International Symposium on Automatic Detection of Errors in
Pronunciation Training, June 6–8, 2012, KTH, Stockholm, Sweden.
http://www.speech.kth.se/isadept/ISADEPT-proceedings.pdf
Proceedings of the Workshop on Speech and Language Technology in Education,
September 4–5, 2015 (Satellite Workshop of Interspeech 2015 and the ISCA Special
Interest Group SLaTE) Leipzig, Germany:
https://www.slate2015.org/files/SLaTE2015-Proceedings.pdf
Ronanki, S.; Salsman, J. and Bo, L. (December 2012) “Automatic Pronunciation
Evaluation And Mispronunciation Detection Using CMUSphinx,” in the Proceedings of
the 24th International Conference on Computational Linguistics (Mumbai, India:
COLING 2012) pp. 61–67.
http://www.aclweb.org/anthology/W12-5808
Salsman, J. (July 2014) “Development challenges in automatic speech recognition for
computer assisted pronunciation teaching and language learning” in Proceedings of the
Research Challenges in Computer Aided Language Learning Conference (Antwerp,
Belgium: CALL 2014.)
http://talknicer.com/Salsman-CALL-2014.pdf
Computer-Assisted Pronunciation Teaching (CAPT) Bibliography:
http://liceu.uab.es/~joaquim/applied_linguistics/L2_phonetics/CALL_Pron_Bib.html
FIELD OF THE INVENTION
This invention relates to the field of computer-assisted pronunciation training (CAPT)
using automatic speech recognition for language learning, speech language pathology,
and reading tutoring, such as described by Russell, et al. (1997) “Children's speech
training aid,” U.S. Patent 5,679,001. The assessment and remediation of the authentic
intelligibility of learners' spoken language, as measured by agreement with panels of
non-expert word transcriptionists including both native and non-native language listeners,
provides substantial advantages over the current state of the art, which instead typically
assesses formal pronunciation agreement with a panel of native language listener
pronunciation experts: those formal mispronunciations are associated with only 16% of
the measured authentic intelligibility of words, according to Loukina, et al. (2015.)
DISCUSSION OF PRIOR ART
While Kibishi et al. (2014) have demonstrated the achievement of 75% agreement with
authentic word transcription, even earlier work by Ronanki, Salsman, and Bo (2012)
produced open source software implementing means of more precise discrimination
between consequential and incidental errors by allowing accent and dialect adaptation
using physiologically neighboring phones (phonemes and diphones) derived from the
adjacency of vocal tract components, e.g., in the positions and configuration of the lips,
teeth, tongue, jaw, vocal folds, nasal flap, and diaphragm.
As stated in Salsman (2014), “To best support language instruction, we have been
developing the use of physiologically neighboring phonemes, i.e., sounds produced with
similar vocal tract articulations, to identify and discern between serious
mispronunciations and incidental errors (Ronanki et al., 2012.) We are using diphones,
i.e. the last half of one phoneme followed by the first half of the next, as alternatives and
supplements to phonemes and triphones for both automatic speech recognition and
pronunciation scoring (Cole, et al., 1999.) We plan to model learner fluency and select
the sequence of self-study practice exercises using cumulative diphone scores. We are
scoring segment durations to indicate syllables and words pronounced too quickly
relative to exemplary pronunciations. We have measured substantial potential
improvements from all of these techniques.
“The language instructor’s experience of computer-assisted pronunciation assessment can
be enhanced by offering comparisons of students’ utterances to exemplary pronunciations
in ways that illustrate the measurements of physiologically neighboring phonemes,
diphones, and speech segment durations. For example, mispronunciations might be
annotated with International Phonetic Alphabet symbols for both the expected
pronunciation and its physiologically neighboring phoneme which most closely matched
the observed speech. Diphones can be used to highlight difficult phonetic transitions, for
example when two adjacent phonemes are both mispronounced. Duration scoring can
annotate not just words and sub-word segments given insufficient emphasis, e.g. such as
might confuse ‘fourteen’ with ‘forty,’ and can highlight missing glottal stops essential to
discern, for example, ‘harder’ from ‘hard or.’”
Pronunciation assessment and CAPT responses should be based on at least 44 exemplary
pronunciations for each response word or phrase, representing both genders, two age
groups (such as speakers in their 20s and 50s), and, for English, at least eleven
geographic regions, in order to provide worldwide English accent and dialect adaptation
coverage. For
English, such exemplary pronunciations should be recorded from native language
speakers selected from, for example, Australia, Canada, Ireland, New Zealand, South
Africa, London (Standard Southern English), London (Cockney), London (Received
Pronunciation), Birmingham, Cornwall, East Anglia, East Yorkshire, North Wales,
Edinburgh, Ulster, Dublin, Boston, Midwestern US (i.e., in or west of Michigan,
Pennsylvania, Missouri, or New Mexico), New England, New York City, and the
Southern U.S. Gulf Coast region.
Learner analytics (scoring pronunciation for CAPT and grading authentic intelligibility)
may include log-normal means and variances of phoneme, diphone, word, and phrase
acoustic scores and durations, along with cumulative phoneme and diphone scores;
mispronunciations ranked by consequential interference with intelligibility for each word
in an utterance and for the whole utterance; tonality scores for tonal languages; language
grammar, morphology, and vocabulary criterial feature coverage scores; and subject
matter topic correctness and coverage scores.
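The log-normal duration statistics above can be sketched as follows. This is a minimal illustration, not the system's implementation; the exemplar durations are made-up example values.

```python
import math

def lognormal_stats(durations):
    """Mean and variance of the log-durations, i.e., the parameters of a
    log-normal model fit to exemplar segment durations (in seconds)."""
    logs = [math.log(d) for d in durations]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return mu, var

def duration_zscore(duration, mu, var):
    """How many standard deviations a learner's segment duration falls
    from the exemplar log-normal mean; a large negative value flags a
    segment spoken too quickly relative to exemplary pronunciations."""
    return (math.log(duration) - mu) / math.sqrt(var) if var > 0 else 0.0

# Illustrative exemplar durations for one phoneme across five exemplars:
exemplars = [0.08, 0.10, 0.09, 0.11, 0.10]
mu, var = lognormal_stats(exemplars)
z = duration_zscore(0.04, mu, var)  # a suspiciously short realization
```

A strongly negative z-score on a syllable or word would mark it as pronounced too quickly relative to the exemplary pronunciations.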
The intelligibility scoring system should agree with a panel of non-expert, authentic
native and non-native language word transcriptionists. Beyond the logistic regression of
word intelligibility by such transcriptionists, other machine learning techniques may
include, but are not limited to, those of Kibishi et al. (2014), such as symbolic regression,
general and nonlinear regression, classification, artificial neural networks, support vector
machines, learning vector quantization, or self-organizing maps. Quality assurance
should be performed by measuring the extent to which the resulting intelligibility scores
match those of an actual panel of such non-expert native and non-native word
transcriptionists, preferably using blind or double-blind analysis. Transcriptionist data
may be enhanced with automatic spelling correction. Intelligibility determination may be
enhanced with word frequency-based phonological similarity measures of speech
ambiguity.
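As a sketch of the logistic regression mentioned above, the tiny gradient-descent model below predicts whether a non-expert transcriptionist recovers the intended word from two per-word features, a normalized acoustic match score and an absolute duration z-score. Both the features and the training labels are illustrative assumptions, not data from any actual transcription panel.

```python
import math

def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))

def fit_logistic(features, labels, lr=0.5, epochs=500):
    """Stochastic gradient descent for logistic regression;
    returns [bias, w1, w2, ...]."""
    w = [0.0] * (len(features[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            g = p - y  # gradient of the log-loss w.r.t. the activation
            w[0] -= lr * g
            for i, xi in enumerate(x):
                w[i + 1] -= lr * g * xi
    return w

def predict(w, x):
    """Probability that the word is transcribed correctly (intelligible)."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# (acoustic match score, |duration z-score|) -> transcribed correctly?
X = [(0.9, 0.2), (0.8, 0.5), (0.3, 2.5), (0.2, 3.0), (0.85, 0.1), (0.25, 2.8)]
y = [1, 1, 0, 0, 1, 0]
w = fit_logistic(X, y)
```

Quality assurance, as described above, would compare such predicted probabilities against the word-recovery rates of an actual transcriptionist panel.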
Learner remediation may include audio and visual feedback using expected and observed
phones and their durations to show vocal tract sagittal sections and front-facing lip static
graphic diagrams and animations along with spoken audio and text describing corrective
vocal tract motions in the learner's preferred language with examples in that language.
OBJECTIVES AND ADVANTAGES
The invention eliminates pronunciation assessment feedback which does not involve a
consequential mispronunciation interfering with the student's authentic intelligibility, and
provides feedback as a pair of audio words in the learner's first language, the first
containing the correct phoneme and the second containing the mistaken sound produced.
To achieve those goals, we collect transcriptions of learner utterances. For example, while
displaying, “Please listen to this phrase and type in the English words you hear,” play this
audio for the phrase: “I'm here on behalf of the Excellence Hotel group.” For this
example, let's say that in the audio, “behalf” was mispronounced as “beh-alf” and
“Excellence” was mispronounced as “Excellent” but everything else was good. The
learner types in the text: “I'm here on behalf of the excellent hotel group.” (I.e., the
transcribing advanced learner gets “behalf” right, but doesn't transcribe Excellence
correctly because it was mispronounced.) The system sees that “Excellence” was not
transcribed correctly, while the SR system reports two mispronunciations. Therefore, the
database entry for this phrase is updated to tally the corresponding phonemes in
“behalf” as inconsequential, but the final phoneme /s/ in “excellence” as consequential if
mispronounced.
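The tally update in the example above might be sketched as follows; the flagged (word, phoneme) pairs and the dictionary used as a tally store are hypothetical stand-ins for the speech recognizer's output and the phrase database:

```python
def update_tallies(tallies, transcribed_words, flagged):
    """For each (word, phoneme) the SR system flagged as mispronounced,
    count it as consequential if the transcriptionist failed to recover
    the word, or inconsequential if the word was recovered anyway."""
    transcribed = {w.lower() for w in transcribed_words}
    for word, phoneme in flagged:
        key = (word, phoneme)
        tallies.setdefault(key, {"consequential": 0, "inconsequential": 0})
        outcome = "inconsequential" if word.lower() in transcribed else "consequential"
        tallies[key][outcome] += 1
    return tallies

tallies = {}
typed = "I'm here on behalf of the excellent hotel group".split()
# SR reported two mispronunciations: /hh/ in "behalf", final /s/ in "excellence"
update_tallies(tallies, typed, [("behalf", "hh"), ("excellence", "s")])
```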
After sufficient data is collected, inconsequential mispronunciations can be ignored. The
database of prompting phrases will have a probability associated with each phoneme, by
which we can scale (or "weight," per Figure 2) each mispronunciation's acoustic score
to establish the cut-off point below which the scaled values will not be
scored as wrong, e.g. by displaying the word as green or yellow instead of orange or red.
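Assuming a normalized per-word mispronunciation score in [0, 1] (higher meaning a worse acoustic match; real acoustic scores would first need normalization) and the per-phoneme consequence probability from the phrase database, the scaling and color cut-off might be sketched as below. The threshold values are illustrative only.

```python
def feedback_color(mispronunciation_score, consequence_probability,
                   yellow=0.25, orange=0.5, red=0.75):
    """Weight the mispronunciation score by the probability that the
    error is consequential, then bucket the scaled value into the
    green/yellow/orange/red display colors."""
    scaled = mispronunciation_score * consequence_probability
    if scaled < yellow:
        return "green"
    if scaled < orange:
        return "yellow"
    if scaled < red:
        return "orange"
    return "red"

# A badly pronounced but rarely consequential phoneme is not flagged:
mild = feedback_color(0.9, 0.1)
# The same score on a highly consequential phoneme is flagged strongly:
severe = feedback_color(0.9, 0.95)
```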
Using a recorded audio library of words in each learner's first language, each containing
a given phoneme near the front, instead of showing green/yellow/orange the audio
recording of a word in, e.g., Spanish which starts with an /s/ sound can be played. For example, a
recording saying in Spanish audio, “When you said excellence [that target word in
English] you needed the sound that [a Spanish word starting with /s/] starts with, but
instead you pronounced the sound [a Spanish word starting with /t/] starts with. Listen to
what you said. [Playing the audio of the learner's mispronounced word.] You were
supposed to say excellence [the word in English again]. Click replay to hear this again,”
can be played while displaying the word “Excellence” and, e.g., two buttons labeled
Replay and Continue.
The specific advantageous improvements of the invention include:
Learner analytics: Learners are scored by any combination of the quality and
intelligibility of their phoneme, diphone, syllable, and word production; their word and
phrase comprehension, and their ability to both comprehend and use grammatical forms,
word stem morphology, "can-do" criteria for both production and
comprehension, and other criterial aspects of the instructional interactions (please see, for
example, Hawkins and Filipović, 2012.) In addition to accuracy for each of those aspects,
the learner's confidence, effort, and independence are measured too. For example,
confidence can be self-reported, derived from vocal and timing features, or both. Effort
corresponds to the number and duration of attempts to perform exercises. And
independence can be measured by the number and frequency of learner requests for help.
Integrated content development system: Both instructors and peer learners can add to
and extend branching scenario instructional interactions, i.e., multiple-choice response
instructional content such as is used in the Twine Twee formalism or "Choose
Your Own Adventure" role-play interactions. This branching scenario instructional
content can be added and removed by editing the database of interactions in a manner
similar to editing a wiki such as Wikipedia or Wiktionary.
Phonetic disambiguation of homographs: Homographs (equivalently, heterophones,
meaning words that are spelled identically but pronounced differently, such as the
present and past tenses of the word “read”) are automatically presented for
disambiguation as an integrated part of the instructional content development subsystem.
This allows instructors and peer learners to encode the prompting response phrases of
their instructional content, of which there are typically three per branching scenario
node, although there can be any natural number: zero responses ends the instructional
interaction module; one response requires the production of a particular prompted
response; and two or more choices allow for transitions to (usually other) nodes.
Part of speech labeling: The instructional interaction development support subsystem
also assists in labeling the part of speech (e.g., noun, verb, article, adjective, conjunction,
preposition, adverb, etc.) of each word of the prompt phrases in new instructional content
to assist with pronunciation assessment for intelligibility remediation.
Peer consensus-based validation of instructional content: Each node and each transition
between nodes in the branching scenario instructional interactions are separately
validated by instructor data entry and review or peer learner review or both.
Caching stand-alone exercises for offline execution: The system network interface
caches both instructional interactions during download and their results in nonvolatile
storage, so that the system will still be usable when disconnected from the network or
when downloads or uploads or both are inhibited. The entire system can thereby perform
in a manner consistent with stand-alone operation compatible with free, freemium, or
paid content accession models.
Extensible vocabulary: Each of the prompting phrases is composed of one or more
words, each of which is in turn composed of one or more syllables, diphones, and
phonemes. The number and type of words may be increased by length, subject matter,
vocational or other topic, geography, languages, morphological features, and other
aspects.
Extensible prompting phrases: The number and type of prompting phrases associated
with each of the branching scenario transitions may be increased by length, subject
matter, vocational topic, geographies, languages, grammatical features, "can-do" criteria,
and other criteria and aspects. The branching scenario interaction modules in which the
transitions are contained may similarly be increased by each of those aspects.
Instructional interaction sequencing: A registration and sign-in system which records
each learner's proficiency with each phoneme, diphone, and word, along with other
learner analytics in the system, allows the instructional content modules, such as
branching scenarios and prompting phrases, which the learner most needs to practice
to be selected and provided in sequence. While the sequence is often determined by the
branching scenario interaction transitions, sequencing can also be performed with
adaptive instruction, by selecting prompting phrases based on how much the learner
analytics database indicates that the learner needs to practice the words or criterial
aspects contained in the selected phrases.
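The adaptive selection step can be sketched as below, assuming the learner analytics database supplies cumulative per-diphone scores in [0, 1]; the phrases, diphones, and scores are illustrative:

```python
def phrase_need(phrase, diphone_scores, phrase_diphones, default=0.0):
    """Average practice need over the diphones a phrase exercises;
    a lower cumulative score means a higher need."""
    dps = phrase_diphones[phrase]
    return sum(1.0 - diphone_scores.get(dp, default) for dp in dps) / len(dps)

def next_phrase(phrases, diphone_scores, phrase_diphones):
    """Select the prompting phrase with the greatest practice need."""
    return max(phrases, key=lambda p: phrase_need(p, diphone_scores, phrase_diphones))

scores = {"th-ih": 0.3, "s-t": 0.9, "ih-s": 0.85}  # cumulative diphone scores
phrase_diphones = {"this is": ["th-ih", "ih-s"], "stop it": ["s-t"]}
chosen = next_phrase(list(phrase_diphones), scores, phrase_diphones)
```

Here the learner's weak "th-ih" diphone makes "this is" the phrase selected for practice next.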
Collecting exemplar and student pronunciation audio recordings: The instructional
interaction development subsystem also includes support for collecting, evaluating the
authentic intelligibility of, and storing audio recordings from students, instructors, and
paid voice artists.
Collecting transcriptions of both first and subsequent language transcriptionists from
recorded phrases: Both the instructional interactions and the interaction development
system collect transcriptions of the words that both native and foreign speakers can hear
when they listen to recorded audio from instructors, voiceover artists, and learners. Such
transcriptions are scored by the extent to which they match the words that the speaker
was trying to say when recording the audio.
Authentic intelligibility remediation: This groundbreaking technique was developed
independently by researchers and software engineers in Japan and at the U.S. Educational
Testing Service; please see Kibishi, et al. (2014) and Loukina, et al. (2015.) This
advantage is a monumental improvement over the commercial state of the art, much if
not most of which is two or three substantial generations behind (see Figure 2.) The
invention's specific remediation process emphasizes audio feedback of spoken words in
the learners' first language containing the sounds of the correct and mistaken
pronunciations, as opposed to merely visual feedback alone.
Multiple pass automatic speech recognition: The learner analytics assessment process
includes the temporal endpoints (and thus the durations) and acoustic scores of the words,
syllables, diphones, and phonemes of each speech segment in the prompting phrases,
using anomalous durations of those segments to guide multiple passes of automatic
speech recognition against the audio input, with different speech recognition grammars
representing utterance expectations and different overall endpoints.
Speech-language pathology reporting: The reports, statistics, and alerts produced from
the learner analytics are designed to provide data in the terms, manner, form, order, and
with the information contained in reports familiar to practicing speech-language
pathologists. However, the same reports are also annotated and provided with context
available by, for example, clickable links to additional text, or similar explanatory
information such that the learners themselves, their teachers, parents, school
administrators, and peers can understand and interpret those reports, statistics, and alerts
produced from the analytics database.
BRIEF DESCRIPTION OF DRAWINGS
Figure 1 depicts the databases and dataflow for the voice-response instructional
application, comprising a client-server networked computer system composed of: (#1) an
integrated instructional interaction development system; (#2) an instructional interaction
database server process and database; (#3) an interaction and prompting phrase selection
server process; (#4) a network connection from the server to the client; (#5) a client
computer system which may include a web browser in which the client software is
implemented; (#6) an instruction delivery application composed of: (#7) an interaction
and prompting phrase selection client process, (#8) a display for interaction multimedia
and prompting phrases, (#9) a microphone for speech audio input and recording, and
(#10) a client process to record speech and determine learner analytics; (#11) a network connection
from the client to the server, (#12) a server process to update speech recognition results
and learner analytics; (#13) a learner analytics database server process and database;
(#14) a server process to calculate and update learner analytics results, reports, and
statistics; and (#15) a server process to produce, display, and send reports, statistics, and
alerts.
Figure 2 depicts the motivation for collecting intelligibility transcriptions, as opposed to
text-independent pronunciation assessment or pronunciation assessment based solely on
exemplar pronunciations of students or voiceover talent.
Figure 3 depicts an example use of logistic regression for intelligibility remediation.
Figure 4 depicts the main database records in an asynchronous intelligibility remediation
peer learning and data collection system.
Figure 5 depicts learner analytics-based instructional prompting phrase sequencing and
branching scenario transitions.
DESCRIPTION OF THE PREFERRED EMBODIMENT
In its preferred embodiment, the invention consists of software modules to extend
software systems such as Moodle, a free open source instructional course management
system, Wikipedia, a free open editable online encyclopedia, Wiktionary, a free open
editable online dictionary, or Wikiversity, a free open editable online instructional course
creation system. The user of such software, who typically intends to learn the meaning,
pronunciation, grammar, morphology, and associated aspects of words and phrases, will
be shown user interface elements to allow audio recording and subsequent evaluation of
the audio phrase.
For example, a Wiktionary user may be presented with buttons labeled "Record," "Stop,"
"Play," "Evaluate," and, "Try in phrase." The Record button would begin storing audio
data from the microphone, perhaps with a visual audio level meter indicator. The Stop
button would terminate the recording, the Play button would allow the learner to listen to
the recording, perhaps to ascertain the loudness of background noise in order to decide
whether to evaluate the recording. The Evaluate button would perform the pronunciation
assessment and determine the intelligibility of the phrase, and use that information to
select, compose, and produce audio or visual feedback or both, for the learner to review
in order to remediate their pronunciation intelligibility issues that could be identified.
Finally, the "Try in phrase" button should provide an opportunity for the learner to
practice the word in a phrase, and may link the user to a registration and sign-in system
which records their proficiency with each phoneme, diphone, word, and phrase in the
system so that the exercises which the learner needs to practice the most can be provided
to them in a sequence beginning with trying to pronounce the word in a phrase.
OPERATION AND EXPLANATION
One well-known automatic speech recognition system capable of providing the data on
which the processes of the invention rely is the Carnegie Mellon Sphinx Speech
Recognition Project’s PocketSphinx free open source software described in Huggins-
Daines, et al. (2006.) The operation of the PocketSphinx system to provide pronunciation
assessment data is described on this CMUsphinx Wiki page tutorial describing the use of
PocketSphinx for pronunciation evaluation:
https://cmusphinx.github.io/wiki/pocketsphinx_pronunciation_evaluation
One of the most important advances of the invention over essentially all of the prior art is
the use of physiologically neighboring phonemes, which are shown on that wiki
page as the following file encoding the speech recognition results grammar, composed of
the physiologically neighboring phonemes of the word “with,” along with those of
the other phonemes in alphabetical order:
#JSGF V1.0;
grammar neighbors;
public <with> = sil <w> <ih> <dh> [sil];
<aa> = aa | ah | er | ao;
<ae> = ae | eh | er | ah;
<ah> = ah | ae | er | aa;
<ao> = ao | aa | er | uh;
<aw> = aw | aa | uh | ow;
<ay> = ay | aa | iy | oy | ey;
<b> = b | p | d;
<ch> = ch | sh | jh | t;
<dh> = dh | th | z | v;
<d> = d | t | jh | g | b;
<eh> = eh | ih | er | ae;
<er> = er | eh | ah | ao;
<ey> = ey | eh | iy | ay;
<f> = f | hh | th | v;
<g> = g | k | d;
<hh> = hh | th | f | p | t | k;
<ih> = ih | iy | eh;
<iy> = iy | ih;
<jh> = jh | ch | zh | d;
<k> = k | g | t | hh;
<l> = l | r | w;
<m> = m | n;
<ng> = ng | n;
<n> = n | m | ng;
<ow> = ow | ao | uh | aw;
<oy> = oy | ao | iy | ay;
<p> = p | t | b | hh;
<r> = r | y | l;
<s> = sh | s | z | th;
<sh> = sh | s | zh | ch;
<t> = t | ch | k | d | p | hh;
<th> = th | s | dh | f | hh;
<uh> = uh | ao | uw | uw;
<uw> = uw | uh | uw;
<v> = v | f | dh;
<w> = w | l | y;
<y> = y | w | r;
<z> = z | s | dh | z;
<zh> = zh | sh | z | jh;
The phonemes shown above are encoded in the CMUBET phonetic alphabet, which is
described and explained on this wiki page:
https://cmusphinx.github.io/wiki/cmubet
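A grammar like the listing above can be generated mechanically for any prompted word. The sketch below assumes a neighbor table keyed by CMUBET phoneme; the three entries shown are excerpted from the listing above:

```python
NEIGHBORS = {  # excerpt of the physiological neighbor table above
    "w": ["w", "l", "y"],
    "ih": ["ih", "iy", "eh"],
    "dh": ["dh", "th", "z", "v"],
}

def neighbors_grammar(word, phonemes):
    """Emit a JSGF grammar whose public rule matches the word's phoneme
    sequence, with each phoneme replaceable by its physiological
    neighbors, as in the listing above."""
    lines = ["#JSGF V1.0;", "grammar neighbors;",
             "public <%s> = sil %s [sil];" % (
                 word, " ".join("<%s>" % p for p in phonemes))]
    for p in dict.fromkeys(phonemes):  # each distinct phoneme, in order
        lines.append("<%s> = %s;" % (p, " | ".join(NEIGHBORS[p])))
    return "\n".join(lines)

g = neighbors_grammar("with", ["w", "ih", "dh"])
```

Running the recognizer against such a grammar reports, for each position, which neighbor best matched the observed speech.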
Another important advance of the invention is the use of diphones. A diphone is the last
part of one phoneme followed by the first part of another. There are over 1,000 diphones
in spoken English, but only about 650 of those occur with substantial frequency. English
diphones in the CMUBET phonetic alphabet are explained and listed with their
frequencies on this wiki page:
http://cmusphinx.github.io/wiki/diphones
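The diphone decomposition of a phoneme sequence is simple to compute, as in this sketch (phonemes in CMUBET):

```python
def diphones(phonemes):
    """The diphone sequence of a phoneme string: the last half of each
    phoneme paired with the first half of its successor."""
    return ["%s-%s" % (a, b) for a, b in zip(phonemes, phonemes[1:])]

d = diphones(["w", "ih", "dh"])  # the word "with" yields two diphones
```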
The use of logistic regression for intelligibility remediation is explained by Figure 3. The
primary database records for asynchronous intelligibility remediation using peer learning
and data collection are depicted in Figure 4. The use of learner analytics for instructional
prompt phrase sequencing and branching scenario transitions is explained by Figure 5.
CONCLUSION
The invention provides better speaking skills instructional software than presently
commercially available from the state of the art. Language students can use thousands of
free web and stand-alone software applications for learning reading, writing, and
listening. But speaking skills instruction is limited to expensive, cumbersome, and often
inaccurate commercial software for pronunciation assessment. The interactive language
pronunciation assessment and remediation software of the invention may be able to
improve students’ pronunciation of words perhaps six times faster than commercially
available products. Millions of people worldwide currently wish to improve their
pronunciation in order to gain access to better jobs and succeed at more opportunities to
speak in public, on teleconferences, or to groups. Unfortunately, the state of the art often
frustrates students by putting too much emphasis on inconsequential mistakes. The
invention solves those problems by allowing adaptive instruction.
While the description above contains many specifics, they should not be considered as
limitations on the scope of the invention, but rather as exemplification of one preferred
embodiment thereof. Many other variations are possible. For example, a children's toy
to teach speaking skills may be provided as a device with a microphone and display, or
the software system may run in internet web browsers as software executed by the
browsers as, for example, program code in the JavaScript computer programming
language. Accordingly, the scope of the invention should be determined not by the
embodiments as described and illustrated, but by the following claims.
CLAIMS
What is claimed is:
(1) A networked client-server computer system composed of:
(a) an instructional interaction database server process and database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process (#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting phrases (#8),
(e)(3) a microphone for speech audio input and recording (#9),
(e)(4) a client process to record speech, determine learner analytics, such as the quality
and intelligibility of the learner’s phoneme, diphone, syllable, and word production; their
word and phrase comprehension; their ability to both comprehend and use grammatical
forms; word stem morphology production and comprehension; "can-do" criteria such as
arbitrary instructional objectives and subject matter; the learner's measured confidence,
effort, and independence; and use those analytics to assess resulting achievement and
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server (#11);
(g) a server process to update speech recognition results and learner analytics, such as the
quality and intelligibility of the learner’s phoneme, diphone, syllable, and word
production; their word and phrase comprehension; their ability to both comprehend and
use grammatical forms; word stem morphology production and comprehension; "can-do"
criteria, including arbitrary instructional objectives and subject matter; the learner's
measured confidence, effort, and independence; and use those analytics to assess
resulting achievement and progress scores from the learner’s audio input (#12);
(h) a learner analytics database server process and database (#13);
(i) a server process to calculate and update learner analytics results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports, statistics, and alerts (#15).
(2) The computer system of Claim 1 with an integrated instructional interaction
development system (#1) composed of a means to input, edit, and extend branching
scenario instructional interactions composed of multiple choice response instructional
content, such as: the Twine (twinery.org) Twee language and "Choose Your Own
Adventure" role-play interactions, which can be added, changed, and removed by editing
a database of interactions in a manner similar to editing a wiki such as Wikipedia or
Wiktionary.
(3) The computer system and instructional interaction development system of Claim 2,
with a means of phonetic disambiguation of homographs (words that are spelled
identically but pronounced differently) presented to the instructional interaction
developer for disambiguation by selection of alternative pronunciations during input and
editing.
(4) The computer system and instructional interaction development system of Claim 2,
with a means of part of speech (e.g., noun, verb, article, adjective, conjunction,
preposition, adverb, etc.) labeling of each word of the instructional interaction prompting
phrases presented for selection of each word’s part of speech during instructional
interaction input and editing.
(5) The computer system of Claim 1, with a means of peer consensus-based validation of
instructional content, composed of a way for learners, instructors, parents, and
administrators to verify and validate each node and each transition between nodes in the
branching scenario instructional interactions, wherein each node and transition is
separately validated by instructor data entry and review, peer learner review, or both.
(6) The computer system of Claim 1, with a means of caching stand-alone exercises for
offline execution, composed of a process reading instructional interactions and
associated data from the system network input interface (#4), which caches instructional
interactions during download, allowing them to be used when the network becomes
disconnected, and a process storing data intended for the system network output
interface, and its results, in nonvolatile storage so that the system will still be usable
when disconnected from the network, or when downloads or uploads or both are inhibited,
such that the system can perform in a manner consistent with stand-alone operation
compatible with free, freemium, or paid content accession models.
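A minimal sketch of the caching behavior described above, assuming a simple JSON file store; the `InteractionCache` class, its method names, and the outbox file layout are illustrative assumptions, not part of the claimed system:

```python
import json
from pathlib import Path

class InteractionCache:
    """Caches instructional interactions during download and queues
    learner results in nonvolatile storage for later upload."""

    def __init__(self, cache_dir="cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(exist_ok=True)

    def store_interaction(self, interaction_id, data):
        # Cache an interaction as it arrives from the network input interface (#4).
        (self.dir / f"{interaction_id}.json").write_text(json.dumps(data))

    def load_interaction(self, interaction_id):
        # Serve from the cache when the network is disconnected.
        path = self.dir / f"{interaction_id}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def queue_result(self, result):
        # Persist learner results until uploads become possible again.
        with (self.dir / "outbox.jsonl").open("a") as f:
            f.write(json.dumps(result) + "\n")

    def pending_results(self):
        outbox = self.dir / "outbox.jsonl"
        if not outbox.exists():
            return []
        return [json.loads(line) for line in outbox.read_text().splitlines()]
```

The append-only outbox keeps the client usable whether downloads, uploads, or both are inhibited.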
(7) The computer system of Claim 1, with a means of extensible vocabulary, composed of
processes to assist in increasing the number and type of words contained in prompting
phrases by length, subject matter, vocational topic, geography, languages, morphological
features, and other topics and aspects.
(8) The computer system of Claim 1, with a means of extensible prompting phrases and
branching scenario interaction modules, allowing for increasing the number and type of
prompting phrases and branching scenario interaction modules by length, subject matter,
vocational topic, geographies, languages, grammatical features, "can-do" criteria, and
other criteria and aspects.
(9) The computer system of Claim 1, with a means of instructional interaction sequencing
composed of processes for registration and sign-in, a process to record learners'
proficiency with each phoneme, diphone, word, and other learner analytics, a process
to determine which instructional content modules, such as branching scenarios and
prompting phrases, the learner needs to practice most, and a process to provide
learners those instructional content modules in sequence (Figure 5).
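The sequencing means above might be sketched as a priority ordering over instructional content modules, ranking first the modules that exercise the segments where the learner's recorded proficiency is weakest; the data shapes and the mean-weakness heuristic below are assumptions for illustration only:

```python
def sequence_modules(proficiency, modules):
    """Order instructional modules so those exercising the learner's
    weakest phonemes, diphones, and words come first (cf. Figure 5).

    proficiency: dict mapping a segment (phoneme, diphone, or word)
                 to a score in [0, 1]; unseen segments count as 0.
    modules:     dict mapping a module name to the segments it practices.
    """
    def need(segments):
        # A module's priority is the learner's mean weakness over the
        # segments it practices (1.0 = never produced correctly).
        if not segments:
            return 0.0
        return sum(1.0 - proficiency.get(s, 0.0) for s in segments) / len(segments)

    return sorted(modules, key=lambda m: need(modules[m]), reverse=True)
```

Modules containing entirely new material score as maximally needed, so unfamiliar content is naturally interleaved with remediation of weak segments.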
(10) The computer system of Claim 1, with a means of authentic intelligibility
remediation composed of two processes:
(a) to obtain recorded audio prompting phrase utterances and their transcriptions from
native and foreign language transcriptionists, and to create a predictive model of the
consequence of observed mispronunciations, as follows:
(a)(1) obtain learner attempts at pronouncing a number of phrases, each associated with a
branching scenario instructional interaction transition in the form of recorded audio;
(a)(2) using the recorded audio attempts, categorize each word as having been transcribed
either correctly or incorrectly;
(a)(3) using automatic speech recognition, evaluate the pronunciation of the recorded
audio to determine the temporal endpoints and duration, along with the acoustic
confidence probability, and alternative nearby physiologically neighboring speech
segments such as phonemes, diphones, and syllables which may have matched the
recorded audio more closely than the expected segments;
(a)(4) using the recorded audio of each word and the proportion of the time that it was
transcribed correctly, use logistic regression to model the consequence of each
mispronunciation for prediction of the likelihood that the word was correctly transcribed,
from the independent variables produced by the automatic speech recognition results
(Figure 3); and
(a)(5) store the results of the logistic regression predictive model as weight coefficients
for each of the independent variables of each word of each prompting phrase in the
predictive model; and
(b) to provide learner exercise interaction as follows:
(b)(1) display one or more prompting phrases;
(b)(2) record audio from the learner;
(b)(3) using automatic speech recognition, evaluate the pronunciation of the recorded
audio to determine the temporal endpoints and duration, along with the acoustic
confidence probability, and alternative nearby physiologically neighboring speech
segments such as phonemes, diphones, and syllables which may have matched the
recorded audio more closely than the expected segments;
(b)(4) scale the results of the automatic speech recognition according to the weights
stored in step (a)(5) to determine the expected probability that each word is intelligible;
(b)(5) rank each of the predicted unintelligible words by consequence according to part of
speech and predictive model probability magnitude;
(b)(6) provide audio or audio and visual feedback to the learner based on their most
consequential pronunciation mistake as expected by the predictive model; and
(b)(7) as part of the audio feedback, replay the learner's most consequential
mispronunciation followed by another two prerecorded audio words, one of which
includes the phoneme or diphone associated with the observed sound constituting the
mispronunciation, followed by a word with the phoneme or diphone associated with the
correct pronunciation.
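The two processes of Claim 10 can be sketched end to end: step (a)(4) fits logistic regression weights per word from ASR-derived independent variables and transcription outcomes, step (a)(5) stores those weights, step (b)(4) scales new recognition results into an intelligibility probability, and step (b)(5) ranks the predicted unintelligible words. The single-feature toy example and helper names are illustrative assumptions; a real deployment would use the full set of independent variables (endpoints, durations, acoustic confidences, neighboring segments):

```python
import math

def train_word_model(examples, lr=0.1, epochs=2000):
    """Fit logistic regression weights for one word.
    examples: list of (features, transcribed_correctly) pairs, where
    features are the ASR independent variables for one utterance."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)  # bias plus one weight per independent variable
    for _ in range(epochs):
        for x, y in examples:
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - (1.0 if y else 0.0)  # gradient of the log loss
            w[0] -= lr * g
            for i, xi in enumerate(x):
                w[i + 1] -= lr * g * xi
    return w

def word_intelligibility(w, features):
    """Step (b)(4): expected probability that the word is intelligible,
    scaling ASR results by the stored weight coefficients."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

def rank_mispronunciations(scored_words, threshold=0.5):
    """Step (b)(5): predicted unintelligible words, most consequential
    (lowest predicted probability) first."""
    probs = dict(scored_words)
    return sorted((wd for wd, p in scored_words if p < threshold),
                  key=probs.get)
```

Stochastic gradient descent here stands in for any standard logistic regression solver; only the stored per-word weight vectors need to persist between the training and exercise processes.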
(11) The computer system of Claim 1, with a means of multiple pass automatic speech
recognition composed of learner analytics assessment processes to determine temporal
endpoints, and thereby the duration, and acoustic scores for speech segments such as
phonemes, diphones, syllables, and words of prompting phrases, wherein anomalous
durations of those segments guide multiple passes of automatic speech recognition of the
same audio input using different speech recognition grammars representing utterance
expectations.
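The multiple-pass strategy of Claim 11 amounts to re-running recognition with a different grammar whenever segment durations are anomalous; the `recognize` callable below is a hypothetical stand-in for a real recognizer interface, and the relative-tolerance heuristic is an assumption:

```python
def multipass_recognize(recognize, audio, grammars, expected_durations,
                        tolerance=0.5):
    """Run recognition passes until no segment duration is anomalous
    or the candidate grammars are exhausted.

    recognize:          callable(audio, grammar) -> list of
                        (segment_label, start_sec, end_sec) tuples
    grammars:           speech recognition grammars representing
                        utterance expectations, tried in order
    expected_durations: dict mapping segment label -> typical duration
                        in seconds
    """
    result = None
    for grammar in grammars:
        result = recognize(audio, grammar)
        anomalous = [
            seg for seg, start, end in result
            if seg in expected_durations
            and abs((end - start) - expected_durations[seg])
                > tolerance * expected_durations[seg]
        ]
        if not anomalous:
            break  # all durations plausible; accept this pass
    return result
```

Each anomalous pass simply triggers the next grammar; a fuller version might choose the next grammar based on which segments were anomalous.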
(12) The computer system of Claim 1, with a means of speech-language pathology
reporting composed of surveying the current terms used in, the manner of presentation
of, the printed forms composing, the order of presentation of, and the information
contained in reports used by practicing speech-language pathologists, and then
formatting reports, statistics, and alerting messages according to the surveyed
descriptions of those reports.
(13) A networked client-server computer system composed of:
(a) an instructional interaction database server process and database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process (#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting phrases (#8),
(e)(3) a microphone for speech audio input and recording (#9),
(e)(4) a client process to record speech, determine learner analytics, including the quality
and intelligibility of the learner’s phoneme, diphone, syllable, and word production; their
word and phrase comprehension; their ability to both comprehend and use grammatical
forms; word stem morphology production and comprehension; "can-do" criteria such as
arbitrary instructional objectives and subject matter; the learner's measured confidence,
effort, and independence; and use those analytics to assess resulting achievement and
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server (#11);
(g) a server process to update speech recognition results and learner analytics, including
the quality and intelligibility of the learner’s phoneme, diphone, syllable, and word
production; their word and phrase comprehension; their ability to both comprehend and
use grammatical forms; word stem morphology production and comprehension; "can-do"
criteria, including arbitrary instructional objectives and subject matter; the learner's
measured confidence, effort, and independence; and use those analytics to assess
resulting achievement and progress scores from the learner’s audio input (#12);
(h) a learner analytics database server process and database (#13);
(i) a server process to calculate and update learner analytics results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports, statistics, and alerts (#15);
(k) an integrated instructional interaction development system (#1) composed of a means
to input, edit, and extend branching scenario instructional interactions composed of
multiple choice response instructional content, such as: the Twine (twinery.org) Twee
language and "Choose Your Own Adventure" role-play interactions, which can be added,
changed, and removed by editing a database of interactions in a manner similar to editing
a wiki such as Wikipedia or Wiktionary;
(l) a means of phonetic disambiguation of homographs (words that are spelled identically
but pronounced differently) presented to the instructional interaction developer for
disambiguation by selection of alternative pronunciations during input and editing;
(m) a means of part of speech (e.g., noun, verb, article, adjective, conjunction,
preposition, adverb, etc.) labeling of each word of the instructional interaction prompting
phrases presented for selection of each word’s part of speech during instructional
interaction input and editing;
(n) a means of peer consensus-based validation of instructional content, composed of a
way for learners, instructors, parents, and administrators to verify and validate each
node and each transition between nodes in the branching scenario instructional
interactions, wherein each node and transition is separately validated by instructor
data entry and review, peer learner review, or both;
(o) a means of caching stand-alone exercises for offline execution, composed of a
process reading instructional interactions and associated data from the system network
input interface (#4), which caches instructional interactions during download, allowing
them to be used when the network becomes disconnected, and a process storing data
intended for the system network output interface, and its results, in nonvolatile
storage so that the system will still be usable when disconnected from the network, or
when downloads or uploads or both are inhibited, such that the system can perform in a
manner consistent with stand-alone operation compatible with free, freemium, or paid
content accession models;
(p) a means of extensible vocabulary, composed of processes to assist in increasing the
number and type of words contained in prompting phrases by length, subject matter,
vocational topic, geography, languages, morphological features, and other topics and
aspects.
(q) a means of extensible prompting phrases and branching scenario interaction modules,
allowing for increasing the number and type of prompting phrases and branching scenario
interaction modules by length, subject matter, vocational topic, geographies, languages,
grammatical features, "can-do" criteria, and other criteria and aspects.
(r) a means of instructional interaction sequencing composed of processes for
registration and sign-in, a process to record learners' proficiency with each phoneme,
diphone, word, and other learner analytics, a process to determine which instructional
content modules, such as branching scenarios and prompting phrases, the learner needs
to practice most, and a process to provide learners those instructional content
modules in sequence (Figure 5);
(s) a means of authentic intelligibility remediation composed of two processes:
(s)(1) to obtain recorded audio prompting phrase utterances and their transcriptions
from native and foreign language transcriptionists, and to create a predictive model of
the consequence of observed mispronunciations, as follows:
(s)(1)(a) obtain learner attempts at pronouncing a number of phrases, each associated
with a branching scenario instructional interaction transition in the form of recorded
audio;
(s)(1)(b) using the recorded audio attempts, categorize each word as having been
transcribed either correctly or incorrectly;
(s)(1)(c) using automatic speech recognition, evaluate the pronunciation of the recorded
audio to determine the temporal endpoints and duration, along with the acoustic
confidence probability, and alternative nearby physiologically neighboring speech
segments such as phonemes, diphones, and syllables which may have matched the
recorded audio more closely than the expected segments;
(s)(1)(d) using the recorded audio of each word and the proportion of the time that it
was transcribed correctly, use logistic regression to model the consequence of each
mispronunciation for prediction of the likelihood that the word was correctly transcribed,
from the independent variables produced by the automatic speech recognition results
(Figure 3); and
(s)(1)(e) store the results of the logistic regression predictive model as weight coefficients
for each of the independent variables of each word of each prompting phrase in the
predictive model; and
(s)(2) to provide learner exercise interaction as follows:
(s)(2)(a) display one or more prompting phrases;
(s)(2)(b) record audio from the learner;
(s)(2)(c) using automatic speech recognition, evaluate the pronunciation of the recorded
audio to determine the temporal endpoints and duration, along with the acoustic
confidence probability, and alternative nearby physiologically neighboring speech
segments such as phonemes, diphones, and syllables which may have matched the
recorded audio more closely than the expected segments;
(s)(2)(d) scale the results of the automatic speech recognition according to the weights
stored in step (s)(1)(e) to determine the expected probability that each word is
intelligible;
(s)(2)(e) rank each of the predicted unintelligible words by consequence according to part
of speech and predictive model probability magnitude;
(s)(2)(f) provide audio or audio and visual feedback to the learner based on their most
consequential pronunciation mistake as expected by the predictive model; and
(s)(2)(g) as part of the audio feedback, replay the learner's most consequential
mispronunciation followed by another two prerecorded audio words, one of which
includes the phoneme or diphone associated with the observed sound constituting the
mispronunciation, followed by a word with the phoneme or diphone associated with the
correct pronunciation;
(t) a means of multiple pass automatic speech recognition composed of learner analytics
assessment processes to determine temporal endpoints, and thereby the duration, and
acoustic scores for speech segments such as phonemes, diphones, syllables, and words of
prompting phrases, wherein anomalous durations of those segments guide multiple passes
of automatic speech recognition of the same audio input using different speech
recognition grammars representing utterance expectations; and
(u) a means of speech-language pathology reporting composed of surveying the current
terms used in, the manner of presentation of, the printed forms composing, the order of
presentation of, and the information contained in reports used by practicing
speech-language pathologists, and then formatting reports, statistics, and alerting
messages according to the surveyed descriptions of those reports.
(14) A networked client-server computer system composed of:
(a) an instructional interaction database server process and database (Figure 1, #2);
(b) an interaction and prompting phrase selection server process (#3);
(c) a network connection from the server to the client (#4);
(d) a client web browser (#5);
(e) an instruction delivery application (#6), composed of:
(e)(1) an interaction and prompting phrase selection client process (#7),
(e)(2) a display for interaction multimedia and prompting phrases (#8),
(e)(3) a microphone for speech audio input and recording (#9),
(e)(4) a client process to record speech, determine learner analytics, such as the quality
and intelligibility of the learner’s phoneme, diphone, syllable, and word production; their
word and phrase comprehension; their ability to both comprehend and use grammatical
forms; word stem morphology production and comprehension; "can-do" criteria such as
arbitrary instructional objectives and subject matter; the learner's measured confidence,
effort, and independence; and use those analytics to assess resulting achievement and
progress scores from the learner’s audio input (#10), and
(f) a network connection from the client to the server (#11);
(g) a server process to update speech recognition results and learner analytics, such as the
quality and intelligibility of the learner’s phoneme, diphone, syllable, and word
production; their word and phrase comprehension; their ability to both comprehend and
use grammatical forms; word stem morphology production and comprehension; "can-do"
criteria, including arbitrary instructional objectives and subject matter; the learner's
measured confidence, effort, and independence; and use those analytics to assess
resulting achievement and progress scores from the learner’s audio input (#12);
(h) a learner analytics database server process and database (#13);
(i) a server process to calculate and update learner analytics results, reports, and statistics
(#14);
(j) a server process to produce, display, and send reports, statistics, and alerts (#15).
(15) The computer system of Claim 14 with an integrated instructional interaction
development system (#1) composed of a means to input, edit, and extend branching
scenario instructional interactions composed of multiple choice response instructional
content, such as: the Twine (twinery.org) Twee language and "Choose Your Own
Adventure" role-play interactions, which can be added, changed, and removed by editing
a database of interactions in a manner similar to editing a wiki such as Wikipedia or
Wiktionary.
(16) The computer system of Claim 14, with a means of caching stand-alone exercises
for offline execution, composed of a process reading instructional interactions and
associated data from the system network input interface (#4), which caches
instructional interactions during download, allowing them to be used when the network
becomes disconnected, and a process storing data intended for the system network
output interface, and its results, in nonvolatile storage so that the system will still
be usable when disconnected from the network, or when downloads or uploads or both are
inhibited, such that the system can perform in a manner consistent with stand-alone
operation compatible with free, freemium, or paid content accession models.
(17) The computer system of Claim 14, with a means of instructional interaction
sequencing composed of processes for registration and sign-in, a process to record
learners' proficiency with each phoneme, diphone, word, and other learner analytics, a
process to determine which instructional content modules, such as branching scenarios
and prompting phrases, the learner needs to practice most, and a process to provide
learners those instructional content modules in sequence (Figure 5).
(18) The computer system of Claim 14, with a means of authentic intelligibility
remediation composed of two processes:
(a) to obtain recorded audio prompting phrase utterances and their transcriptions from
native and foreign language transcriptionists, and to create a predictive model of the
consequence of observed mispronunciations, as follows:
(a)(1) obtain learner attempts at pronouncing a number of phrases, each associated with a
branching scenario instructional interaction transition in the form of recorded audio;
(a)(2) using the recorded audio attempts, categorize each word as having been transcribed
either correctly or incorrectly;
(a)(3) using automatic speech recognition, evaluate the pronunciation of the recorded
audio to determine the temporal endpoints and duration, along with the acoustic
confidence probability, and alternative nearby physiologically neighboring speech
segments such as phonemes, diphones, and syllables which may have matched the
recorded audio more closely than the expected segments;
(a)(4) using the recorded audio of each word and the proportion of the time that it was
transcribed correctly, use logistic regression to model the consequence of each
mispronunciation for prediction of the likelihood that the word was correctly transcribed,
from the independent variables produced by the automatic speech recognition results
(Figure 3); and
(a)(5) store the results of the logistic regression predictive model as weight coefficients
for each of the independent variables of each word of each prompting phrase in the
predictive model; and
(b) to provide learner exercise interaction as follows:
(b)(1) display one or more prompting phrases;
(b)(2) record audio from the learner;
(b)(3) using automatic speech recognition, evaluate the pronunciation of the recorded
audio to determine the temporal endpoints and duration, along with the acoustic
confidence probability, and alternative nearby physiologically neighboring speech
segments such as phonemes, diphones, and syllables which may have matched the
recorded audio more closely than the expected segments;
(b)(4) scale the results of the automatic speech recognition according to the weights
stored in step (a)(5) to determine the expected probability that each word is intelligible;
(b)(5) rank each of the predicted unintelligible words by consequence according to part of
speech and predictive model probability magnitude;
(b)(6) provide audio or audio and visual feedback to the learner based on their most
consequential pronunciation mistake as expected by the predictive model; and
(b)(7) as part of the audio feedback, replay the learner's most consequential
mispronunciation followed by another two prerecorded audio words, one of which
includes the phoneme or diphone associated with the observed sound constituting the
mispronunciation, followed by a word with the phoneme or diphone associated with the
correct pronunciation.
(19) The computer system of Claim 14, with a means of multiple pass automatic speech
recognition composed of learner analytics assessment processes to determine temporal
endpoints, and thereby the duration, and acoustic scores for speech segments such as
phonemes, diphones, syllables, and words of prompting phrases, wherein anomalous
durations of those segments guide multiple passes of automatic speech recognition of the
same audio input using different speech recognition grammars representing utterance
expectations.
(20) The computer system of Claim 14, with a means of speech-language pathology
reporting composed of surveying the current terms used in, the manner of presentation
of, the printed forms composing, the order of presentation of, and the information
contained in reports used by practicing speech-language pathologists, and then
formatting reports, statistics, and alerting messages according to the surveyed
descriptions of those reports.
ABSTRACT
This invention is a method of interactive computer-aided instruction for general education
including speaking skills. Learners are asked to read text prompting phrases into a
microphone in response to multiple choice questions. Automatic speech recognition is
used to assess the pronunciation and provide remediation, in the form of audio or visual
responses or both, based on the authentic intelligibility of the learners' spoken responses
determined from transcriptions of other learners' utterances of the same prompting
phrases.
PROVISIONAL PATENT APPLICATION AND DISCLOSURE DOCUMENT
REFERENCES
The foregoing utility patent application specification claims the earlier date of James
Salsman's U.S. provisional patent application of March 4, 2016, entitled, “Pronunciation
Assessment for Intelligibility Remediation.” The delay in filing the present application
beyond the one year statutory limit was unavoidable, but was less than the two month
regulatory exemption for unavoidable delay. The present application also makes reference
to U.S. Patent and Trademark Office Disclosure Document number S00867 filed by
James Salsman on October 23, 1998, entitled, “Solar-powered Portable Reading
Instruction System.”