Language and Learning Technology

Language Research and Technology

Flinders language research technology is largely focussed on two areas: getting computers to learn language, and getting computers to teach language. There are, however, many more aspects than these two.

Computational Psycholinguistics and NLL

Our research in language learning seeks to balance two different goals:

  1. to model theories of human language acquisition (Psycholinguistics);
  2. to develop better Natural Language Processing Systems (NLL).

Prof. David Powers is one of the pioneers in the area of Natural Language Learning and sought to establish this balance as Founding President of the peak body and conference in this area, SIGNLL and CoNLL. Currently, NLL is dominated by statistical models and a Computer Science orientation, but SIGNLL and CoNLL continue to make efforts to connect with the Psycholinguistic community and to encourage computational modelling of psychological theories of language acquisition. Our research remains strongly focussed on psychological and biological plausibility, not just for their own sake but as heuristics that help us to bias our learning algorithms in ways that will lead to human-like language and social behaviour.

Our orientation towards approaches that address both of these goals leads to our having a particular focus on, and expertise in, the application of unsupervised learning techniques. This endeavour faces major challenges, including a movement within Linguistics that claims that language is not learned but innate, on the basis of "Poverty of the Stimulus". Three of the factors that are missed by these opponents of the possibility of language learning are Grounding, Idiolect, and Anticipation.

Grounding refers to the impossibility of really understanding language without understanding the world in which we are embedded, and which we talk about. For this reason we work with Intelligent Robots, Simulated Robots or Enheaded Embodied Conversational Agents, including in particular the concept of a Hybrid World that is also an essential part of our language teaching system.

Idiolect refers to the fact that no two people actually learn exactly the same language. In fact, children negotiate a language that works for them across all contexts that they are in. While they are at home with their mother, they start to learn their mother tongue, their mother's language, but as soon as they are in contact with other people, including other children, and in particular siblings, and most especially twins, they adapt their language and invent new language as part of integrating into each of those subcommunities.

Anticipated Learning was a mechanism proposed independently by David Powers and Chris Turk and is discussed in detail in their joint monograph (1989). Powers proposed a requirement to have separate Recognition and Production Models to allow self-correction to operate, whilst Turk proposed the term "Anticipated Learning" for a theory in which previously heard utterances can trigger a kind of negative feedback that is missing according to "Poverty of the Stimulus". If a sentence is only slightly beyond the limits of our current language competence, it is optimal for learning and we seem to be able to start to hook in the new components. If we have made a slight grammatical or lexical error and produced an incorrect sentence, most of us will become aware of it and self-correct. In fact the sentence need not even be grammatically wrong; it may just be semantically ambiguous. Our hearing/understanding model recognizes this on hearing the sentence, if not before, and either amends it on the fly, or adds a rider to disambiguate, or makes a joke of it (pun not intended).

Our research group on NLL/Computational Psycholinguistics thus has a focus not just on AI/NLP applications but specifically on psychologically plausible models and unsupervised learning, as well as Computational Cognitive Linguistics models that depend on linguistic mechanisms and learning algorithms based on the idea of similarity, or metaphor and paradigm. The focus on unsupervised learning is also important because supervised learning techniques cannot meaningfully be applied when there is no agreed gold standard theory that can be used to create data to train on, not to mention that if we had an agreed theory there would be no need for the unsupervised psycholinguistic modelling. This lack of a gold standard theory and data also highlights the difficulty of evaluating the results of unsupervised learning, and thus evaluation becomes another major focus of our group.

Information Retrieval

One of the main application areas for language technology is web search or information retrieval. Currently most research and all commercial search engines are based purely on statistics and have no real understanding of the text or its grammar: no syntax, no semantics, no ontology. Our research group in Information Retrieval and Visualization also addresses another issue: so far there has been little success in replacing the list of hits with more visual graphical representations.

Interestingly, current approaches to evaluation of supervised or hand-crafted natural language processing systems are based on evaluation techniques drawn from Information Retrieval, in particular Precision, Recall and their derivatives (in particular F-Measure and Rand Accuracy, which are different kinds of averages of Precision and Recall). These techniques are problematic even in Information Retrieval when there is no known set of relevant documents to use for calculating Recall, so that the only statistic that can be calculated is Precision. Even for fixed datasets for which a set of relevant documents has been determined by human annotators, the set of relevant documents is something of a fuzzy concept dependent on individual preferences and judgement, but at least the true negatives (correctly eliminated irrelevant documents) and false negatives (incorrectly eliminated relevant documents) can be estimated.

An important aspect of Information Retrieval is understanding the semantic distinctions that language conveys, and determining similarity of documents, words and phrases in terms of meaning, not just words.  This thus connects our work on learning syntax (Leibbrandt and Powers, 2010-12) to our work on learning semantic relationships and understanding word similarity (Yang and Powers, 2005-10).


Entwisle and Powers (1998) showed there are severe problems with the use of Precision, Recall and their means when distributions are skewed, as they often are in Information Retrieval, or in semantic or syntactic labelling of words. In fact, it is possible to bias even a system operating at chance level to get very high values of all of these statistics. In particular, if 90% of the examples are positive, then guessing everything is positive gives you 90% Precision, 100% Recall with all possible averages lying in the range 90 to 100%.
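The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch: the data set size of 1000 and the variable names are our own assumptions, not taken from the paper, and the "system" is simply one that labels everything positive.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical data set: 1000 examples, of which 90% are positive.
# The "system" ignores its input and labels every example positive.
tp, fp, fn, tn = 900, 100, 0, 0

precision = tp / (tp + fp)  # proportion of positive labels that are right
recall = tp / (tp + fn)     # proportion of real positives that are found

# Three common ways of averaging Precision and Recall.
averages = {
    "arithmetic": mean([precision, recall]),
    "geometric": geometric_mean([precision, recall]),
    "harmonic (F-Measure)": harmonic_mean([precision, recall]),
}

print(f"Precision = {precision:.3f}, Recall = {recall:.3f}")
for name, value in averages.items():
    print(f"{name} mean = {value:.3f}")
```

Every one of these averages lands between 0.9 and 1.0, even though the system has used no information at all about the individual examples.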

Powers (2003) formally derived an Informedness measure that accurately estimates the probability (or the proportion of the time) that a system is making an informed decision rather than a chance-level guess, and generalized it to the multiclass case under the sobriquet Bookmaker, because it "handicaps" the value of a decision in the same way as in racing. In a series of subsequent papers, other techniques have been critiqued in an attempt to determine what value, if any, they have. Some equivalent measures have been found for the two-class case, but no existing measure delivers such a meaningful probability for the general multiclass case. ROC is a useful technique which is closely related to Informedness in the dichotomous case, but AUC has very special properties that measure something other than the performance of the specific application (Powers, 2012bc). Kappa is also not a bad technique, but can be shown to be misleading when the learner bias differs significantly from the prevalences (Powers, 2012a), as in the case detailed above (Entwisle and Powers, 1998).
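For the dichotomous case, Informedness is commonly written as Recall + Inverse Recall - 1, where Inverse Recall is the proportion of real negatives that are correctly rejected. The sketch below uses that two-class form only; the function name and the two contingency tables are hypothetical illustrations, the multiclass Bookmaker generalization is not attempted, and both classes are assumed to occur in the data so that no division by zero arises.

```python
def informedness(tp, fp, fn, tn):
    """Two-class Informedness: Recall + Inverse Recall - 1."""
    recall = tp / (tp + fn)          # proportion of real positives found
    inverse_recall = tn / (tn + fp)  # proportion of real negatives found
    return recall + inverse_recall - 1


# A guesser that labels everything positive on data that is 90% positive.
always_positive = informedness(tp=900, fp=100, fn=0, tn=0)

# A hypothetical system that actually uses information about each example.
informed_system = informedness(tp=810, fp=20, fn=90, tn=80)

print(f"always positive: {always_positive:.2f}")  # 0.00
print(f"informed system: {informed_system:.2f}")  # 0.70
```

The all-positive guesser scores zero however skewed the data are, whereas its Precision and Recall would both look excellent.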

For unsupervised evaluation, none of the above techniques is directly usable, although Powers (2003) does discuss a heuristic method for comparing supervised classes with unsupervised clusters, and Pfitzner, Leibbrandt and Powers (2009) undertake a comprehensive review of techniques for comparison of clusterings, or of unsupervised classifications with a gold standard classification.

Teaching Heads

Our Talking Thinking Teaching Heads build on our expertise in Speech and Language technology generally, and applications also extend into the areas of companions for older people, or helpers for students new to the campus. The conversational agent technology built into these applications is currently very primitive and template-based, and the applications have been chosen because of their amenability to this simplistic technology.

Many of our Talking Thinking Teaching Head applications are educational in nature. The original concept of the Teaching Head was to turn around our language learning research and the robot world simulations we developed for simulated grounding of our computational natural language learners. The same scenes that are useful for teaching a computer English are also useful for teaching a student English, but the task is actually a lot easier. You don't need complete native speaker competence in the language to teach basic vocabulary and grammar. Moreover, if you don't understand what the student says to you in response to a question, it doesn't matter, because you know what the answer should have been according to the goals of the current lesson and exercise. This fits very nicely with the template approach that underlies most conversational agents, including the default version of our Thinking Heads.

An obvious development from teaching grammar is to teach phonology, or microphonology: the Teaching Head has the ability to show you exactly what you should have said, and how you should have said it, in any arbitrary face and voice, including yours! Mirroring your actual speech as you give an answer, with your poor learner's accent and suspect grammar, and then repeating the correct answer back to you with a perfect accent in your own face and voice is a key focus of our current work on the Teaching Head. It should provide a huge improvement in the ability of a student to learn the correct pronunciation and accent, as well as the syntax and semantics of the language. Of course, English is not the only language we could teach, and collaborators at TUB, Berlin and BJUT, Beijing are working on German and Chinese speaking versions of the head, as well as being interested in an English Teaching Head for German and Chinese students.

However, another twist on the language Teaching Head is to teach English literacy and numeracy to English speakers: Australians in our case, or indigenous Australians, or new immigrants to Australia. In fact, it was the indigenous Australian communities that got excited about this project, and the aim of the VALIANT project is to help the community directly with literacy, numeracy and health issues, as well as to assist with the training of indigenous health workers.

A further twist on the language Teaching Head is to teach children with disabilities. Our AudioVisual Speech Recognition work has shown that we have the potential to distinguish by lipreading sounds that are traditionally thought to be indistinguishable, that is, to be part of a single viseme, so teaching lip reading seemed to be a natural extension. However, what the deaf community and their health teams really thought was a key need that the Teaching Head could help with was learning social skills. This very quickly led to other communities and conditions where teaching social skills is important, and children with Autism Spectrum Disorders are the current focus of the AVAST project.

A related variant of the Teaching Head is a new project in the area of Motivational Interviewing. Here the aim is to help health professionals learn how to assist a patient in recognizing their needs, and in developing an achievable plan to address them. Conversely, we can also exemplify the Motivational Interviewing technique, achieving a system that may actually be useful as an adjunct to some of our other Teaching Head systems, for motivating and encouraging people. In particular, our Open Day Buddy and Study Buddy are aimed at helping students, whilst other systems including Clevertar Connect and MANA are aimed at helping older people live more independent, meaningful and fulfilling lives.

This takes us into the area of Assistive Technologies, which is also a focal area within the Medical Device Research Institute.

For More Information...

For more information on these projects, or if you're interested in joining the group, or enquiring about courses or scholarships, please contact Dr Richard Leibbrandt or Professor David Powers.




Evaluation Evaluation

In any kind of research, a critical consideration is how to evaluate it! Nowhere is this more critical than in Machine Learning, since the evaluation is used not only to assess the final results, but also to train and tune the system in tight feedback loops that run without human intervention. Unfortunately there is perhaps nowhere where evaluation is done worse or more unsoundly! Hence Evaluation is a major area of research interest for KIT in its own right.

The most commonly used evaluation measures in Machine Learning and Natural Language Processing are Recall, Precision and F-Measure, or sometimes (Rand) Accuracy.  Recall and Precision are borrowed from Information Retrieval and measure what proportion of relevant documents are returned (Recall) and what proportion of returned documents are relevant (Precision).  Since people don't like having to decide which is best based on two measures, F-Measure was introduced as a kind of average of Recall and Precision (a harmonic mean).  These three measures all have two major disadvantages: 1. they ignore how many irrelevant documents were correctly screened out; and 2. they ignore what accuracy may be achieved by chance.  Rand Accuracy takes into account how many relevant documents are returned as well as how many irrelevant documents are filtered out, but still is strongly influenced by chance.
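The first disadvantage can be made concrete with a small sketch. The counts below are invented purely for illustration, and the function name is our own: two result sets share the same returned and missed relevant documents but differ hugely in how many irrelevant documents were correctly screened out.

```python
def scores(tp, fp, fn, tn):
    """Recall, Precision, F-Measure and Rand Accuracy from a 2x2 table."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Same hits and misses, but very different numbers of true negatives (tn).
small = scores(tp=40, fp=10, fn=20, tn=30)
large = scores(tp=40, fp=10, fn=20, tn=30000)

for name in ("precision", "recall", "f_measure", "accuracy"):
    print(f"{name:10s} small={small[name]:.3f} large={large[name]:.3f}")
```

Recall, Precision and F-Measure are identical for the two tables because `tn` never enters their formulas; only Rand Accuracy changes, and it is then dominated by the sheer number of irrelevant documents.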

Entwisle and Powers (1998) illustrate the problem with a real example from part of speech tagging in Natural Language Processing, where people were deliberately dumbing down their taggers and parsers to obtain better F-Measure or Accuracy. The example involves the word water, which is mostly a noun (say 90% of the time). 100% Recall and 90% Precision, and similarly high F-Measure and Accuracy, may be obtained by saying water is always a noun. In fact chance level Precision follows the (90%) Prevalence of the tag in the real data, whilst Recall follows the (100%) Bias of the system to the tag.
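The relationship between chance-level scores, Prevalence and Bias can be sketched by building the expected contingency table of a tagger whose guesses are independent of the true tag. The function names, the sample size and the extra Bias values of 0.5 and 0.1 are assumptions made for illustration; only the 90% Prevalence and 100% Bias come from the water example.

```python
def chance_contingency(prevalence, bias, n=1000):
    """Expected counts when the tag is guessed independently of the truth."""
    tp = n * prevalence * bias
    fp = n * (1 - prevalence) * bias
    fn = n * prevalence * (1 - bias)
    tn = n * (1 - prevalence) * (1 - bias)
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn, tn):
    return tp / (tp + fp), tp / (tp + fn)


# "water" is a noun 90% of the time; vary how often the guesser says "noun".
results = []
for bias in (1.0, 0.5, 0.1):
    table = chance_contingency(prevalence=0.9, bias=bias)
    precision, recall = precision_recall(*table)
    results.append((bias, precision, recall))
    print(f"Bias={bias:.1f}: Precision={precision:.2f}, Recall={recall:.2f}")
```

Precision stays at the Prevalence whatever the guesser does, while Recall simply tracks the Bias, so pushing the Bias to 100% maximizes Recall without any information being used.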

Powers (2003) introduced the concept of Informedness, the probability that a decision is informed rather than a guess. It varies from 0 to 1 as information is increasingly employed to improve the accuracy above chance levels. If the information is taken into account incorrectly by the learner, it can actually go negative too, corresponding to lower than chance performance, which is what these taggers were seeking to avoid, as they tend to do little better than 50:50 in deciding noun:verb. Later publications (2007/2011) introduce Markedness to capture information flow in the opposite direction, as well as relationships to "good" measures such as DeltaP and Correlation, ROC (2012) and Kappa (2012). These chance-corrected measures equate to Informedness and Markedness in the simple 2-class case when Prevalence = Bias, which also leads to the uncorrected measures Recall, Precision, F-Measure and Accuracy coinciding. Generally these measures are even less appropriate for the K>2-class case, but Informedness generalizes to Bookmaker (it is best known from using "booking odds" for handicapping races, or as the debiasing system for multiple choice tests).
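A minimal two-class sketch of these quantities follows. It assumes the usual dichotomous forms, Informedness = Recall + Inverse Recall - 1 and Markedness = Precision + Inverse Precision - 1, with Correlation taken as their signed geometric mean; the function name and both contingency tables are hypothetical, and the multiclass Bookmaker case is not covered.

```python
from math import copysign, sqrt


def evaluate(tp, fp, fn, tn):
    """Chance-corrected and uncorrected measures for a 2x2 table."""
    n = tp + fp + fn + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    informedness = recall + tn / (tn + fp) - 1   # Recall + Inverse Recall - 1
    markedness = precision + tn / (tn + fn) - 1  # Precision + Inv. Precision - 1
    # Both share the sign of tp*tn - fp*fn; max() guards against rounding.
    correlation = copysign(sqrt(max(0.0, informedness * markedness)), informedness)
    return {
        "prevalence": (tp + fn) / n,
        "bias": (tp + fp) / n,
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
        "informedness": informedness,
        "markedness": markedness,
        "correlation": correlation,
    }


# Prevalence = Bias (fp == fn): the chance-corrected measures coincide.
balanced = evaluate(tp=60, fp=10, fn=10, tn=20)
# Information used the wrong way round: worse than chance.
worse = evaluate(tp=10, fp=40, fn=40, tn=10)

for label, result in (("balanced", balanced), ("worse", worse)):
    print(label, {k: round(v, 3) for k, v in result.items()})
```

In the balanced table Informedness, Markedness and Correlation agree with one another, as do Recall, Precision and F-Measure, while the second table yields a negative Informedness.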

So far we have considered only systems where we think we know the right answers, and learning where these "right" answers are used to train the learner in a supervised way. One concern is that the "gold" standards that are used are often wrong, and even the categories or classes used are debatable. For this reason much of our work focuses on unsupervised learning techniques, such as clustering or co-clustering, as well as techniques for evaluating these unsupervised results either in relation to a standard (Pfitzner, Leibbrandt & Powers, 2009), or by evaluating the utility of a model in an application (often one involving our Talking Thinking Teaching Head). This unsupervised perspective is also relevant for our work on understanding how children learn to understand the world they live in and develop their social, cultural and linguistic abilities.

Highlighted Publications
by Lab members

David M.W. Powers (2012a). The Problem of Kappa. In Proceedings of EACL 2012. 13th Conference of the European Chapter of the Association for Computational Linguistics 2012. pp. 345-355.
David M.W. Powers (2012b). The Problem of Area Under the Curve. In Proceedings of 2012 IEEE International Conference of Information Science and Technology. ICIST 2012. pp. 567-573.
David M.W. Powers (2012c). ROC-ConCert: ROC-Based Measurement of Consistency and Certainty. In Spring World Congress on Engineering and Technology: Proceedings. SCET2012.
David M.W. Powers (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1), pp.37-63.

Leibbrandt, Richard E. & Powers, David M.W. 2012. Robust Induction of Parts-of-Speech in Child-Directed Language by Co-Clustering of Words and Contexts. Conference of the European Chapter of the Association for Computational Linguistics.

Richard E. Leibbrandt  & Powers, David M.W. (2010). Frequent Frames as Cues to Part-of-Speech in Dutch: Why Filler Frequency Matters. Proceedings of the 32nd Annual Conference of the Cognitive Science Society, 2680-2685.

Richard E. Leibbrandt & Powers, David M.W. 2010. Paradigmatic learning from child-directed speech. 3rd UK Cognitive Linguistics Conference.

Darius M. Pfitzner, Leibbrandt, R.E., & Powers, D.M.W. (2009). Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems, 19(3), pp.361-394.

Darius M. Pfitzner,  Treharne, K., & Powers, David M. W., 2008. User Keyword Preference: the Nwords and Rwords Experiments. International Journal of Internet Protocol Technology, 3(3), 149-158.

David M. W. Powers,  2008. Evaluation Evaluation, The 18th European Conference on Artificial Intelligence (ECAI’08), pp843-844.

Dongqiang Yang and Powers, D.M.W. (2006a). Word sense disambiguation using lexical cohesion in the context. In Robert Dale and Cécile Paris, ed. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. 21st International Conference on Computational Linguistics. pp. 929-936.
Dongqiang Yang and Powers, D.M.W. (2006b). Distributional similarity in the varied order of syntactic spaces. In Proceedings of the First International Conference on Innovative Computing, Information and Control. First International Conference on Innovative Computing, Information and Control (ICICIC'06). pp. 406-409.
Dongqiang Yang and Powers, D.M.W. (2006c). Verb similarity on the taxonomy of WordNet. In Petr Sojka, Key-Sun Choi, Christine Fellbaum, Piek Vossen, ed. Proceedings of the Third International WordNet Conference. The Third International WordNet Conference: GWC 2006. pp. 121-128.

David M.W. Powers (2003). Recall and precision versus the bookmaker. Proceedings of the Joint International Conference on Cognitive Science, 529-534.

Jim Entwisle and David M. W. Powers (1998). "The Present Use of Statistics in the Evaluation of NLP Parsers", pp215-224, NeMLaP3/CoNLL98 Joint Conference, Sydney, January 1998.

David M. W. Powers (1992), "On the Significance of Closed Classes and Boundary Conditions: Experiments in Lexical and Syntactic Learning". Background and Experiments in Machine Learning of Natural Language, ITK Proceedings 92/1, Tilburg University, Netherlands,  pp.245-266.

David M. W. Powers and Christopher C. R. Turk (1989), Machine Learning of Natural Language, Springer-Verlag (NewYork/Berlin), ISBN 3-540-19557-2/0-387-19557-2

David M. W. Powers (1984), "Natural Language the Natural Way," Computer Compacts, 100-109 (Jul 1984); abstract also appears in Journal of Symbolic Logic 51(2): 504-505 (1986).

David M. W. Powers (1983), "Neurolinguistics and Psycholinguistics as a Basis for Computer Acquisition of Natural Language," SIGART 84, pp. 29-34 (June 1983).