Facilitating Gaze-based Unknown Word Detection Using Pre-trained Language Models (PLMs)

As non-native speakers learn a new language, encountering words outside one’s vocabulary can impact their reading fluency. Automatically detecting these unknown words can help drive interactive methods to support reading. Previous methods of unknown word detection methods rely on gaze features obtained through eye trackers, with their detection accuracy greatly affected by the accuracy of eye tracking devices. In this work, we present a real-time and high-accuracy unknown word detection method by combining information about the text with gaze data. We utilize pre-trained language models and knowledge grounding to analyze text for unknown word probabilities. We then combine this text data with gaze information through a transformer-based model. The accuracy of our unknown word detection method is 97.6\%, and the F1-score is 71.1\%. The latency is within 1 second. To demonstrate the robustness of our method, we applied our method to another dataset collected with a relatively inaccurate webcam-based eye-tracking system. Our model can achieve the accuracy of 97.3\% and the F1-score of 65.1\% opening the potential for more accessible solutions than previously possible. Finally, we explore the applications space that can be enabled by our method.

Motivation

Automatically detecting these unknown words can help drive interactive methods to support reading.
Eye-tracking devices are expensive and prone to inaccuracies.

Contributions

We propose a real-time gaze-based unknown word detection method that integrates the text information processed by the pre-trained language model. It can achieve the accuracy of 97.6% and the F1-score of 71.1%. The latency is within 1 second.
We demonstrate that our method has the potential to be easily accessed in everyday life through a webcam-based unknown word detection system. The accuracy of our model on the gaze data obtained by webcam is 97.3% and the F1-score is 65.1%.

Method

An encoder-decoder model to encode positional data: The encoder learns to capture the user’s gaze pattern, and the decoder is expected to predict whether a token belongs to an unknown word of the user.
A pre-trained RoBERTa to encode text information: For token-level textual information encoding, we use pre-trained RoBERTa which has the advantage of encoding the token using the surrounding text to establish the context.
Learnable embeddings to encode the knowledge: To further utilize the word-level information, we introduced knowledge including the term frequency, part of speech, and named entity recognition.

Results

Main Result

The accuracy is 97.6% and the F1-score is 71.1%, which is higher than the state-of-arts.
The accuracy on less precise webcam-based gaze data is 97.3%, showing the robustness of our method.

Device	Method	Accuracy	F1-score	Precision	Recall
Eye tracker	Distance Heuristics	76.6	20.6	13.3	45.5
Eye tracker	Fixation Heuristics	82.4	22.9	16.2	39.1
Eye tracker	Logistic Regression	96.6	23.4	15.9	44.5
Webcam	Distance Heuristics	74.1	19.7	12.1	54.0
Webcam	Fixation Heuristics	78.6	21.6	13.8	50.0
Webcam	Logistic Regression	96.7	20.3	13.0	45.6
Eye tracker	Our Method	97.6	71.1	63.3	79.0
Webcam	Our Method	97.3	65.1	60.3	69.7

Ablation Study

It is efficient to use PLMs for better capturing the contextual information of the documents.
Gaze patterns help identify whether a difficult word is an unknown word.

Model	Accuracy	F1-score	Precision	Recall
Our method	97.6	71.1	63.3	79.0
w/o textual encoding	41.8	22.0	13.5	35.2
w/o gaze encoding	97.5	68.5	63.5	74.4
w/o knowledge embedding	97.6	69.2	68.4	70.0

Cross-User Generalizability

The efficacy of PLMs comes from their internal knowledge of identifying tokens in difficult words.
Over-relying on this factor weaken our method’s performance in cross-user settings.

Model	Accuracy	F1-score	Precision	Recall
Cross user	97.3	59.6	52.8	68.4

My Contributions

Proposed this idea, led this project and wrote most of the paper.
Built a web-based pdf reader using pdf.js and React to collect users’ gaze data while reading.
Implemented a transformer-based model to detect unknown words based on gaze and text data.
Showed the robustness of our method on less precise webcam-based gaze data

Share on

Twitter Facebook LinkedIn

Jiexin Ding