IUI '23: Proceedings of the 28th International Conference on Intelligent User Interfaces

SESSION: Keynotes

Pragmatic Communication with Embodied Agents

With the emergence of a new generation of embodied AI agents (e.g., cognitive robots), it has become increasingly important to empower these agents with the ability to learn and collaborate with humans through language communication. Despite recent advances, language communication in embodied AI still faces many challenges. Human language not only needs to be grounded in agents’ perception and action but also needs to facilitate collaboration between humans and agents. To address these challenges, I will introduce several efforts in my lab that study pragmatic communication with embodied agents. I will talk about how language use is shaped by shared experience and knowledge (i.e., common ground) and how collaborative effort is important to mediate perceptual differences and handle exceptions. I will discuss task learning by following language instructions and highlight the need for neuro-symbolic representations for situation awareness and transparency. I will further present explicit modeling of partners’ goals, beliefs, and abilities (i.e., theory of mind) and discuss its role in language communication for situated collaborative tasks.

SESSION: Session 1

Powering an AI Chatbot with Expert Sourcing to Support Credible Health Information Access

During a public health crisis like the COVID-19 pandemic, a credible and easy-to-access information portal is highly desirable. It helps with disease prevention, public health planning, and misinformation mitigation. However, creating such an information portal is challenging because 1) domain expertise is required to identify and curate credible and intelligible content, 2) the information needs to be updated promptly in response to the fast-changing environment, and 3) the information should be easily accessible by the general public, which is particularly difficult when most people do not have domain expertise about the crisis. In this paper, we present an expert-sourcing framework and Jennifer, an AI chatbot that serves as a credible and easy-to-access information portal for individuals during the COVID-19 pandemic. Jennifer was created by a team of over 150 scientists and health professionals around the world, was deployed in the real world, and has answered thousands of user questions about COVID-19. We evaluated Jennifer from the perspectives of two key stakeholders: expert volunteers and information seekers. We first interviewed experts who contributed to the collaborative creation of Jennifer to learn about the challenges in the process and opportunities for future improvement. We then conducted an online experiment that examined Jennifer’s effectiveness in supporting information seekers in locating COVID-19 information and gaining their trust. We share the key lessons learned and discuss design implications for building expert-sourced and AI-powered information portals, along with the risks and opportunities of misinformation mitigation and beyond.

AlphaDAPR: An AI-based Explainable Expert Support System for Art Therapy

Sketch-based drawing assessments in art therapy are widely used to understand individuals’ cognitive and psychological states, such as cognitive impairment or mental disorders. Along with self-report measures based on questionnaires, psychological drawing assessments can augment information about an individual’s psychological state. However, interpreting these drawing assessments requires much time and effort, especially in large-scale groups such as schools or companies, and depends on the experience of the art therapist. To address this issue, we propose an AI-based expert support system, AlphaDAPR, to support art therapists and psychologists in conducting large-scale automatic drawing assessments. Our survey with 64 art therapists showed that 64.06% of participants indicated a willingness to use the proposed system. The results of structural equation modeling highlighted the importance of explainable AI embedded in the interface design, which affects perceived usefulness, trust, satisfaction, and, eventually, intention to use. Interviews revealed that most art therapists showed high levels of intention to use the proposed system, while also expressing some concerns about AI’s possible limitations and threats. We provide a discussion and implications, stressing the importance of clear communication about the collaborative role of AI and users.

AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind Users

Web data items such as shopping products, classifieds, and job listings are indispensable components of most e-commerce websites. The information on these data items is typically distributed over two or more webpages, e.g., a ‘Query-Results’ page showing summaries of the items and ‘Details’ pages containing full information about them. While this organization of data mitigates information overload and visual clutter for sighted users, it increases the interaction overhead and effort for blind users, as back-and-forth navigation between webpages using screen reader assistive technology is tedious and cumbersome. Existing usability-enhancing solutions are unable to provide adequate support in this regard, as they predominantly focus on enabling efficient content access within a single webpage and are not tailored for content distributed across multiple webpages. As an initial step towards addressing this issue, we developed AutoDesc, a browser extension that leverages a custom extraction model to automatically detect and pull out additional item descriptions from the ‘Details’ pages and proactively inject the extracted information into the ‘Query-Results’ page, thereby reducing the amount of back-and-forth screen reader navigation between the two webpages. In a study with 16 blind users, we observed that, within the same time duration, participants were able to peruse significantly more data items on average with AutoDesc than with their preferred screen readers alone or with a state-of-the-art solution.

SeeChart: Enabling Accessible Visualizations Through Interactive Natural Language Interface For People with Visual Impairments

Web-based data visualizations have become very popular for exploring data and communicating insights. Newspapers, journals, and reports regularly publish visualizations to tell compelling stories with data. Unfortunately, most visualizations are inaccessible to readers with visual impairments. For many charts on the web, there are no accompanying alternative (alt) texts, and even when such texts exist, they often do not adequately describe important insights from the charts. To address the problem, we first interviewed 15 blind users to understand their challenges and requirements for reading data visualizations. Based on the insights from these interviews, we developed SeeChart, an interactive tool that automatically deconstructs charts from web pages and converts them into accessible visualizations for blind people, enabling them to hear a chart summary as well as to interact with data points using the keyboard. Our evaluation with 14 blind participants suggests that SeeChart is effective in helping them understand key insights from charts and fulfill their information needs while reducing their required time and cognitive burden.

Taming Entangled Accessibility Forum Threads for Efficient Screen Reading

Accessibility forums enable individuals with visual impairments to connect and collaboratively seek solutions to technical issues, as well as share reviews, best practices, and the latest news. However, these forums are presently built on legacy systems that were primarily designed for sighted users and are difficult to navigate with non-visual assistive technologies like screen readers. Accessibility forum threads are “entangled”, with multiple sub-conversations interleaved with each other. This does not gel with the predominantly linear navigation of screen readers. Screen reader users often listen to reams of irrelevant posts while foraging for nuggets of interest. To address this and improve non-visual interaction efficiency, we present TASER, a browser extension that leverages a state-of-the-art conversation disentanglement algorithm to automatically identify and separate sub-conversations in a forum thread, and then presents these sub-conversations to the user via a custom interface specifically tailored for efficient and usable screen reader interaction. In a user study with 11 screen reader users performing representative information-foraging tasks on accessibility forums, we observed that TASER significantly reduced the average number of user input actions and the interaction times, along with a significant drop in cognitive load (lower NASA-TLX scores), compared to the status quo.
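
To make the mechanism concrete: once a disentanglement model has predicted which pairs of posts belong to the same sub-conversation, grouping the thread reduces to a connected-components computation. The following Python sketch is hypothetical, written from the abstract rather than the paper's implementation, and leaves the disentanglement model itself out of scope; it shows only the final grouping step via union-find.

```python
def group_subconversations(num_posts, same_conversation_links):
    """Group forum posts into sub-conversations via union-find.

    same_conversation_links: pairs (i, j) that a disentanglement model
    predicts belong to the same sub-conversation (model not shown here).
    """
    parent = list(range(num_posts))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in same_conversation_links:
        parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(num_posts):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())  # each list is one sub-conversation

print(group_subconversations(5, [(0, 2), (1, 3)]))  # [[0, 2], [1, 3], [4]]
```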

A Dataset and Machine Learning Approach to Classify and Augment Interface Elements of Household Appliances to Support People with Visual Impairment

Many modern household appliances are challenging to operate for people with visual impairment. Low-contrast designs and insufficient tactile feedback make it difficult to distinguish interface elements and to recognize their function. Augmented reality (AR) can be used to visually highlight such elements and provide assistance to people with residual vision. To realize this goal, we (1) created a dataset consisting of 13,702 images of interfaces from household appliances and manually labeled control elements; (2) trained a neural network to recognize control elements and to distinguish between PushButton, TouchButton, Knob, Slider, and Toggle; and (3) designed various contrast-rich and visually simple AR augmentations for these elements. The results were implemented as a screen-based assistive AR application, which we tested in a user study with six individuals with visual impairment. Participants were able to recognize control elements that were imperceptible without the assistive application. The approach was well received, especially for the potential of familiarizing oneself with novel devices. The automatic parsing and augmentation of interfaces provide an important step toward the independent interaction of people with visual impairments with their everyday environment.

SESSION: Session 2

Lessons from the Development of an Anomaly Detection Interface on the Mars Perseverance Rover using the ISHMAP Framework

While anomaly detection stands among the most important and valuable problems across many scientific domains, anomaly detection research often focuses on AI methods that can lack the nuance and interpretability so critical to conducting scientific inquiry. We believe this exclusive focus on algorithms with a fixed framing ultimately blocks scientists from adopting even high-accuracy anomaly detection models in many scientific use cases. In this application paper, we present the results of an alternative approach that situates the mathematical framing of machine-learning-based anomaly detection within a participatory design framework. In a collaboration with NASA scientists working with the PIXL instrument to study Martian planetary geochemistry as part of the search for extra-terrestrial life, we report on over 18 months of in-context user research and co-design to define the key problems NASA scientists face when looking to detect and interpret spectral anomalies. We address these problems and develop a novel spectral anomaly detection toolkit for PIXL scientists that is highly accurate (93.4% test accuracy on detecting diffraction anomalies) while maintaining strong transparency for scientific interpretation. We also describe outcomes from a yearlong field deployment of the algorithm and associated interface, now used daily as a core component of the PIXL science team’s workflow, and directly situate the algorithm as a key contributor to discoveries around the potential habitability of Mars. Finally, we introduce a new design framework, developed through the course of this collaboration, for co-creating anomaly detection algorithms: Iterative Semantic Heuristic Modeling of Anomalous Phenomena (ISHMAP), which provides a process for scientists and researchers to produce natively interpretable anomaly detection models. This work showcases an example of successfully bridging methodologies from AI and HCI within a scientific domain, and provides a resource in ISHMAP for other researchers and practitioners looking to partner with scientific teams to achieve better science through more effective and interpretable anomaly detection tools.

CoColor: Interactive Exploration of Color Designs

Choosing colors is a pivotal but challenging component of graphic design. This paper presents an intelligent interaction technique supporting designers’ creativity in color design. It fills a gap in the literature by proposing an integrated technique for color exploration, assignment, and refinement: CoColor. Our design goals were to 1) let designers focus on color choice by freeing them from pixel-level editing and 2) support rapid flow between low- and high-level decisions. Our interaction technique involves three steps – choice of focus, choice of suitable colors, and application of the colors to designs – wherein the choices are interlinked and computer-assisted, thus supporting divergent and convergent thinking. It considers color harmony, visual saliency, and elementary accessibility requirements. The technique was incorporated into the popular design tool Figma and evaluated in a study with 16 designers. Participants explored coloring options more easily with CoColor and considered it helpful.
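
To give a flavor of the color-harmony ingredient named above, here is a minimal, purely illustrative Python sketch that generates harmony candidates by rotating a base color's hue; it is not CoColor's actual model, which additionally weighs visual saliency and accessibility requirements.

```python
import colorsys

def harmonic_candidates(base_rgb, hue_offsets=(1/12, -1/12, 1/2)):
    """Return analogous (+/-30 degree) and complementary (180 degree) hue rotations
    of a base color. Offsets are fractions of the hue wheel and are illustrative."""
    h, s, v = colorsys.rgb_to_hsv(*(c / 255 for c in base_rgb))
    candidates = []
    for offset in hue_offsets:
        r, g, b = colorsys.hsv_to_rgb((h + offset) % 1.0, s, v)
        candidates.append(tuple(round(c * 255) for c in (r, g, b)))
    return candidates

print(harmonic_candidates((70, 130, 180)))  # harmony candidates for steel blue
```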

An Investigation into an Always Listening Interface to Support Data Exploration

Natural language interfaces that facilitate data exploration tasks are rapidly gaining interest in the research community because they enable users to focus their attention on the task of inquiry rather than the mechanics of chart construction. Yet, current systems rely solely on processing the user’s explicit commands to generate the user’s intended chart. These commands can be ambiguous due to natural language tendencies such as speech disfluency and underspecification. In this paper, we develop an always-listening interface and study how it can help contextualize imprecise queries. Our study revealed that an always-listening interface is able to use an ongoing conversation to fill in missing properties for imprecise commands, disambiguate inaccurate commands without asking the user for clarification, and even generate charts without being explicitly asked.

Enabling Goal-Focused Exploration of Podcasts in Interactive Recommender Systems

Content recommender systems often rely on modeling users’ past behavioral data to provide personalized recommendations – a practice that works well for suggesting more of the same and for media that require little time investment from users, such as music tracks. However, this approach can be further optimized for media where the user investment is higher, such as podcasts, because there is a broader space of user goals that might not be captured by the implicit signals of their past behavior. Allowing users to directly specify their goals might help narrow the space of possible recommendations. Thus, in this paper, we explore how we can enable goal-focused exploration in recommender systems by leveraging explicit input from users about their personal goals. Using podcast consumption as an example use case, and informed by a large-scale survey (N=68k), we developed GoalPods, an interactive prototype that allows users to set personal goals and build playlists of podcast episode recommendations to meet those goals. We evaluated GoalPods with 14 participants, each of whom set a goal and spent a week listening to the episode playlist created for that goal. From the study, we identified two types of user goals: low-involvement (e.g., “combat boredom”) and high-involvement (e.g., “learn something new”) goals. Users found it easy to identify relevant recommendations for low-involvement goals, but they needed more structure and support to set high-involvement goals. By anchoring users on their personal goals to explore recommendations, GoalPods (and goal-focused podcast consumption) led to insightful content discovery outside users’ filter bubbles. Based on our findings, we discuss opportunities for designing recommender systems that guide exploration via interactive goal-setting, as well as implications for providing better recommendations by accounting for users’ personal goals.

Steering Recommendations and Visualising Its Impact: Effects on Adolescents’ Trust in E-Learning Platforms

Researchers have widely acknowledged the potential of control mechanisms with which end-users of recommender systems can better tailor recommendations. However, few e-learning environments so far incorporate such mechanisms, for example for steering recommended exercises. In addition, studies with adolescents in this context are rare. To address these limitations, we designed a control mechanism and a visualisation of the control’s impact through an iterative design process with adolescents and teachers. Then, we investigated how these functionalities affect adolescents’ trust in an e-learning platform that recommends maths exercises. A randomised controlled experiment with 76 middle school and high school adolescents showed that visualising the impact of exercised control significantly increases trust. Furthermore, having control over their mastery level seemed to inspire adolescents to reasonably challenge themselves and reflect upon the underlying recommendation algorithm. Finally, a significant increase in perceived transparency suggested that visualising steering actions can indirectly explain why recommendations are suitable, which opens interesting research tracks for the broader field of explainable AI.

SESSION: Session 3

Categorical and Continuous Features in Counterfactual Explanations of AI Systems

Recently, eXplainable AI (XAI) research has focused on the use of counterfactual explanations to address interpretability, algorithmic recourse, and bias in AI system decision-making. The proponents of these algorithms claim they meet users’ requirements for counterfactual explanations. For instance, many claim that the outputs of their algorithms work as explanations because they prioritise "plausible", "actionable", or "causally important" features in the generated counterfactuals. However, very few of these claims have been tested in controlled psychological studies, and we know very little about which aspects of counterfactual explanations help users understand AI system decisions. Furthermore, we do not know whether counterfactual explanations are an advance on the more traditional causal explanations that have a much longer history in AI (in explaining expert systems and decision trees). Accordingly, we carried out two user studies to (i) test a fundamental distinction in feature types, between categorical and continuous features, and (ii) compare the relative effectiveness of counterfactual and causal explanations. The studies used a simulated, automated decision-making app that determined safe driving limits after drinking alcohol, based on predicted blood alcohol content, and user responses were measured objectively (users’ predictive accuracy) and subjectively (users’ satisfaction and trust judgments). Study 1 (N=127) showed that users understand explanations referring to categorical features more readily than those referring to continuous features. It also uncovered a dissociation between objective and subjective measures: counterfactual explanations elicited higher predictive accuracy than no-explanation control descriptions but no higher accuracy than causal explanations, yet they elicited greater satisfaction and trust judgments than causal explanations. Study 2 (N=211) found that users were more accurate for categorically transformed features compared to continuous ones, and also replicated the results of Study 1. These findings delineate important boundary conditions for current and future counterfactual explanation methods in XAI.
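
To ground the terminology: a counterfactual explanation reports a minimal change to the input features that would flip the system's decision. Below is a minimal Python sketch in the spirit of the driving-limit app described above; the threshold model and the perturbation set are toy stand-ins, not the study's materials.

```python
import numpy as np

def nearest_counterfactual(x, predict, candidate_deltas):
    """Find the smallest candidate change to x that flips the model's prediction."""
    original = predict(x)
    best, best_cost = None, np.inf
    for delta in candidate_deltas:
        x_cf = x + delta
        if predict(x_cf) != original:            # decision flipped
            cost = np.linalg.norm(delta, ord=1)  # prefer small, sparse changes
            if cost < best_cost:
                best, best_cost = x_cf, cost
    return best

# Toy threshold model on predicted blood alcohol content: 1 = over the safe limit.
predict = lambda x: int(x[0] > 0.05)
x = np.array([0.07])
deltas = [np.array([-d]) for d in np.arange(0.005, 0.05, 0.005)]
print(nearest_counterfactual(x, predict, deltas))  # ~[0.05]: smallest reduction that flips it
```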

Investigating the Intelligibility of Plural Counterfactual Examples for Non-Expert Users: an Explanation User Interface Proposition and User Study

Plural counterfactual examples have been proposed to explain the prediction of a classifier by offering the user several instances of minimal modifications that would change the prediction. Yet, such explanations may provide too much information, generating potential confusion for end-users with no specific knowledge of either machine learning or the application domain. In this paper, we investigate the design of explanation user interfaces for plural counterfactual examples that offer comparative analysis features to mitigate this potential confusion and improve the intelligibility of such explanations for non-expert users. We propose an implementation of such an enhanced explanation user interface, illustrating it in a financial scenario related to a loan application. We then present the results of a lab user study conducted with 112 participants to evaluate the effectiveness of having plural examples and of offering comparative analysis principles, on both the objective understanding of and satisfaction with such explanations. The results demonstrate the effectiveness of the plural condition, on both objective understanding and satisfaction scores, compared to having a single counterfactual example. Besides the statistical analysis, we performed a thematic analysis of the participants’ responses to the open-response questions, which also shows encouraging results for the comparative analysis features on objective understanding.

Directive Explanations for Monitoring the Risk of Diabetes Onset: Introducing Directive Data-Centric Explanations and Combinations to Support What-If Explorations

Explainable artificial intelligence is increasingly used in machine learning (ML) based decision-making systems in healthcare. However, little research has compared the utility of different explanation methods in guiding healthcare experts for patient care. Moreover, it is unclear how useful, understandable, actionable, and trustworthy these methods are for healthcare experts, as they often require technical ML knowledge. This paper presents an explanation dashboard that predicts the risk of diabetes onset and explains those predictions with data-centric, feature-importance, and example-based explanations. We designed an interactive dashboard to assist healthcare experts, such as nurses and physicians, in monitoring the risk of diabetes onset and recommending measures to minimize risk. We conducted a qualitative study with 11 healthcare experts and a mixed-methods study with 45 healthcare experts and 51 diabetic patients to compare the different explanation methods in our dashboard in terms of understandability, usefulness, actionability, and trust. Results indicate that our participants preferred our representation of data-centric explanations, which provide local explanations with a global overview, over the other methods. Accordingly, this paper highlights the importance of visually directive data-centric explanation methods for assisting healthcare experts in gaining actionable insights from patient health records. Furthermore, we share design implications for tailoring the visual representation of different explanation methods for healthcare experts.

Follow the Successful Herd: Towards Explanations for Improved Use and Mental Models of Natural Language Systems

While natural language systems continue improving, they are still imperfect. If a user has a better understanding of how a system works, they may be able to better accomplish their goals even in imperfect systems. We explored whether explanations can support effective authoring of natural language utterances and how those explanations impact users’ mental models in the context of a natural language system that generates small programs. Through an online study (n=252), we compared two main types of explanations: 1) system-focused, which provide information about how the system processes utterances and matches terms to a knowledge base, and 2) social, which provide information about how other users have successfully interacted with the system. Our results indicate that providing social suggestions of terms to add to an utterance helped users to repair and generate correct flows more than system-focused explanations or social recommendations of words to modify. We also found that participants commonly understood some mechanisms of the natural language system, such as the matching of terms to a knowledge base, but they often lacked other critical knowledge, such as how the system handled structuring and ordering. Based on these findings, we make design recommendations for supporting interactions with and understanding of natural language systems.

Subgoal-Based Explanations for Unreliable Intelligent Decision Support Systems

Intelligent decision support (IDS) systems leverage artificial intelligence techniques to generate recommendations that guide human users through the decision making phases of a task. However, a key challenge is that IDS systems are not perfect, and in complex real-world scenarios may produce suboptimal output or fail to work altogether. The field of explainable AI (XAI) has sought to develop techniques that improve the interpretability of black-box systems. While most XAI work has focused on single-classification tasks, the subfield of explainable AI planning (XAIP) has sought to develop techniques that make sequential decision making AI systems explainable to domain experts. Critically, prior work in applying XAIP techniques to IDS systems has assumed that the plan being proposed by the planner is always optimal, and therefore the action or plan being recommended as decision support to the user is always optimal. In this work, we examine novice user interactions with a non-robust IDS system – one that occasionally recommends suboptimal actions, and one that may become unavailable after users have become accustomed to its guidance. We introduce a new explanation type, subgoal-based explanations, for plan-based IDS systems, that supplements traditional IDS output with information about the subgoal toward which the recommended action would contribute. We demonstrate that subgoal-based explanations lead to improved user task performance in the presence of IDS recommendations, improve user ability to distinguish optimal and suboptimal IDS recommendations, and are preferred by users. Additionally, we demonstrate that subgoal-based explanations enable more robust user performance in the case of IDS failure, showing the significant benefit of training users for an underlying task with subgoal-based explanations.

Supporting High-Uncertainty Decisions through AI and Logic-Style Explanations

A common criterion for Explainable AI (XAI) is to support users in establishing appropriate trust in the AI – rejecting advice when it is incorrect and accepting advice when it is correct. Previous findings suggest that explanations can cause over-reliance on the AI (overly accepting advice). Evoking appropriate trust is even more challenging for decision-making tasks that are difficult for both humans and AI. For this reason, we study decision-making by non-experts in the high-uncertainty domain of stock trading. We compare the effectiveness of three different explanation styles (influenced by inductive, abductive, and deductive reasoning) and the role of AI confidence in terms of a) the users’ reliance on the XAI interface elements (charts with indicators, AI prediction, explanation), b) the correctness of the decision (task performance), and c) the agreement with the AI’s prediction. In contrast to previous work, we look at interactions between different aspects of decision-making, including AI correctness, and the combined effects of AI confidence and explanation styles. Our results show that specific explanation styles (abductive and deductive) improve the user’s task performance in the case of high AI confidence, compared to inductive explanations. In other words, these styles of explanation were able to invoke correct decisions (for both positive and negative decisions) when the system was certain. In this condition, the agreement between the user’s decision and the AI prediction confirms the finding, with a significant increase in agreement when the AI is correct. This suggests that both explanation styles are suitable for evoking appropriate trust in a confident AI.

Our findings further indicate a need to consider AI confidence as a criterion for including or excluding explanations from AI interfaces. In addition, this paper highlights the importance of carefully selecting an explanation style according to the characteristics of the task and data.

Drawing with Reframer: Emergence and Control in Co-Creative AI

Over the past few years, rapid developments in AI have produced new models capable of generating high-quality images and creative artefacts, most of which seek to fully automate the process of creation. In stark contrast, creative professionals rely on iteration – to change their mind, to modify their sketches, and to re-imagine. For that reason, end-to-end generative approaches have limited application to real-world design workflows. We present a novel human-AI drawing interface called Reframer, along with a new survey instrument for evaluating co-creative systems. Based on a co-creative drawing model called the Collaborative, Interactive Context-Aware Design Agent (CICADA), Reframer uses CLIP-guided synthesis-by-optimisation to support real-time synchronous drawing with AI. We present two versions of Reframer’s interface, one that prioritises emergence and system agency, the other control and user agency. To begin exploring how these different interaction models might influence the user experience, we also propose the Mixed-Initiative Creativity Support Index (MICSI). MICSI rates co-creative systems along experiential axes relevant to AI co-creation. We administered MICSI and a short qualitative interview to users who engaged with the Reframer variants on two distinct creative tasks. The results show broad overall efficacy of Reframer as a creativity support tool, but MICSI also allows us to begin unpacking the complex interactions between learning effects, task type, visibility, control, and emergent behaviour. We conclude with a discussion of how these findings highlight challenges for future co-creative systems design.

SmartRecorder: An IMU-based Video Tutorial Creation by Demonstration System for Smartphone Interaction Tasks

This work focuses on an active topic in the HCI community: tutorial creation by demonstration. We present a novel tool named SmartRecorder that helps people without video editing skills create video tutorials for smartphone interaction tasks. As automatic interaction trace extraction is a key component of tutorial generation, we seek to tackle the challenges of automatically extracting user interaction traces on smartphones from screencasts. Uniquely with respect to prior research in this field, we combine computer vision techniques with IMU-based sensing algorithms, and the technical evaluation results show the importance of smartphone IMU data in improving system performance. With the extracted key information of each step, SmartRecorder generates initial instructional content and provides tutorial creators with a refinement editor, designed based on a high recall (99.38%) of key steps, to revise the initial instructional content. Finally, SmartRecorder generates video tutorials based on the refined instructional content. The results of a user study demonstrate that SmartRecorder allows non-experts to create smartphone usage video tutorials in less time and with higher satisfaction from recipients.

SESSION: Session 4

Efficient Human-in-the-loop System for Guiding DNNs Attention

Attention guidance is used to address dataset bias in deep learning, where the model relies on incorrect features to make decisions. Focusing on image classification tasks, we propose an efficient human-in-the-loop system to interactively direct the attention of classifiers to regions specified by users, thereby reducing the effect of co-occurrence bias and improving the transferability and interpretability of a deep neural network (DNN). Previous approaches for attention guidance require the preparation of pixel-level annotations and are not designed as interactive systems. We herein present a new interactive method that allows users to annotate images via simple clicks. Additionally, we identify a novel active learning strategy that can significantly reduce the number of annotations. We conduct both numerical evaluations and a user study to evaluate the proposed system using multiple datasets. Compared with the existing non-active-learning approach, which typically relies on considerable amounts of polygon-based segmentation masks to fine-tune or train the DNNs, our system can obtain fine-tuned networks on biased datasets in a more time- and cost-efficient manner and offers a more user-friendly experience. Our experimental results show that the proposed system is efficient, reasonable, and reliable. Our code is publicly available at https://github.com/ultratykis/Guiding-DNNs-Attention.
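
The guidance mechanism can be made concrete with a sketch. Below is one plausible attention-guidance loss in PyTorch, written from the abstract's description rather than from the released code (see the linked repository for the authors' actual implementation): it penalizes attention mass that falls outside the region a user marked via clicks, and would be added to the usual classification objective during fine-tuning.

```python
import torch

def attention_guidance_loss(attn_map, user_mask, eps=1e-8):
    """Penalize attention outside user-approved regions (illustrative formulation).

    attn_map:  (B, H, W) non-negative attention/saliency maps from the classifier.
    user_mask: (B, H, W) binary masks derived from user clicks (1 = relevant region).
    """
    attn = attn_map / (attn_map.sum(dim=(1, 2), keepdim=True) + eps)  # per-image normalization
    outside_mass = (attn * (1.0 - user_mask)).sum(dim=(1, 2))         # mass on irrelevant pixels
    return outside_mass.mean()

# During fine-tuning, combine with the task objective, e.g.:
# loss = cross_entropy(logits, labels) + lam * attention_guidance_loss(attn, mask)
```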

The Impact of Expertise in the Loop for Exploring Machine Rationality

Human-in-the-loop optimization utilizes human expertise to guide machine optimizers iteratively in searching a solution space for an optimal solution. While prior empirical studies mainly investigated novices, we analyzed the impact of the level of expertise on outcome quality and the corresponding subjective satisfaction. We conducted a study (N=60) in text, photo, and 3D mesh optimization contexts. We found that novices can achieve expert-level quality, but participants with higher expertise went through more optimization iterations and expressed more explicit preferences, while remaining less satisfied. In contrast, novices were more easily satisfied and terminated faster. We thus identified that experts seek more diverse outcomes even as the machine reaches optimal results, and this observed behavior can serve as a performance indicator for human-in-the-loop system designers to improve underlying models. We advise future research to be cautious about the impact of user expertise when designing human-in-the-loop systems.

Improving Fairness in Adaptive Social Exergames via Shapley Bandits

Algorithmic fairness is an essential requirement as AI becomes integrated in society. In the case of social applications where AI distributes resources, algorithms must often make decisions that will benefit a subset of users, sometimes repeatedly or exclusively, while attempting to maximize specific outcomes. How should we design such systems to serve users more fairly? This paper explores this question in a setting where a group of users works toward a shared goal in a social exergame called Step Heroes. We identify adverse outcomes in traditional multi-armed bandits (MABs) and formalize the Greedy Bandit Problem. We then propose a solution based on a new type of fairness-aware multi-armed bandit, Shapley Bandits. It uses the Shapley Value to increase overall player participation and intervention adherence, rather than to maximize total group output, which is traditionally achieved by favoring only high-performing participants. We evaluate our approach via a user study (n=46). Our results indicate that Shapley Bandits effectively mitigate the Greedy Bandit Problem and achieve better user retention and motivation across participants.
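
As a sketch of the idea (my own illustration of how Shapley-value-weighted selection might look, not the paper's implementation): compute each player's Shapley value with respect to a shared group objective, then choose whom to target in proportion to those values instead of greedily picking the top performer.

```python
import random
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a coalition value function (fine for small groups)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for coalition in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += weight * (value(set(coalition) | {p}) - value(set(coalition)))
    return phi

# Hypothetical shared goal: total group step count, capped at a weekly target.
steps = {"ann": 9000, "bo": 4000, "cai": 7000}
value = lambda coalition: min(sum(steps[p] for p in coalition), 15000)
phi = shapley_values(list(steps), value)

# Fairness-aware selection: sample a player in proportion to their contribution,
# rather than always intervening on the single highest performer (the greedy policy).
weights = [max(phi[p], 0.0) for p in steps]
target = random.choices(list(steps), weights=weights)[0]
```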

Addressing UX Practitioners’ Challenges in Designing ML Applications: an Interactive Machine Learning Approach

UX practitioners face novel challenges when designing user interfaces for machine learning (ML)-enabled applications. Interactive ML paradigms, like AutoML and interactive machine teaching, lower the barrier for non-expert end users to create, understand, and use ML models, but their application to UX practice is largely unstudied. We conducted a task-based design study with 27 UX practitioners where we asked them to propose a proof-of-concept design for a new ML-enabled application. During the task, our participants were given opportunities to create, test, and modify ML models as part of their workflows. Through a qualitative analysis of our post-task interview, we found that direct, interactive experimentation with ML allowed UX practitioners to tie ML capabilities and underlying data to user goals, compose affordances to enhance end-user interactions with ML, and identify ML-related ethical risks and challenges. We discuss our findings in the context of previously established human-AI guidelines. We also identify some limitations of interactive ML in UX processes and propose research-informed machine teaching as a supplement to future design tools alongside interactive ML.

ScatterShot: Interactive In-context Example Curation for Text Transformation

The in-context learning capabilities of LLMs like GPT-3 allow annotators to customize an LLM to their specific tasks with a small number of examples. However, users tend to include only the most obvious patterns when crafting examples, resulting in underspecified in-context functions that fall short on unseen cases. Further, it is hard to know when “enough” examples have been included even for known patterns. In this work, we present ScatterShot, an interactive system for building high-quality demonstration sets for in-context learning. ScatterShot iteratively slices unlabeled data into task-specific patterns, samples informative inputs from underexplored or not-yet-saturated slices in an active learning manner, and helps users label more efficiently with the help of an LLM and the current example set. In simulation studies on two text perturbation scenarios, ScatterShot sampling improves the resulting few-shot functions by 4-5 percentage points over random sampling, with less variance as more examples are added. In a user study, ScatterShot greatly helps users in covering different patterns in the input space and labeling in-context examples more efficiently, resulting in better in-context learning and less user effort.
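
One round of the loop the abstract describes might look like the following schematic Python; the helper names and data layout are hypothetical, inferred from the abstract rather than taken from the system.

```python
import random

def scattershot_round(slices, examples, llm_propose_output):
    """One ScatterShot-style sampling round (schematic).

    slices:   dict mapping a task-specific pattern name to its unlabeled inputs.
    examples: list of (input, output, pattern) demonstrations collected so far.
    llm_propose_output: hypothetical helper that drafts an output with the LLM,
                        conditioned on the current example set.
    """
    coverage = {p: sum(1 for *_, q in examples if q == p) for p in slices}
    pattern = min(coverage, key=coverage.get)   # prioritize underexplored slices
    x = random.choice(slices[pattern])          # sample an informative input
    draft = llm_propose_output(x, examples)     # LLM drafts; the user verifies cheaply
    y = input(f"{x!r} -> {draft!r}? (enter to accept, or type a fix) ") or draft
    examples.append((x, y, pattern))
    return examples
```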

SESSION: Session 5

IRIS: Interpretable Rubric-Informed Segmentation for Action Quality Assessment

AI-driven Action Quality Assessment (AQA) of sports videos can mimic Olympic judges to help score performances as a second opinion or for training. However, these AI methods are uninterpretable and do not justify their scores, which is important for algorithmic accountability. Indeed, to account for their decisions, instead of scoring subjectively, sports judges use a consistent set of criteria (a rubric) applied to multiple actions in each performance sequence. Therefore, we propose IRIS, which performs Interpretable Rubric-Informed Segmentation on action sequences for AQA. We investigated IRIS for scoring videos of figure skating performances. IRIS predicts (1) action segments, (2) technical element score differences for each segment relative to base scores, (3) multiple program component scores, and (4) the summed final score. In a modeling study, we found that IRIS performs better than non-interpretable, state-of-the-art models. In a formative user study, practicing figure skaters agreed with the rubric-informed explanations, found them useful, and trusted AI judgments more. This work highlights the importance of using judgment rubrics to account for AI decisions.
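
The additive structure of the predictions (items 1–4 above) mirrors figure skating's scoring rubric, where each technical element earns a base value plus a judged deviation, and program components are added on top. A plausible aggregation, sketched in Python from the abstract alone rather than the paper's model:

```python
def final_score(segments, component_scores):
    """Aggregate rubric-informed predictions into a final score (illustrative).

    segments: (base_value, predicted_delta) per detected action segment,
              analogous to an element's base score plus its judged difference.
    component_scores: predicted program component scores.
    """
    technical = sum(base + delta for base, delta in segments)
    return technical + sum(component_scores)

# e.g., three detected elements and five program components:
print(final_score([(5.3, 0.8), (4.2, -0.5), (3.0, 0.2)], [7.5, 7.25, 7.0, 7.5, 7.25]))
```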

Understanding Uncertainty: How Lay Decision-makers Perceive and Interpret Uncertainty in Human-AI Decision Making

Decision Support Systems (DSS) based on Machine Learning (ML) often aim to assist lay decision-makers, who are not math-savvy, in making high-stakes decisions. However, existing ML-based DSS are not always transparent about the probabilistic nature of ML predictions and how uncertain each prediction is. This lack of transparency could give lay decision-makers a false sense of reliability. Growing calls for AI transparency have led to increasing efforts to quantify and communicate model uncertainty. However, there are still gaps in knowledge regarding how and why decision-makers utilize ML uncertainty information in their decision process. Here, we conducted a qualitative, think-aloud user study with 17 lay decision-makers who interacted with three different DSS: 1) an interactive visualization, 2) a DSS based on an ML model that provides predictions without uncertainty information, and 3) the same DSS with uncertainty information. Our qualitative analysis found that communicating uncertainty about ML predictions forced participants to slow down and think analytically about their decisions. This, in turn, made participants more vigilant, resulting in a reduction of over-reliance on ML-based DSS. Our work contributes empirical knowledge on how lay decision-makers perceive, interpret, and make use of uncertainty information when interacting with DSS. Such foundational knowledge informs the design of future ML-based DSS that embrace transparent uncertainty communication.

Resilience Through Appropriation: Pilots’ View on Complex Decision Support

Intelligent decision support tools (DSTs) hold the promise of improving the quality of human decision-making in challenging situations like diversions in aviation. To achieve these improvements, a common goal in DST design is to calibrate decision makers’ trust in the system. However, this perspective is mostly informed by controlled studies and might not fully reflect the real-world complexity of diversions. To understand how DSTs can be beneficial in the view of those who best understand the complexity of diversions, we interviewed professional pilots. To facilitate discussions, we built two low-fidelity prototypes, each representing a different role a DST could assume: (a) actively suggesting and ranking airports based on pilot-specified criteria, and (b) unobtrusively hinting at data points the pilot should be aware of. We find that while pilots would not blindly trust a DST, they at the same time reject deliberate trust calibration at the moment of the decision. We revisit appropriation as a lens through which to understand this seeming contradiction, as well as a range of means to enable appropriation. Aside from the commonly considered need for transparency, these include directability and continuous support throughout the entire decision process. Based on our design exploration, we encourage expanding the view of DST design beyond trust calibration at the point of the actual decision.

Appropriate Reliance on AI Advice: Conceptualization and the Effect of Explanations

AI advice is becoming increasingly popular, e.g., in investment and medical treatment decisions. As this advice is typically imperfect, decision-makers have to exercise discretion as to whether to actually follow it: they have to “appropriately” rely on correct advice and turn down incorrect advice. However, current research on appropriate reliance still lacks a common definition as well as an operational measurement concept. Additionally, no in-depth behavioral experiments have been conducted that help understand the factors influencing this behavior. In this paper, we propose Appropriateness of Reliance (AoR) as an underlying, quantifiable, two-dimensional measurement concept. We develop a research model that analyzes the effect of providing explanations for AI advice. In an experiment with 200 participants, we demonstrate how these explanations influence the AoR and, thus, the effectiveness of AI advice. Our work contributes fundamental concepts for the analysis of reliance behavior and the purposeful design of AI advisors.
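
Since the abstract leaves the two dimensions of AoR abstract, here is one plausible operationalization (my reading, not necessarily the authors' exact definition): measure separately how often people switch to the AI when it is right and they were wrong, and how often they stick with their own judgment when they were right and the AI is wrong.

```python
def appropriateness_of_reliance(cases):
    """Two-dimensional reliance measure over advice-taking episodes (illustrative).

    cases: dicts with boolean keys 'initial' (human's first answer correct?),
           'ai' (advice correct?), and 'final' (post-advice answer correct?).
    """
    ai_right = [c for c in cases if not c["initial"] and c["ai"]]
    ai_wrong = [c for c in cases if c["initial"] and not c["ai"]]
    # Dimension 1: beneficial switches to correct advice.
    relative_ai_reliance = sum(c["final"] for c in ai_right) / max(len(ai_right), 1)
    # Dimension 2: beneficial retention of one's own correct answer.
    relative_self_reliance = sum(c["final"] for c in ai_wrong) / max(len(ai_wrong), 1)
    return relative_ai_reliance, relative_self_reliance
```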

The Role of Lexical Alignment in Human Understanding of Explanations by Conversational Agents

Explainable Artificial Intelligence (XAI) focuses on research and technology that can explain an AI system’s functioning and underlying methods, and on making these explanations better through personalization. Our study investigates a natural language personalization method called lexical alignment in the context of understanding an explanation provided by a conversational agent. The study was run online and navigated participants through an interaction with a conversational agent. Participants faced either an agent designed to align its responses with theirs, a non-aligning agent, or a control condition that did not involve any dialogue. The dialogue delivered an explanation based on a pre-defined set of causes and effects. Recall and understanding of the explanations were evaluated using a combination of Yes-No questions, a Cloze test (fill-in-the-blanks), and What-style questions. The analysis of the test scores revealed a significant advantage in information recall for those who interacted with an aligning agent over participants who either interacted with a non-aligning agent or did not go through any dialogue. The Yes-No questions that included probes on higher-order inferences (understanding) also reflected an advantage for participants who had an aligned dialogue over both the non-aligned and no-dialogue conditions. Overall, the results suggest a positive effect of lexical alignment on the understanding of explanations.

Interacting with Next-Phrase Suggestions: How Suggestion Systems Aid and Influence the Cognitive Processes of Writing

Writing with next-phrase suggestions powered by large language models is becoming more pervasive by the day. However, research to understand writers’ interaction and decision-making processes while engaging with such systems is still emerging. We conducted a qualitative study to shed light on writers’ cognitive processes while writing with next-phrase suggestion systems. To do so, we recruited 14 amateur writers to write two movie reviews each, one without suggestions and one with suggestions. Additionally, we positively and negatively biased the suggestion system to obtain a diverse range of instances where writers’ opinions and the bias in the language model align or misalign to varying degrees. We found that writers interact with next-phrase suggestions in various complex ways: they abstracted and extracted multiple parts of the suggestions and incorporated them into their writing, even when they disagreed with a suggestion as a whole, and they evaluated suggestions against various criteria. The suggestion system also had various effects on the writing process, such as altering the writer’s usual writing plans and leading to higher levels of distraction. Based on our qualitative analysis, using the cognitive process model of writing by Hayes [35] as a lens, we propose a theoretical model of ‘writer-suggestion interaction’ for writing with GPT-2 (and causal language models in general) on a movie review writing task, followed by directions for future research and design.

SESSION: Session 6

Human-AI Collaboration: The Effect of AI Delegation on Human Task Performance and Task Satisfaction

Recent work has proposed artificial intelligence (AI) models that can learn to decide whether to make a prediction for an instance of a task or to delegate it to a human by considering both parties’ capabilities. In simulations with synthetically generated or context-independent human predictions, delegation can help improve the performance of human-AI teams—compared to humans or the AI model completing the task alone. However, so far, it remains unclear how humans perform and how they perceive the task when they are aware that an AI model delegated task instances to them. In an experimental study with 196 participants, we show that task performance and task satisfaction improve through AI delegation, regardless of whether humans are aware of the delegation. Additionally, we identify humans’ increased levels of self-efficacy as the underlying mechanism for these improvements in performance and satisfaction. Our findings provide initial evidence that allowing AI models to take over more management responsibilities can be an effective form of human-AI collaboration in workplaces.

ASAP: Endowing Adaptation Capability to Agent in Human-Agent Interaction

Socially Interactive Agents (SIAs) offer users interactive face-to-face conversations. They can take the role of a speaker and communicate their intentions and emotional states verbally and nonverbally, but they should also act as active listeners and interactive partners. In human-human interaction, interlocutors adapt their behaviors reciprocally and dynamically. Endowing SIAs with such an adaptation capability can allow them to show social and engaging behaviors. In this paper, we focus on modeling this reciprocal adaptation to generate SIA behaviors for both conversational roles, speaker and listener. We propose the Augmented Self-Attention Pruning (ASAP) neural network model. ASAP incorporates a recurrent neural network, the attention mechanism of transformers, and a pruning technique to learn reciprocal adaptation from multimodal social signals. We evaluate our work objectively, via several metrics, and subjectively, through a user perception study in which the SIA behaviors generated by ASAP are compared with those of other state-of-the-art models. Our results demonstrate that ASAP significantly outperforms the state-of-the-art models, showing the importance of reciprocal adaptation modeling.

Scim: Intelligent Skimming Support for Scientific Papers

Scholars need to keep up with an exponentially increasing flood of scientific papers. To aid this challenge, we introduce Scim, a novel intelligent interface that helps experienced researchers skim – or rapidly review – a paper to attain a cursory understanding of its contents. Scim supports the skimming process by highlighting salient paper contents in order to direct a reader’s attention. The system’s highlights are faceted by content type, evenly distributed across a paper, and have a density configurable by readers at both the global and local level. We evaluate Scim with both an in-lab usability study and a longitudinal diary study, revealing how its highlights facilitate the more efficient construction of a conceptualization of a paper. We conclude by discussing design considerations and tensions for the design of future intelligent skimming tools.

The Programmer’s Assistant: Conversational Interaction with a Large Language Model for Software Development

Large language models (LLMs) have recently been applied in software engineering to perform tasks such as translating code between programming languages, generating code from natural language, and autocompleting code as it is being written. When used within development tools, these systems typically treat each model invocation independently from all previous invocations, and only a specific limited functionality is exposed within the user interface. This approach to user interaction misses an opportunity for users to more deeply engage with the model by having the context of their previous interactions, as well as the context of their code, inform the model’s responses. We developed a prototype system – the Programmer’s Assistant – in order to explore the utility of conversational interactions grounded in code, as well as software engineers’ receptiveness to the idea of conversing with, rather than invoking, a code-fluent LLM. Through an evaluation with 42 participants with varied levels of programming experience, we found that our system was capable of conducting extended, multi-turn discussions, and that it enabled additional knowledge and capabilities beyond code generation to emerge from the LLM. Despite skeptical initial expectations for conversational programming assistance, participants were impressed by the breadth of the assistant’s capabilities, the quality of its responses, and its potential for improving their productivity. Our work demonstrates the unique potential of conversational interactions with LLMs for co-creative processes like software development.

Embodied Agents for Obstetric Simulation Training

Post-partum hemorrhaging is a medical emergency that occurs during childbirth and, in extreme cases, can be life-threatening. It is the number one cause of maternal mortality worldwide. High-quality training of medical staff can contribute to early diagnosis and help prevent escalation towards more serious cases. Healthcare education uses manikin-based simulators to train obstetricians for various childbirth scenarios before they train on real patients. However, these medical simulators lack certain key features for portraying important symptoms and are incapable of communicating with trainees. The authors present a digital embodiment agent that can improve the current state of the art, providing a specification of the requirements as well as an extensive design and development approach. This digital embodiment allows educators to respond and role-play as the patient in real time and can easily be integrated with existing training procedures. This research was performed in collaboration with medical experts and makes a new contribution to medical training by bringing digital humans and the representation of affective interfaces to the field of healthcare.

It Seems Smart, but It Acts Stupid: Development of Trust in AI Advice in a Repeated Legal Decision-Making Task

Humans increasingly interact with AI systems, and successful interactions rely on individuals trusting such systems (when appropriate). Considering that trust is fragile and often cannot be restored quickly, we focus on how trust develops over time in a human-AI-interaction scenario. In a 2x2 between-subject experiment, we test how model accuracy (high vs. low) and type of explanation (human-like vs. not) affect trust in AI over time. We study a complex decision-making task in which individuals estimate jail time for 20 criminal law cases with AI advice. Results show that trust is significantly higher for high-accuracy models. Also, behavioral trust does not decline, and subjective trust even increases significantly with high accuracy. Human-like explanations did not generally affect trust but boosted trust in high-accuracy models.

Don’t fail me! The Level 5 Autonomous Driving Information Dilemma regarding Transparency and User Experience

Autonomous vehicles can behave unexpectedly, as automated systems that rely on data-driven machine learning have been shown to infer false predictions or misclassifications, e.g., due to stickers on traffic signs, and can thus fail in some situations. In critical situations, system designs must guarantee safety and reliability. In non-critical situations, however, the possibility of failures resulting in unexpected behaviour should still be considered, as such failures negatively impact the passenger’s user experience and acceptance. We analyse whether an interactive conversational user interface can mitigate negative experiences when interacting with imperfect artificial intelligence systems. In our quantitative interactive online survey (N=113) and comparative qualitative Wizard of Oz study (N=8), users were able to interact with an autonomous SAE level 5 driving simulation. Our findings demonstrate that increased transparency improves user experience and acceptance. Furthermore, we show that additional information in failure scenarios can lead to an information dilemma and should be provided carefully.

Lessons Learned from Designing and Evaluating CLAICA: A Continuously Learning AI Cognitive Assistant

Learning to operate a complex system, such as an agile production line, can be a daunting task. High variability in products and frequent reconfigurations make it difficult to keep documentation up to date and to share new knowledge amongst factory workers. We introduce CLAICA, a Continuously Learning AI Cognitive Assistant that supports workers in the aforementioned scenario. CLAICA learns from (experienced) workers, formalizes new knowledge, stores it in a knowledge base along with contextual information, and shares it when relevant. We conducted a user study with 83 participants who performed eight knowledge exchange tasks with CLAICA, completed a survey, and provided qualitative feedback. Our results provide a deeper understanding of how prior training, context expertise, and interaction modality affect the user experience of cognitive assistants. We draw on our results to derive design and evaluation guidelines for cognitive assistants that support knowledge exchange in fast-paced and demanding environments, such as an agile production line.

SESSION: Session 7

D-Touch: Recognizing and Predicting Fine-grained Hand-face Touching Activities Using a Neck-mounted Wearable

This paper presents D-Touch, a neck-mounted wearable sensing system that can recognize and predict how a hand touches the face. It uses a neck-mounted infrared (IR) camera, which captures images of the head from the neck. These IR camera images are processed and used to train a deep-learning model to recognize and predict touch times and positions. The study showed that D-Touch distinguished 17 facial-related activities (FrA), including 11 face-touch positions and 6 other activities, with over 92.1% accuracy, and predicted hand-touching of the T-zone from other FrA with an accuracy of 82.12% within 150 ms after the hand appeared in the camera view. A study with 10 participants, conducted in their homes without any constraints, showed that D-Touch can predict hand-touching of the T-zone from other FrA with an accuracy of 72.3% within 150 ms after the hand appeared in the camera view. Based on the study results, we further discuss the opportunities and challenges of deploying D-Touch in real-world scenarios.
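
The recognition pipeline described above implies an image classifier mapping IR frames to FrA classes. The PyTorch sketch below shows the general shape of such a model; the architecture, input resolution, and layer sizes are assumptions, not the paper’s network:

```python
# A sketch of the kind of classifier D-Touch's pipeline implies:
# IR frames in, one of 17 facial-related activity (FrA) classes out.
# Input size (64x64) and layer widths are illustrative assumptions.
import torch
import torch.nn as nn

class FrANet(nn.Module):
    def __init__(self, n_classes=17):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 16 * 16, n_classes)
        )

    def forward(self, x):          # x: (batch, 1, 64, 64) IR frames
        return self.head(self.features(x))

logits = FrANet()(torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 17])
```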

FlexType: Flexible Text Input with a Small Set of Input Gestures

In many situations, it may be impractical or impossible to enter text by selecting precise locations on a physical or touchscreen keyboard. We present an ambiguous keyboard with four character groups that has potential applications for eyes-free text entry, as well as text entry using a single switch or a brain-computer interface. We develop a procedure for optimizing these character groupings based on a disambiguation algorithm that leverages a long-span language model. We produce both alphabetically-constrained and unconstrained character groups in an offline optimization experiment and compare them in a longitudinal user study. Our results did not show a significant difference between the constrained and unconstrained character groups after four hours of practice. As expected, participants had significantly more errors with the unconstrained groups in the first session, suggesting a higher barrier to learning the technique. We therefore recommend the alphabetically-constrained character groups, where participants were able to achieve an average entry rate of 12.0 words per minute with a 2.03% character error rate using a single hand and with no visual feedback.
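
The disambiguation described above is a T9-style decoding problem: each typed group sequence maps to several candidate words, and a language model ranks them. A minimal sketch, assuming a hypothetical four-way alphabetical grouping and a toy unigram lexicon in place of the long-span language model:

```python
# Minimal sketch of ambiguous-keyboard disambiguation in the style of
# FlexType (four character groups, T9-like decoding). The groups and the
# toy frequency list are illustrative assumptions, not the paper's data.

# A hypothetical alphabetically-constrained split of a-z into 4 groups.
GROUPS = ["abcdef", "ghijklm", "nopqrs", "tuvwxyz"]
CHAR_TO_GROUP = {c: i for i, g in enumerate(GROUPS) for c in g}

# Stand-in for the long-span language model: unigram log-probabilities.
LEXICON = {"hello": -7.2, "ghost": -9.1, "inmost": -12.4, "the": -3.0}

def code_of(word):
    """Map a word to its sequence of group indices."""
    return tuple(CHAR_TO_GROUP[c] for c in word)

def disambiguate(group_seq):
    """Return lexicon words matching the group sequence, most likely first."""
    matches = [w for w in LEXICON if code_of(w) == group_seq]
    return sorted(matches, key=lambda w: LEXICON[w], reverse=True)

# Typing "hello" produces the group sequence for h,e,l,l,o:
print(disambiguate(code_of("hello")))  # ['hello']
```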

Gaze Speedup: Eye Gaze Assisted Gesture Typing in Virtual Reality

Mid-air text input in augmented or virtual reality (AR/VR) is an open problem. One proposed solution is gesture typing, where the user performs a gesture trace over the keyboard. However, this requires the user to move their hands precisely and continuously, potentially causing arm fatigue. With eye tracking available on AR/VR devices, multiple works have proposed gaze-driven gesture typing techniques. However, such techniques require the explicit use of gaze, which is prone to the Midas touch problem and conflicts with other concurrent gaze activities. In this work, the user is not made aware that their gaze is being used to improve the interaction, making the use of gaze completely implicit. We observed that a user’s implicit gaze fixation location during gesture typing is usually the gesture cursor’s target location when the gesture cursor is moving toward it. Based on this observation, we propose the Speedup method, in which we speed up the gesture cursor toward the user’s gaze fixation location, with a speedup rate that depends on how well the gesture cursor’s moving direction aligns with the direction to the gaze fixation. To reduce overshooting near the target in the Speedup method, we further propose the Gaussian Speedup method, in which the speedup rate is dynamically reduced with a Gaussian function as the gesture cursor gets nearer to the gaze fixation. Using a wrist IMU as input, a 12-person study demonstrated that the Speedup and Gaussian Speedup methods reduced users’ hand movement without any loss of typing speed or accuracy.
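
The two methods lend themselves to a compact sketch: scale the cursor velocity by a gain that grows with the alignment between the cursor’s moving direction and the direction to the gaze fixation, and, in the Gaussian variant, damp that gain near the fixation. The gain formula and constants below are assumptions; the paper defines the actual functions:

```python
# A minimal sketch of the Speedup and Gaussian Speedup ideas described in
# the abstract. The exact gain function, max_gain, and sigma are assumed.
import numpy as np

def speedup_velocity(cursor_pos, cursor_vel, gaze_pos,
                     max_gain=1.5, sigma=0.05, gaussian=False):
    """Scale the gesture-cursor velocity toward the gaze fixation."""
    to_gaze = np.asarray(gaze_pos) - np.asarray(cursor_pos)
    dist = np.linalg.norm(to_gaze)
    speed = np.linalg.norm(cursor_vel)
    if dist < 1e-9 or speed < 1e-9:
        return np.asarray(cursor_vel)
    # Cosine alignment between moving direction and direction to gaze.
    align = float(np.dot(cursor_vel / speed, to_gaze / dist))  # in [-1, 1]
    gain = 1.0 + max_gain * max(align, 0.0)   # speed up only when aligned
    if gaussian:                              # damp the speedup near the target
        gain = 1.0 + (gain - 1.0) * (1.0 - np.exp(-dist**2 / (2 * sigma**2)))
    return np.asarray(cursor_vel) * gain

# A cursor moving straight at the fixation gets the full speedup:
print(speedup_velocity([0, 0], [1, 0], [0.5, 0]))  # ~[2.5, 0.]
```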

Understanding Adoption Barriers to Dwell-Free Eye-Typing: Design Implications from a Qualitative Deployment Study and Computational Simulations

Eye-typing is a slow and cumbersome text entry method typically used by individuals with no other practical means of communication. As an alternative, prior HCI research has proposed dwell-free eye-typing as a potential improvement that eliminates time-consuming and distracting dwell-timeouts. However, it is rare that such research ideas are translated into working products. This paper reports on a qualitative deployment study of a product that was developed to allow users access to a dwell-free eye-typing research solution. This allowed us to understand how such a research solution would work in practice, as part of users’ current communication solutions in their own homes. Based on interviews and observations, we discuss a number of design issues that currently act as barriers preventing widespread adoption of dwell-free eye-typing. The study findings are complemented with computational simulations in a range of conditions that were inspired by the findings in the deployment study. These simulations serve to both contextualize the qualitative findings and to explore quantitative implications of possible interface redesigns. The combined analysis gives rise to a set of design implications for enabling wider adoption of dwell-free eye-typing in practice.

SESSION: Session 8

Evaluating Descriptive Quality of AI-Generated Audio Using Image-Schemas

Novel AI-generated audio samples are evaluated for descriptive qualities such as the smoothness of a morph using crowdsourced human listening tests. However, the methods to design interfaces for such experiments and to effectively articulate the descriptive audio quality under test receive very little attention in the evaluation metrics literature. In this paper, we explore the use of visual metaphors of image-schema to design interfaces to evaluate AI-generated audio. Furthermore, we highlight the importance of framing and contextualizing a descriptive audio quality under measurement using such constructs. Using both pitched sounds and textures, we conduct two sets of experiments to investigate how the quality of responses varies with audio and task complexities. Our results show that, in both cases, by using image-schemas we can improve the quality and consensus of AI-generated audio evaluations. Our findings reinforce the importance of interface design for listening tests and stationary visual constructs to communicate temporal qualities of AI-generated audio samples, especially to naïve listeners on crowdsourced platforms.

An Empirical Study of Model Errors and User Error Discovery and Repair Strategies in Natural Language Database Queries

Recent advances in machine learning (ML) and natural language processing (NLP) have led to significant improvements in natural language interfaces for structured databases (NL2SQL). Despite these great strides, the overall accuracy of NL2SQL models is still far from perfect (∼75% on the Spider benchmark). In practice, users must therefore discern incorrect SQL queries generated by a model and fix them manually when using NL2SQL models. Currently, there is a lack of comprehensive understanding about the common errors in auto-generated SQL queries and the effective strategies to recognize and fix such errors. To bridge the gap, we (1) performed an in-depth analysis of errors made by three state-of-the-art NL2SQL models; (2) distilled a taxonomy of NL2SQL model errors; and (3) conducted a within-subjects user study with 26 participants to investigate the effectiveness of three representative interactive mechanisms for error discovery and repair in NL2SQL. Findings from this paper shed light on the design of future error discovery and repair strategies for natural language data query interfaces.

Perspective: Leveraging Human Understanding for Identifying and Characterizing Image Atypicality

High-quality data plays a vital role in developing reliable image classification models. Despite that, what makes an image difficult to classify remains an unstudied topic. This paper provides a first-of-its-kind, model-agnostic characterization of image atypicality based on human understanding. We consider the setting of image classification “in the wild”, where a large number of unlabeled images are accessible, and introduce a scalable and effective human computation approach for proactive identification and characterization of atypical images. Our approach consists of i) an image atypicality identification and characterization task that presents to the human worker both a local view of visually similar images and a global view of images from the class of interest and ii) an automatic image sampling method that selects a diverse set of atypical images based on both visual and semantic features. We demonstrate the effectiveness and cost-efficiency of our approach through controlled crowdsourcing experiments and provide a characterization of image atypicality based on human annotations of 10K images. We showcase the utility of the identified atypical images by testing state-of-the-art image classification services against such images and provide an in-depth comparative analysis of the alignment between human- and machine-perceived image atypicality. Our findings have important implications for developing and deploying reliable image classification systems.
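
The paper’s sampling method selects a diverse set of atypical images based on visual and semantic features. As one plausible illustration (not the authors’ algorithm), greedy farthest-point sampling picks items far from those already chosen in the embedding space:

```python
# A sketch of diversity-aware sampling in the spirit of the paper's
# atypical-image sampling: greedily pick items far from those already
# chosen (farthest-point sampling). The random features are stand-ins
# for combined visual/semantic embeddings.
import numpy as np

def farthest_point_sample(feats, k, seed=0):
    """Return indices of k mutually distant items in feature space."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(feats)))]
    for _ in range(k - 1):
        # Distance of every item to its nearest already-chosen item.
        d = np.min(np.linalg.norm(feats[:, None] - feats[chosen], axis=2), axis=1)
        chosen.append(int(d.argmax()))
    return chosen

feats = np.random.default_rng(1).normal(size=(100, 32))  # embeddings
print(farthest_point_sample(feats, 5))
```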

AutoDOViz: Human-Centered Automation for Decision Optimization

We present AutoDOViz, an interactive user interface for automated decision optimization (AutoDO) using reinforcement learning (RL). Decision optimization (DO) has classically been practiced by dedicated DO researchers [43], where experts need to spend long periods of time fine-tuning a solution through trial and error. AutoML pipeline search has sought to make it easier for a data scientist to find the best machine learning pipeline by leveraging automation to search for and tune the solution. More recently, these advances have been applied to the domain of AutoDO [36], with the similar goal of finding the best reinforcement learning pipeline through algorithm selection and parameter tuning. However, DO requires significantly more complex problem specification than an ML problem. AutoDOViz seeks to lower the barrier of entry for data scientists in specifying reinforcement learning problems, to leverage the benefits of AutoDO algorithms for RL pipeline search, and, finally, to create visualizations and policy insights that facilitate the typically interactive communication of problem formulations and solution proposals between DO experts and domain experts. In this paper, we report our findings from semi-structured expert interviews with DO practitioners as well as business consultants, leading to design requirements for human-centered automation for DO with RL. We evaluate a system implementation with data scientists and find that they are significantly more open to engaging in DO after using our proposed solution. AutoDOViz further increases trust in RL agent models and makes the automated training and evaluation process more comprehensible. As shown for other automation in ML tasks [33, 59], we also conclude that automation of RL for DO can benefit from the user and vice versa when the interface promotes human-in-the-loop interaction.

SESSION: Session 9

Human-Centered Deferred Inference: Measuring User Interactions and Setting Deferral Criteria for Human-AI Teams

Although deep learning holds the promise of novel and impactful interfaces, realizing such promise in practice remains a challenge: since dataset-driven deep-learned models assume a one-time human input, there is no recourse when they do not understand the input provided by the user. Works that address this via deferred inference—soliciting additional human input when uncertain—show meaningful improvement, but ignore key aspects of how users and models interact. In this work, we focus on the role of users in deferred inference and argue that the deferral criteria should be a function of the user and model as a team, not simply the model itself. In support of this, we introduce a novel mathematical formulation, validate it via an experiment analyzing the interactions of 25 individuals with a deep learning-based visiolinguistic model, and identify user-specific dependencies that are under-explored in prior work. We conclude by demonstrating two human-centered procedures for setting deferral criteria that are simple to implement, applicable to a wide variety of tasks, and perform equal to or better than equivalent procedures that use much larger datasets.
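
The core idea, that deferral criteria should be set for the user-model team rather than the model alone, can be sketched as a per-user threshold search over logged interactions. The utility terms below are illustrative assumptions, not the paper’s mathematical formulation:

```python
# A hedged sketch of deferred inference with a per-user deferral threshold.
# Deferral and error costs are assumed constants for illustration.

def choose_threshold(logs, deferral_cost=0.2, error_cost=1.0):
    """Pick the confidence threshold that maximizes team utility on logs.

    `logs` is a list of (model_confidence, model_was_correct) pairs
    collected from one user's past interactions with the model.
    """
    candidates = sorted({conf for conf, _ in logs})
    def utility(thr):
        u = 0.0
        for conf, correct in logs:
            if conf < thr:
                u -= deferral_cost          # ask the user for more input
            elif not correct:
                u -= error_cost             # silent model error
        return u
    return max(candidates, key=utility)

logs = [(0.9, True), (0.8, True), (0.55, False), (0.4, False), (0.95, True)]
print(choose_threshold(logs))  # 0.8: defers on the low-confidence, wrong cases
```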

SlideSpecs: Automatic and Interactive Presentation Feedback Collation

Presenters often collect audience feedback through practice talks to refine their presentations. In formative interviews, we find that although text feedback and verbal discussions allow presenters to receive feedback, organizing that feedback into actionable presentation revisions remains challenging. Feedback may lack context, be redundant, and be spread across various emails, notes, and conversations. To collate and contextualize both text and verbal feedback, we present SlideSpecs. SlideSpecs lets audience members provide text feedback (e.g., ‘font too small’) while attaching an automatically detected context, including relevant slides (e.g., ‘Slide 7’) or content tags (e.g., ‘slide design’). SlideSpecs also records and transcribes spoken group discussions that commonly occur after practice talks and facilitates linking text critiques to relevant discussion segments. Finally, presenters can use SlideSpecs to review all text and spoken feedback in a single contextually rich interface (e.g., relevant slides, topics, and follow-up discussions). We demonstrate the effectiveness of SlideSpecs by deploying it in eight practice talks with a range of topics and purposes and reporting our findings.

SoundToons: Exemplar-Based Authoring of Interactive Audio-Driven Animation Sprites

Animations can come to life when they are synchronized with relevant sounds. Yet, synchronizing animations to audio requires tedious key-framing or programming, which is difficult for novice creators. There are existing tools that support audio-driven live animation, but they focus primarily on speech and have little or no support for non-speech sounds. We present SoundToons, an exemplar-based authoring tool for interactive, audio-driven animation focusing on non-speech sounds. Our tool enables novice creators to author live animations to a wide variety of non-speech sounds, such as clapping and instrumental music. We support two types of audio interactions: (1) discrete interaction, which triggers animations when a discrete sound event is detected, and (2) continuous interaction, which synchronizes an animation to continuous audio parameters. By employing an exemplar-based iterative authoring approach, we empower novice creators to design and quickly refine interactive animations. User evaluations demonstrate that novice users can author and perform live audio-driven animation intuitively. Moreover, compared to other input modalities such as trackpads or foot pedals, users preferred using audio as an intuitive way to drive animation.
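
The two interaction types map naturally onto an audio envelope: threshold crossings give discrete triggers, and the envelope itself gives a continuous control signal. A minimal sketch with an assumed frame size and threshold:

```python
# A minimal sketch of the two audio interaction types SoundToons supports:
# discrete triggers on detected sound events and continuous mapping of an
# audio parameter. Frame size and threshold are illustrative assumptions.
import numpy as np

def rms_envelope(samples, frame=1024):
    """Continuous control signal: per-frame RMS loudness."""
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    return np.sqrt((frames ** 2).mean(axis=1))

def discrete_triggers(env, threshold=0.5):
    """Fire once per onset: frames where the envelope crosses the threshold."""
    above = env >= threshold
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1

audio = np.concatenate([np.zeros(4096), 0.9 * np.ones(2048), np.zeros(4096)])
env = rms_envelope(audio)
print(discrete_triggers(env))   # onset frame index -> trigger a "clap" sprite
# env itself can drive a continuous parameter, e.g. a character's mouth size.
```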

Interviewing the Interviewer: AI-generated Insights to Help Conduct Candidate-centric Interviews

The most popular way to assess talent around the world is through interviews. Interviewers contribute substantially to candidate experience in many organizations' hiring strategies. However, there is a lack of comprehensive understanding of what makes for a good interview experience and how interviewers can conduct candidate-centric interviews. An exploratory study with 123 candidates revealed critical metrics about interviewer behavior that affect candidate experience. These metrics informed the design of our AI-driven SmartView system, which provides automated post-interview feedback to interviewers. The system was deployed in the real world for three weeks with 35 interviewers. According to our study, most interviewers found that SmartView insights helped identify areas for improvement and could assist them in improving their interviewing skills.

Supporting Requesters in Writing Clear Crowdsourcing Task Descriptions Through Computational Flaw Assessment

Quality control is an, if not the, essential challenge in crowdsourcing. Unsatisfactory responses from crowd workers have been found to result particularly from ambiguous and incomplete task descriptions, often written by inexperienced task requesters. However, creating clear task descriptions with sufficient information is a complex process for requesters in crowdsourcing marketplaces. In this paper, we investigate the extent to which requesters can be supported effectively in this process through computational techniques. To this end, we developed a tool that enables requesters to iteratively identify and correct eight common clarity flaws in their task descriptions before deployment on the platform. The tool can be used to write task descriptions from scratch or to assess and improve the clarity of prepared descriptions. It employs machine learning-based natural language processing models, trained on real-world task descriptions, that score a given task description for the eight clarity flaws. On this basis, the requester can iteratively revise and reassess the task description until it reaches a sufficient level of clarity. In a first user study, we let requesters create task descriptions using the tool and then rate different aspects of the tool’s helpfulness. We then carried out a second user study with crowd workers, who are confronted with such descriptions in practice, to rate the clarity of the created task descriptions. According to our results, 65% of the requesters classified the helpfulness of the information provided by the tool as high or very high (only 12% as low or very low). The requesters saw some room for improvement, though, for example concerning the display of bad examples. Nevertheless, 76% of the crowd workers believe that the overall clarity of the task descriptions created by the requesters using the tool improved over the initial versions. In line with this, the automatically computed clarity scores of the edited task descriptions were generally higher than those of the initial descriptions, indicating that the tool reliably predicts the clarity of task descriptions in overall terms.
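
The iterative score-and-revise loop can be sketched as follows. The flaw names and the trivial keyword scorer are placeholders standing in for the paper’s eight flaws and trained NLP models:

```python
# A minimal sketch of the score-and-revise loop around clarity flaws.
# Only four placeholder flaws are shown (the paper covers eight), and the
# scorer is a toy stand-in for the trained models.

def score_flaws(description):
    """Return a 0..1 severity per flaw (stand-in for the trained models)."""
    text = description.lower()
    return {
        "missing examples": 0.0 if "example" in text else 0.9,
        "unclear payment": 0.0 if "$" in description else 0.7,
        "ambiguous instructions": 0.8 if "etc." in text else 0.1,
        "undefined terms": 0.2,
    }

def needs_revision(description, threshold=0.5):
    """Flaws the requester should still address before deployment."""
    return [f for f, s in score_flaws(description).items() if s >= threshold]

draft = "Label the images, etc."
print(needs_revision(draft))
# ['missing examples', 'unclear payment', 'ambiguous instructions']
```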

Advice Provision in Teleoperation of Autonomous Vehicles

Teleoperation of autonomous vehicles has been gaining a lot of attention recently and is expected to play an important role in helping autonomous vehicles handle difficult situations which they cannot handle on their own. In such cases, a remote driver located in a teleoperation center can remotely drive the vehicle until the situation is resolved. However, teledriving is a challenging task and requires many cognitive resources from the teleoperator. Our goal is to assist the remote driver in some complex situations by giving the driver appropriate advice. The advice is displayed on the driver’s screen to help her make the right decision. To this end, we introduce the TeleOperator Advisor (TOA), an adaptive agent that provides assisting advice to a remote driver. We evaluate the TOA in a simulation-based setting in two scenarios: overtaking a slow vehicle and passing through a traffic light. Results indicate that our advice helps to reduce the cognitive load of the remote driver and improve driving performance.

SESSION: Session 10

Masktrap: Designing and Identifying Gestures to Transform Mask Strap into an Input Interface

Embedding technology into day-to-day wearables and creating smart devices such as smartwatches and smart-glasses has been a growing area of interest. In this paper, we explore the interaction around face masks, a common accessory worn by many to prevent the spread of infectious diseases. Particularly, we propose a method of using the straps of a face mask as an input medium. We identified a set of plausible gestures on mask straps through an elicitation study (N = 20), in which the participants proposed different gestures for a given referent. We then developed a prototype to identify the gestures performed on the mask straps and present the recognition accuracy from a user study with eight participants. Our results show the system achieves 93.07% classification accuracy for 12 gestures.

Physiologically Attentive User Interface for Improved Robot Teleoperation

User interfaces (UIs) are shifting from being attention-hungry to being attentive to users’ needs during interaction. Interfaces developed for robot teleoperation can be particularly complex, often displaying large amounts of information, which can increase the cognitive overload that harms operator performance. This paper presents the development of a Physiologically Attentive User Interface (PAUI) prototype, preliminarily evaluated with six participants. A case study on Urban Search and Rescue (USAR) operations with a teleoperated robot was used, although the proposed approach aims to be generic. The robot considered provides an overly complex Graphical User Interface (GUI) and does not allow access to its source code. This represents a recurring and challenging scenario in which robots are still in use but no longer receive technical updates, which usually leads to their abandonment. A major contribution of the approach is the possibility of recycling old systems while improving the UI made available to end users, taking their physiological data as input. The proposed PAUI analyses physiological data, facial expressions, and eye movements to classify three mental states (rest, workload, and stress). An Attentive User Interface (AUI) is then assembled by recycling a pre-existing GUI, which is dynamically modified according to the predicted mental state to improve the user's focus during mentally demanding situations. In addition to the novelty of PAUIs that take advantage of pre-existing GUIs, this work also contributes the design of a user experiment comprising mental state induction tasks that successfully trigger high and low cognitive overload states. Results from the preliminary user evaluation revealed a tendency toward improved usefulness and ease of use of the PAUI, although without statistical significance due to the small number of subjects.

The Importance of Multimodal Emotion Conditioning and Affect Consistency for Embodied Conversational Agents

Previous studies regarding the perception of emotions for embodied virtual agents have shown the effectiveness of using virtual characters in conveying emotions through interactions with humans. However, creating an autonomous embodied conversational agent with expressive behaviors presents two major challenges. The first challenge is the difficulty of synthesizing conversational behaviors for each modality that are as expressive as real human behaviors. The second challenge is that the affects are modeled independently, which makes it difficult to generate multimodal responses with consistent emotions across all modalities. In this work, we propose a conceptual framework, ACTOR (Affect-Consistent mulTimodal behaviOR generation), that aims to increase the perception of affects by generating multimodal behaviors conditioned on a consistent driving affect. We conducted a user study with 199 participants to assess how the average person judges the affects perceived from multimodal behaviors that are consistent and inconsistent with respect to a driving affect. The results show that among all model conditions, our affect-consistent framework receives the highest Likert scores for the perception of driving affects. Our statistical analysis suggests that making a modality affect-inconsistent significantly decreases the perception of driving affects. We also observe that multimodal behaviors conditioned on consistent affects are more expressive compared to behaviors with inconsistent affects. Therefore, we conclude that multimodal emotion conditioning and affect consistency are vital to enhancing the perception of affects for embodied conversational agents.

TransASL: A Smart Glass based Comprehensive ASL Recognizer in Daily Life

Sign language is a primary language used by deaf and hard-of-hearing (DHH) communities. However, existing sign language translation solutions focus primarily on recognizing manual markers. Non-manual markers, such as negative head shaking, question markers, and mouthing, are critical grammatical and semantic components of sign language, and recognizing them is essential for usability and generalizability. Considering the significant role of non-manual markers, we propose TransASL, a real-time, end-to-end system for sign language recognition and translation. TransASL extracts features from both manual and non-manual markers via a customized eyeglasses-style wearable device with two parallel sensing modalities. Manual marker information is collected by two pairs of outward-facing microphones and speakers mounted on the legs of the eyeglasses, while non-manual marker information is acquired by a pair of inward-facing microphones and speakers connected to the eyeglasses. Both manual and non-manual marker features undergo a multi-modal, multi-channel fusion network and are eventually recognized as comprehensible ASL content. We evaluate the recognition performance on various sign language expressions at both the word and sentence levels. On 80 frequently used ASL words and 40 meaningful sentences consisting of manual and non-manual markers, TransASL achieves word error rates (WER) of 8.3% and 7.1%, respectively. Our work reveals great potential for convenient ASL recognition in daily communication between ASL signers and hearing people.

VR-LENS: Super Learning-based Cybersickness Detection and Explainable AI-Guided Deployment in Virtual Reality

Virtual reality (VR) systems are known for their susceptibility to cybersickness, which can seriously hinder users’ experience. Therefore, a plethora of recent research has proposed automated methods based on machine learning (ML) and deep learning (DL) to detect cybersickness. However, these detection methods are computationally intensive and operate as black boxes; they are thus neither trustworthy nor practical for deployment on standalone VR head-mounted displays (HMDs). This work presents VR-LENS, an explainable artificial intelligence (XAI)-based framework for developing cybersickness detection ML models, explaining them, reducing their size, and deploying them on a Qualcomm Snapdragon 750G processor-based Samsung A52 device. Specifically, we first develop a novel super learning-based ensemble ML model for cybersickness detection. Next, we employ post-hoc explanation methods, such as SHapley Additive exPlanations (SHAP), Morris Sensitivity Analysis (MSA), Local Interpretable Model-Agnostic Explanations (LIME), and Partial Dependence Plots (PDP), to explain the expected results and identify the most dominant features. The super learner cybersickness model is then retrained using the identified dominant features. Our proposed method identified eye tracking, player position, and galvanic skin/heart rate response as the most dominant features of the integrated sensor, gameplay, and bio-physiological datasets. We also show that the proposed XAI-guided feature reduction significantly reduces model training and inference time by 1.91X and 2.15X while maintaining baseline accuracy. For instance, using the integrated sensor dataset, our reduced super learner model outperforms state-of-the-art work in classifying cybersickness into four classes (none, low, medium, and high) and regresses the FMS score (1–10) with a root mean square error (RMSE) of 0.03. Our proposed method can help researchers analyze, detect, and mitigate cybersickness in real time and deploy the super learner-based cybersickness detection model on standalone VR headsets.
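
The XAI-guided feature reduction follows a familiar recipe: explain a trained model, keep the dominant features, and retrain a smaller model. The sketch below uses permutation importance as a stand-in for SHAP and the paper’s other explainers, on a synthetic dataset; the model, feature count, and data are all assumptions:

```python
# A sketch of XAI-guided feature reduction in the spirit of VR-LENS:
# explain, keep the most influential features, retrain. Permutation
# importance stands in for SHAP/LIME/MSA/PDP; the data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Rank features by how much shuffling them hurts held-out accuracy.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
top = np.argsort(imp.importances_mean)[::-1][:8]   # keep 8 dominant features

# Retrain a smaller, faster model on the reduced feature set.
small = RandomForestClassifier(random_state=0).fit(X_tr[:, top], y_tr)
print(small.score(X_te[:, top], y_te))
```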

EarPPG: Securing Your Identity with Your Ears

Wearable devices have become indispensable gadgets in people’s daily lives; wireless earphones in particular have experienced unprecedented growth in recent years, leading to increasing interest in and exploration of user authentication techniques. Conventional user authentication methods embedded in wireless earphones that use microphones or other modalities are vulnerable to environmental factors, such as loud noises or occlusions. To address this limitation, we introduce EarPPG, a new biometric modality that takes advantage of unique in-ear photoplethysmography (PPG) signals, altered by a user’s unique speaking behaviors. When the user is speaking, muscle movements cause changes in the blood vessel geometry, inducing unique PPG signal variations. As both speaking behaviors and PPG signals are unique, EarPPG combines the two biometric traits and presents a secure and obscure authentication solution. The system first detects and segments EarPPG signals and then extracts effective features to construct a user authentication model with a 1D ReGRU network. We conducted comprehensive real-world evaluations with 25 human participants and achieved 94.84% accuracy, with precision, recall, and F1-score all at 0.95. Moreover, considering the practical implications, we conducted several extensive in-the-wild experiments covering body motion, occlusion, lighting, and permanence. The overall outcomes of this study have the potential to be embedded in future smart earable devices.

SESSION: Session 11

TimToShape: Supporting Practice of Musical Instruments by Visualizing Timbre with 2D Shapes based on Crossmodal Correspondences

Timbre is high-dimensional and sensuous, making it difficult for musical-instrument learners to improve their timbre. Some systems exist to support this, but they require expert labeling for timbre evaluation, while merely visualizing the results of unsupervised learning lacks intuitive feedback because human perception is not considered. We therefore employ crossmodal correspondences for intuitive visualization of timbre. We designed TimToShape, a system that visualizes timbre with 2D shapes based on the user’s input of timbre–shape correspondences. TimToShape generates a shape morphed by linear interpolation according to the timbre’s position in a latent space obtained by unsupervised learning with a variational autoencoder (VAE). We confirmed that people perceived shapes generated by TimToShape as corresponding more closely to timbre than randomly generated shapes. Furthermore, a user study with six violin players revealed that TimToShape was well received in terms of visual clarity and interpretability.
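
The morphing step can be sketched directly: place the encoded timbre in the latent space and linearly interpolate between user-provided anchor shapes. The 2D latent anchors and polygons below are illustrative assumptions; the real system learns its latent space from audio with a VAE:

```python
# A minimal sketch of shape morphing by linear interpolation in a latent
# space, as TimToShape does with a VAE latent space. Anchors are assumed.
import numpy as np

# User-provided timbre-shape correspondences: latent anchor -> 2D polygon.
z_a = np.array([0.0, 0.0])
shape_a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
z_b = np.array([1.0, 1.0])
shape_b = np.array([[0.5, -0.2], [1.2, 0.5], [0.5, 1.2], [-0.2, 0.5]])

def morph(z):
    """Interpolate between the anchor shapes based on z's latent position."""
    seg = z_b - z_a
    t = np.clip(np.dot(z - z_a, seg) / np.dot(seg, seg), 0.0, 1.0)
    return (1 - t) * shape_a + t * shape_b

# A timbre encoded halfway between the anchors yields the halfway shape.
print(morph(np.array([0.5, 0.5])))
```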

Mixed Multi-Model Semantic Interaction for Graph-based Narrative Visualizations

Narrative sensemaking is an essential part of understanding sequential data. Narrative maps are a visual representation model that can assist analysts in understanding narratives. In this work, we present a semantic interaction (SI) framework for narrative maps that can support analysts through their sensemaking process. In contrast to traditional SI systems, which rely on dimensionality reduction and work on a projection space, our approach has an additional abstraction layer—the structure space—that builds upon the projection space and encodes the narrative in a discrete structure. This extra layer introduces additional challenges that must be addressed when integrating SI with the narrative extraction pipeline. We address these challenges by presenting the general concept of Mixed Multi-Model Semantic Interaction (3MSI)—an SI pipeline where the highest-level model corresponds to an abstract discrete structure and the lower-level models are continuous. To evaluate the performance of our 3MSI models for narrative maps, we present a quantitative simulation-based evaluation and a qualitative evaluation with case studies and expert feedback. We find that our SI system can model the analysts’ intent and support incremental formalism for narrative maps.

Living Memories: AI-Generated Characters as Digital Mementos

Every human culture has developed practices and rituals associated with remembering people of the past, be it for mourning, cultural preservation, or learning about historical events. In this paper, we present the concept of “Living Memories”: interactive digital mementos created from the journals, letters, and data that an individual has left behind. Like an interactive photograph, a living memory can be talked to and asked questions, making the knowledge, attitudes, and past experiences of a person easily accessible. To demonstrate our concept, we created an AI-based system for generating living memories from any data source and implemented living memories of the three historical figures “Leonardo Da Vinci”, “Murasaki Shikibu”, and “Captain Robert Scott”. As a second key contribution, we present a novel metrics scheme for evaluating the accuracy of living memory architectures and show that our pipeline’s accuracy improves over baselines. Finally, we compare the user experience and learning effects of interacting with the living memory of Leonardo Da Vinci to reading his journal. Our results show that interacting with the living memory, in addition to simply reading a journal, increases learning effectiveness and motivation to learn about the character.
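
One plausible building block for such a system (an assumption on our part, not the paper’s stated architecture) is retrieving the journal passages most relevant to a question before generating an answer. A TF-IDF retrieval sketch:

```python
# A hedged sketch of the retrieval step a "living memory" pipeline could
# use: find the journal passages most similar to a question, which can
# then condition a generative model. TF-IDF is a stand-in assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

journal = [
    "Today I sketched studies of flowing water and eddies.",
    "The flight of birds suggests a machine with articulated wings.",
    "Dissection notes: the heart has four chambers, not two.",
]
vec = TfidfVectorizer().fit(journal)
passages = vec.transform(journal)

def recall(question, k=1):
    """Return the k journal passages most similar to the question."""
    sims = cosine_similarity(vec.transform([question]), passages)[0]
    return [journal[i] for i in sims.argsort()[::-1][:k]]

print(recall("How would you build a flying machine?"))
```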

Pearl: A Technology Probe for Machine-Assisted Reflection on Personal Data

Reflection on one’s personal data can be an effective tool for supporting wellbeing. However, current wellbeing reflection support tools tend to offer a one-size-fits-all approach, ignoring the diversity of people’s wellbeing goals and their agency in the self-reflection process. In this work, we identify an opportunity to help people work toward their wellbeing goals by empowering them to reflect on their data on their own terms. Through a formative study, we inform the design and implementation of Pearl, a workplace wellbeing reflection support tool that allows users to explore their personal data in relation to their wellbeing goal. Pearl is a calendar-based interactive machine teaching system that allows users to visualize data sources and tag regions of interest on their calendar. In return, the system provides insights about these tags that can be saved to a reflection journal. We used Pearl as a technology probe with 12 participants without data science expertise and found that all participants successfully gained insights into their workplace wellbeing. In our analysis, we discuss how Pearl’s capabilities facilitate insights, the role of machine assistance in the self-reflection process, and the data sources that participants found most insightful. We conclude with design dimensions for intelligent reflection support systems as inspiration for future work.

Large-scale Text-to-Image Generation Models for Visual Artists’ Creative Works

Large-scale Text-to-image Generation Models (LTGMs) (e.g., DALL-E), self-supervised deep learning models trained on a huge dataset, have demonstrated the capacity for generating high-quality open-domain images from multi-modal input. Although they can even produce anthropomorphized versions of objects and animals, combine irrelevant concepts in reasonable ways, and give variation to any user-provided images, we witnessed such rapid technological advancement left many visual artists disoriented in leveraging LTGMs more actively in their creative works. Our goal in this work is to understand how visual artists would adopt LTGMs to support their creative works. To this end, we conducted an interview study as well as a systematic literature review of 72 system/application papers for a thorough examination. A total of 28 visual artists covering 35 distinct visual art domains acknowledged LTGMs’ versatile roles with high usability to support creative works in automating the creation process (i.e., automation), expanding their ideas (i.e., exploration), and facilitating or arbitrating in communication (i.e., mediation). We conclude by providing four design guidelines that future researchers can refer to in making intelligent user interfaces using LTGMs.

Interactive User Interface for Dialogue Summarization

Summarization is an important natural language processing task used to distill information. Recently, sequence-to-sequence methods have been applied, in a general manner, to summarization tasks. The problem is that such models must be pre-trained on a large amount of domain-specific data and cannot utilize information other than the input statements. To compensate for this shortcoming, controllable summarization has recently been in the spotlight. We introduce three properties into controllable summarization: 1) a new human-machine communication input format, 2) a robust constraint-sensitive summarization method for this format, and 3) a practical interactive summarization interface available to the user. Experiments on the Wizard-of-Wikipedia dataset show that applying this input format and the constraint-sensitive method enhances summarization performance compared to the typical method. A user study shows that the interactive summarization interface is practical and that participants evaluated it positively.

User-Driven Support for Visualization Prototyping in D3

Templates have emerged as an effective approach to simplifying the visualization design and programming process. For example, they enable users to quickly generate multiple visualization designs even when using complex toolkits like D3. However, these templates are often treated as rigid artifacts that respond poorly to changes made outside of the template’s established parameters, limiting user creativity. Preserving the user’s creative flow requires a more dynamic approach to template-based visualization design, where tools can respond gracefully to users’ edits when they modify templates in unexpected ways. In this paper, we leverage the structural similarities revealed by templates to design resilient support features for prototyping D3 visualizations: recommendations to suggest complementary interactions for a user’s D3 program; and code augmentation to implement recommended interactions with a single click, even when users deviate from pre-defined templates. We demonstrate the utility of these features in Mirny, a design-focused prototyping environment for D3. In a user study with 20 D3 users, we find that these automated features enable participants to prototype their design ideas with significantly fewer programming iterations. We also characterize key modification strategies used by participants to customize D3 templates. Informed by our findings and participants’ feedback, we discuss the key implications of the use of templates for interleaving visualization programming and design.