The rapid advancement of the automotive industry towards automated and semi-automated vehicles has rendered traditional methods of vehicle interaction, such as touch-based and voice command systems, inadequate for a widening range of non-driving related tasks, such as referencing objects outside of the vehicle. Consequently, research has shifted toward gestural input (e.g., hand, gaze, and head pose gestures) as a more suitable mode of interaction during driving. However, due to the dynamic nature of driving and individual variation, there are significant differences in drivers’ gestural input performance. While, in theory, this inherent variability could be moderated by substantial data-driven machine learning models, prevalent methodologies lean towards constrained, single-instance trained models for object referencing. These models show a limited capacity to continuously adapt to the divergent behaviors of individual drivers and the variety of driving scenarios. To address this, we propose IcRegress, a novel regression-based incremental learning approach that adapts to changing behavior and the unique characteristics of drivers engaged in the dual task of driving and referencing objects. We suggest a more personalized and adaptable solution for multimodal gestural interfaces, employing continuous lifelong learning to enhance driver experience, safety, and convenience. Our approach was evaluated using an outside-the-vehicle object referencing use case, highlighting the superiority of the adapted incremental learning models over a single-instance trained model across driver traits such as handedness and driving experience, as well as numerous driving conditions. Finally, to facilitate reproducibility, ease deployment, and promote further research, we offer our approach as an open-source framework at https://github.com/amrgomaaelhady/IcRegress.
Traditionally, sports commentators provide viewers with diverse information, encompassing in-game developments and player performances. Yet young adult football viewers increasingly use mobile devices for deeper insights during football matches. Such insights into players on the pitch and performance statistics support viewers’ understanding of game stakes, creating a more engaging viewing experience. Inspired by commentators’ traditional roles and to incorporate information into a single platform, we developed AiCommentator, a Multimodal Conversational Agent (MCA) for embedded visualization and conversational interactions in football broadcast video. AiCommentator integrates embedded visualization, either with an automated non-interactive or with a responsive interactive commentary mode. Our system builds upon multimodal techniques, integrating computer vision and large language models, to demonstrate ways to design tailored, interactive sports-viewing content. AiCommentator’s event system infers game states based on a multi-object tracking algorithm and computer vision backend, facilitating automated responsive commentary. We address three key topics: evaluating young adults’ satisfaction and immersion across the two viewing modes, enhancing viewer understanding of in-game events and players on the pitch, and devising methods to present this information in a usable manner. In a mixed-method evaluation (n=16) of AiCommentator, we found that the participants appreciated aspects of both system modes but preferred the interactive mode, expressing a higher degree of engagement and satisfaction. Our paper reports on our development of AiCommentator and presents the results from our user study, demonstrating the promise of interactive MCAs for a more engaging sports-viewing experience. Systems like AiCommentator could be pivotal in transforming the interactivity and accessibility of sports content, revolutionizing how sports viewers engage with video content.
Recent years have witnessed the dramatic growth of Virtual YouTubers (VTubers) as a new business on social media platforms such as YouTube, Twitch, and TikTok. However, a significant challenge arises when VTuber voice actors face health issues or retire, jeopardizing the continuity of their avatars’ recognizable voices. A potential solution, reminiscent of Conan’s Bow Tie voice changer in the popular animation Case Closed (i.e., Detective Conan), inspired our work. To make this a reality, we introduce VTuberBowTie, a user-friendly streaming voice conversion system for real-time VTuber livestreaming. We propose an innovative streaming voice conversion approach that tackles the challenges of limited context modeling and bidirectional context dependence inherent to conventional real-time voice conversion. Rather than processing the voice stream in individual data chunks, our approach adopts a fully sequential structure that leverages contextual information preceding the input chunk, thereby expanding the perceptual range and enabling seamless concatenation. Moreover, we developed a ready-to-use interaction interface for VTuberBowTie and deployed it on various computing platforms. The experimental results show that VTuberBowTie can achieve high-quality voice conversion in a streaming manner with a latency of 179.1ms on CPU and 70.8ms on GPU while providing users with a friendly interactive experience.
Synthesizers are powerful tools that allow musicians to create dynamic and original sounds. Existing commercial interfaces for synthesizers typically require musicians to interact with complex low-level parameters or to manage large libraries of premade sounds. To address these challenges, we implement SynthScribe — a fullstack system that uses multimodal deep learning to let users express their intentions at a much higher level. We implement features which address a number of difficulties, namely 1) searching through existing sounds, 2) creating completely new sounds, and 3) making meaningful modifications to a given sound. This is achieved with three main features: a multimodal search engine for a large library of synthesizer sounds; a user-centered genetic algorithm by which completely new sounds can be created and selected given the user’s preferences; and a sound editing support feature which highlights and gives examples for key control parameters with respect to a text- or audio-based query. The results of our user studies show that SynthScribe is capable of reliably retrieving and modifying sounds while also affording the ability to create completely new sounds that expand a musician’s creative horizons.
Recent research has demonstrated the potential for representing intelligent guidance using multimodal cues, yet few guidelines or processes exist to guide the design of such a system. In this paper, we seek to address this gap by investigating the design of multimodal assistant systems for setting the optimal parameters in industrial plants. We present the results of our study conducted with 22 participants to evaluate the effectiveness and experience of different combinations of visual (Highlights and Ambient lights) and haptic (Clicks and Vibration) modalities for providing intelligent dynamic guidance. Our findings demonstrate that providing intelligent guidance with the combined Highlights+Ambient modalities resulted in shorter task durations and higher practicality than Ambient lights alone. Moreover, Highlights+Ambient+Vibration guidance was rated lower in usability than Highlights+Ambient, as well as higher in mental demand than Highlights alone.
Recent progress in Text-to-Image (T2I) models promises transformative applications in art, design, education, medicine, and entertainment. These models, exemplified by DALL-E, Imagen, and Stable Diffusion, have the potential to revolutionize various industries. However, a primary concern is that they operate as a ‘black box’ for many users. Without understanding the underlying mechanics, users are unable to harness the full potential of these models. This study focuses on bridging this gap by developing and evaluating explanation techniques for T2I models, targeting inexperienced end users. While prior works have delved into Explainable AI (XAI) methods for classification or regression tasks, T2I generation poses distinct challenges. Through formative studies with experts, we identified unique explanation goals and subsequently designed tailored explanation strategies. We then empirically evaluated these methods with a cohort of 473 participants from Amazon Mechanical Turk (AMT) across three tasks. Our results highlight users’ ability to learn new keywords through explanations, a preference for example-based explanations, and challenges in comprehending explanations that significantly shift the image’s theme. Moreover, findings suggest users benefit from a limited set of concurrent explanations. Our main contributions include a curated dataset for evaluating T2I explainability techniques, insights from a comprehensive AMT user study, and observations critical for future T2I model explainability research.
While writing peer reviews is an important task in science, education, and large organizations, providing fruitful suggestions to peers is not straightforward, as different user interaction designs of text suggestion interfaces can have diverse effects on user behavior when writing review text. Generative language models might be able to support humans in formulating reviews with textual suggestions. Previous systems use two designs for providing text suggestions, but do not empirically evaluate them: inline suggestions and a list of suggestions. To investigate the effects of embedding NLP text generation models in the two designs, we collected user requirements to implement Hamta as an example of assistants providing reviewers with text suggestions. Our experiment comparing the two designs with 31 participants indicates that people using the inline interface provided longer reviews on average, while participants using the list of suggestions found our tool easier to use. The results shed light on important design findings for embedding text generation models in user-centered assistants.
Group decision making plays a crucial role in our complex and interconnected world. The rise of AI technologies has the potential to provide data-driven insights to facilitate group decision making, although groups do not always utilize AI assistance appropriately. In this paper, we aim to examine whether and how the introduction of a devil’s advocate in AI-assisted group decision making processes could help groups better utilize AI assistance and change the perceptions of group processes during decision making. Inspired by the exceptional conversational capabilities exhibited by modern large language models (LLMs), we design four different styles of devil’s advocate powered by LLMs, varying their interactivity (i.e., interactive vs. non-interactive) and their target of objection (i.e., challenging the AI recommendation or the majority opinion within the group). Through a randomized human-subject experiment, we find evidence suggesting that LLM-powered devil’s advocates that argue against the AI model’s decision recommendation have the potential to promote groups’ appropriate reliance on AI. Meanwhile, the introduction of an LLM-powered devil’s advocate usually does not lead to substantial increases in people’s perceived workload for completing the group decision making tasks, while interactive LLM-powered devil’s advocates are perceived as more collaborative and of higher quality. We conclude by discussing the practical implications of our findings.
Peer review is a cornerstone of science. Research communities conduct peer reviews to assess contributions and to improve the overall quality of science work. Every year, new community members are recruited as peer reviewers for the first time. How could technology help novices adhere to their community’s practices and standards for peer reviewing? To better understand peer review practices and challenges, we conducted a formative study with 10 novices and 10 experts. We found that many experts adopt a workflow of annotating, note-taking, and synthesizing notes into well-justified reviews that align with community standards. Novices lack timely guidance on how to read and assess submissions and how to structure paper reviews. To support the peer review process, we developed ReviewFlow – an AI-driven workflow that scaffolds novices with contextual reflections to critique and annotate submissions, in-situ knowledge support to assess novelty, and notes-to-outline synthesis to help align peer reviews with community expectations. In a within-subjects experiment, 16 inexperienced reviewers wrote reviews in two conditions: using ReviewFlow and using a baseline environment with minimal guidance. With ReviewFlow, participants produced more comprehensive reviews, identifying more pros and cons. However, they still struggled to provide actionable suggestions to address the weaknesses. While participants appreciated the streamlined process support from ReviewFlow, they also expressed concerns about using AI as part of the scientific review process. We discuss the implications of using AI to scaffold the peer review process on scientific work and beyond.
In settings where users both need high accuracy and are time-pressured, such as doctors working in emergency rooms, we want to provide AI assistance that both increases decision accuracy and reduces decision-making time. Current literature focuses on how users interact with AI assistance when there is no time pressure, finding that different AI assistances have different benefits: some can reduce time taken while increasing overreliance on AI, while others do the opposite. The precise benefit can depend on both the user and task. In time-pressured scenarios, adapting when we show AI assistance is especially important: relying on the AI assistance can save time, and can therefore be beneficial when the AI is likely to be right. We would ideally adapt what AI assistance we show depending on various properties (of the task and of the user) in order to best trade off accuracy and time. We introduce a study where users have to answer a series of logic puzzles. We find that time pressure affects how users use different AI assistances, making some assistances more beneficial than others when compared to no-time-pressure settings. We also find that a user’s overreliance rate is a key predictor of their behavior: overreliers and not-overreliers use different AI assistance types differently. We find marginal correlations between a user’s overreliance rate (which is related to the user’s trust in AI recommendations) and their personality traits (Big Five personality traits). Overall, our work suggests that AI assistances have different accuracy-time tradeoffs when people are under time pressure compared to no time pressure, and we explore how we might adapt AI assistances in this setting.
AI systems have been known to amplify biases in real-world data. Explanations may help human-AI teams address these biases for fairer decision-making. Typically, explanations focus on salient input features. If a model is biased against some protected group, explanations may include features that demonstrate this bias, but when biases are realized through proxy features, the relationship between this proxy feature and the protected one may be less clear to a human. In this work, we study the effect of the presence of protected and proxy features on participants’ perception of model fairness and their ability to improve demographic parity over an AI alone. Further, we examine how different treatments—explanations, model bias disclosure and proxy correlation disclosure—affect fairness perception and parity. We find that explanations help people detect direct but not indirect biases. Additionally, regardless of bias type, explanations tend to increase agreement with model biases. Disclosures can help mitigate this effect for indirect biases, improving both unfairness recognition and decision-making fairness. We hope that our findings can help guide further research into advancing explanations in support of fair human-AI decision-making.
Manipulative design in user interfaces (conceptualized as dark patterns) has emerged as a significant impediment to the ethical design of technology and a threat to user agency and freedom of choice. While previous research focused on exploring these patterns in the context of graphical user interfaces, the impact of speech has largely been overlooked. We conducted a listening test (N = 50) to elicit participants’ preferences regarding different synthetic voices that varied in terms of synthesis method (concatenative vs. neural) and prosodic qualities (speech pace and pitch variance), and then evaluated their impact in an online decision-making study (N = 101). Our results indicate a significant effect of voice qualities on participants’ choices, independently of the content of the available options. Our results also indicate that a voice’s perceived engagement, ease of understanding, and domain fit directly translate to its impact on participants’ behavior in decision-making tasks.
While human-centered approaches to machine learning explore various human roles within the interaction loop, the notion of Interactive Machine Teaching (IMT) emerged with a focus on leveraging the teaching skills of humans as teachers to build machine learning systems. However, most systems and studies are devoted to single users. In this article, we study collaborative interactive machine teaching in the context of image classification to analyze how people can structure the teaching process collectively and to understand their experience. Our contributions are threefold. First, we developed a web application called TeachTOK that enables groups of users to curate data and train a model together incrementally. Second, we conducted a study in which ten participants were divided into three teams that competed to build an image classifier in nine days. Third, qualitative results of participants’ discussions in focus groups reveal the emergence of collaboration patterns in the machine teaching task, how collaboration helps revise teaching strategies, and participants’ reflections on their interaction with the TeachTOK application. From these findings, we provide implications for the design of more interactive, collaborative, and participatory machine learning-based systems.
Label set construction—deciding on a group of distinct labels—is an essential stage in building a supervised machine learning (ML) application, as a badly designed label set negatively affects subsequent stages, such as training dataset construction, model training, and model deployment. Despite its significance, it is challenging for ML practitioners to come up with a well-defined label set, especially when no external references are available. Through our formative study (n=8), we observed that even with the help of external references or domain experts, ML practitioners still need to go through multiple iterations to gradually improve the label set. In this process, there exist challenges in collecting helpful feedback and utilizing it to make optimal refinement decisions. To address these challenges, we present DynamicLabels, a system that supports a more informed label set-building process with crowd feedback. Crowd workers provide annotations and label suggestions for the ML practitioner’s label set, and the ML practitioner can review the feedback through multi-aspect analysis and refine the label set with crowd-made labels. Through a within-subjects study (n=16) using two datasets, we found that DynamicLabels enables better understanding and exploration of the collected feedback and supports a more structured and flexible refinement process. The crowd feedback helped ML practitioners explore diverse perspectives, spot current weaknesses, and select from crowd-generated labels. Metrics and label suggestions in DynamicLabels helped in obtaining a high-level overview of the feedback, gaining assurance, and surfacing conflicts and edge cases that could otherwise have been overlooked.
To compare and select machine learning models, relying on performance measures alone may not always be sufficient. This is particularly the case where different subsets, features, and predicted results may vary in importance relative to the task at hand. Explanation and visualization techniques are required to support model sensemaking and informed decision-making. However, a review shows that most existing systems are designed for model developers and have not been evaluated for their effectiveness with target users. To address this issue, this research proposes an interactive visualization, VMS (Visualization for Model Sensemaking and Selection), for users of the model to compare and select predictive models. VMS integrates performance-, instance-, and feature-level analysis to evaluate models from multiple angles. In particular, a feature view integrating the values and contributions of hundreds of features supports model comparison at local and global scales. We exemplified VMS for comparing models predicting patients’ hospital length of stay from time-series health records and evaluated the prototype with 16 participants from the medical field. Results reveal evidence that VMS supports users in rationalizing models in multiple ways and enables users to select the optimal models with a small sample size. User feedback suggests future directions for incorporating domain knowledge in model training, such as treating different sets of features as important for different patient groups.
Users with large domain knowledge can be reluctant to use prediction models. This also applies to the sports domain, where running coaches rarely rely on marathon prediction tools for race-plan advice for their runners’ next marathon. This paper studies the effect of adding interactivity to such prediction models, to incorporate and acknowledge users’ domain knowledge. In think-aloud sessions and an online study, we tested an interactive machine learning tool that allowed coaches to indicate the importance of earlier races feeding into the model. Our results show that coaches deploy rich knowledge when working with the model on runners familiar to them, and their adaptations improved model accuracy. Those coaches who could interact with the model displayed more trust and acceptance in the resulting predictions.
Representations of AI agents in user interfaces and robotics are predominantly White, not only in terms of facial and skin features, but also in the synthetic voices they use. In this paper we explore some unexpected challenges in the representation of race we found in the process of developing a U.S. English Text-to-Speech (TTS) system intended to sound like an educated, professional, regional accent-free African American woman. The paper starts by presenting the results of focus groups with African American IT professionals, where guidelines and challenges for the creation of a representative and appropriate TTS system were discussed and gathered, followed by a discussion of some of the technical difficulties faced by the TTS system developers. We then describe two studies with U.S. English speakers in which the participants were not able to attribute the correct race to the African American TTS voice while overwhelmingly correctly recognizing the race of a White TTS system of similar quality. A focus group with African American IT workers not only confirmed the representativeness of the African American voice we built, but also suggested that the surprising recognition results may have been caused by non-African Americans’ inability, or latent prejudice, to associate educated, non-vernacular, professional-sounding voices with African American people.
As machine learning (ML) gains wider adoption in real-world applications, the validation of ML models becomes fundamental for their productization, particularly in safety-critical applications. Recently, data slice finding has emerged as a popular method for validating ML models, but it requires additional metadata or cross-modal embeddings for the slices to be interpretable. We propose ConceptSlicer, an integrated workflow that facilitates the slicing of computer vision models using visual concepts. This approach breaks down the image dataset into interpretable visual concepts, which serve as metadata in the slice finding process. Our system offers insights into model issues and enables a deeper understanding of computer vision models’ strengths and weaknesses. We evaluate ConceptSlicer through interviews with eight domain experts and machine learning practitioners, and fine-tune the ML models based on their feedback. Our study also highlights varied attitudes towards large foundation models, encouraging contemplation of the challenges and opportunities presented by this technological advancement.
Large Language Model (LLM) assistants, such as ChatGPT, have emerged as potential alternatives to search methods for helping users navigate complex, feature-rich software. LLMs use vast training data from domain-specific texts, software manuals, and code repositories to mimic human-like interactions, offering tailored assistance, including step-by-step instructions. In this work, we investigated LLM-generated software guidance through a within-subject experiment with 16 participants and follow-up interviews. We compared a baseline LLM assistant with an LLM optimized for particular software contexts, SoftAIBot, which also offered guidelines for constructing appropriate prompts. We assessed task completion, perceived accuracy, relevance, and trust. Surprisingly, although SoftAIBot outperformed the baseline LLM, our results revealed no significant difference in LLM usage and user perceptions with or without prompt guidelines and the integration of domain context. Most users struggled to understand how the prompt’s text related to the LLM’s responses and often followed the LLM’s suggestions verbatim, even if they were incorrect. This resulted in difficulties when using the LLM’s advice for software tasks, leading to low task completion rates. Our detailed analysis also revealed that users remained unaware of inaccuracies in the LLM’s responses, indicating a gap between their lack of software expertise and their ability to evaluate the LLM’s assistance. With the growing push for designing domain-specific LLM assistants, we emphasize the importance of incorporating explainable, context-aware cues into LLMs to help users understand prompt-based interactions, identify biases, and maximize the utility of LLM assistants.
With the increasing prevalence of automatic decision-making systems, concerns regarding the fairness of these systems also arise. Without a universally agreed-upon definition of fairness, given an automated decision-making scenario, researchers often adopt a crowdsourced approach to solicit people’s preferences across multiple fairness definitions. However, it is often found that crowdsourced fairness preferences are highly context-dependent, making it intriguing to explore the driving factors behind these preferences. One plausible hypothesis is that people’s fairness preferences reflect their perceived risk levels for different decision-making mistakes, such that the fairness definition that equalizes across groups the type of mistakes that are perceived as most serious will be preferred. To test this conjecture, we conduct a human-subject study (N = 213) to study people’s fairness perceptions in three societal contexts. In particular, these three societal contexts differ on the expected level of risk associated with different types of decision mistakes, and we elicit both people’s fairness preferences and risk perceptions for each context. Our results show that people can often distinguish between different levels of decision risks across different societal contexts. However, we find that people’s fairness preferences do not vary significantly across the three selected societal contexts, except for within a certain subgroup of people (e.g., people with a certain racial background). As such, we observe minimal evidence suggesting that people’s risk perceptions of decision mistakes correlate with their fairness preference. These results highlight that fairness preferences are highly subjective and nuanced, and they might be primarily affected by factors other than the perceived risks of decision mistakes.
In the process of evaluating competencies for job or student recruitment through material screening, decision-makers can be influenced by inherent cognitive biases, such as the screening order or anchoring information, leading to inconsistent outcomes. To tackle this challenge, we conducted interviews with seven experts to understand their challenges and needs for support in the screening process. Building on their insights, we introduce BiasEye, a bias-aware real-time interactive material screening visualization system. BiasEye enhances awareness of cognitive biases by improving information accessibility and transparency. It also aids users in identifying and mitigating biases through a machine learning (ML) approach that models individual screening preferences. Findings from a mixed-design user study with 20 participants demonstrate that, compared to a baseline system lacking our bias-aware features, BiasEye increases participants’ bias awareness and boosts their confidence in making final decisions. Finally, we discuss the potential of ML and visualization in mitigating biases during human decision-making tasks.
Preference-based learning aims to align robot task objectives with human values. One of the most common methods to infer human preferences is pairwise comparison of robot task trajectories. Traditional comparison-based preference labeling systems seldom help labelers digest and identify critical differences between complex trajectories recorded in videos. Our formative study (N = 12) suggests that individuals may overlook non-salient task features and establish biased preference criteria during their preference elicitation process because of partial observations. In addition, they may experience mental fatigue when given many pairs to compare, causing their label quality to deteriorate. To mitigate these issues, we propose FARPLS, a Feature-Augmented Robot trajectory Preference Labeling System. FARPLS highlights potential outliers in a wide variety of task features that matter to humans and extracts the corresponding video keyframes for easy review and comparison. It also dynamically adjusts the labeling order according to users’ familiarity with the trajectories, the difficulty of each trajectory pair, and the level of disagreement among labelers. At the same time, the system monitors labelers’ consistency and provides feedback on labeling progress to keep labelers engaged. A between-subjects study (N = 42, 105 pairs of robot pick-and-place trajectories per person) shows that FARPLS can help users establish preference criteria more easily and notice more relevant details in the presented trajectories than the conventional interface. FARPLS also improves labeling consistency and engagement, mitigating challenges in preference elicitation without significantly increasing cognitive load.
Although recent developments in generative AI have greatly enhanced the capabilities of conversational agents such as Google’s Bard or OpenAI’s ChatGPT, it is unclear whether the usage of these agents aids users across various contexts. To better understand how access to conversational AI affects productivity and trust, we conducted a mixed-methods, task-based user study, observing 76 software engineers as they completed a programming exam with and without access to Bard. Effects on performance, efficiency, satisfaction, and trust vary depending on user expertise, question type (open-ended "solve" questions vs. definitive "search" questions), and measurement type (demonstrated vs. self-reported). Our findings include evidence of automation complacency, increased reliance on the AI over the course of the task, and increased performance for novices on “solve”-type questions when using the AI. We discuss common behaviors, design recommendations, and impact considerations to improve collaborations with conversational AI.
Large language models (LLMs) with chat-based capabilities, such as ChatGPT, are widely used in various workflows. However, due to a limited understanding of these large-scale models, users struggle to use this technology and experience different kinds of dissatisfaction. Researchers have introduced several methods, such as prompt engineering, to improve model responses. However, these methods focus on enhancing the model’s performance on specific tasks, and little has been investigated on how to deal with the user dissatisfaction resulting from the model’s responses. Therefore, with ChatGPT as the case study, we examine users’ dissatisfaction along with their strategies to address it. After organizing users’ dissatisfaction with LLMs into seven categories based on a literature review, we collected 511 instances of unsatisfactory ChatGPT responses from 107 users, along with their detailed recollections of the unsatisfactory experiences, which we released as a publicly accessible dataset. Our analysis reveals that users most frequently experience dissatisfaction when ChatGPT fails to grasp their intentions, while they rate the severity of accuracy-related dissatisfaction the highest. We also identified four tactics users employ to address their dissatisfaction and assessed their effectiveness. We found that users often do not use any tactics to address their dissatisfaction, and even when tactics are used, 72% of dissatisfaction remains unresolved. Moreover, we found that users with little knowledge of LLMs tend to face more accuracy-related dissatisfaction while often putting minimal effort into addressing it. Based on these findings, we propose design implications for minimizing user dissatisfaction and enhancing the usability of chat-based LLMs.
Data annotation interfaces predominantly leverage ground truth labels to guide annotators toward accurate responses. With the growing adoption of Artificial Intelligence (AI) in domain-specific professional tasks, it has become increasingly important to help beginning annotators identify how their early-stage knowledge can lead to inaccurate answers, which in turn, helps to ensure quality annotations at scale. To investigate this issue, we conducted a formative study involving eight individuals from the field of disaster management, each possessing varying levels of expertise. The goal was to understand the prevalent factors contributing to disagreements among annotators when classifying Twitter messages related to disasters and to analyze their respective responses. Our analysis identified two primary causes of disagreement between expert and beginner annotators: 1) a lack of contextual knowledge or uncertainty about the situation, and 2) the absence of visual or supplementary cues. Based on these findings, we designed a Context interface, which generates aids that help beginners identify potential mistakes and provide the hidden context of the presented tweet. The summative study compares Context design with two widely used designs in data annotation UI, Highlight and Reasoning-based interfaces. We found significant differences between these designs in terms of attitudinal and behavioral data. We conclude with implications for designing future interfaces aiming at closing the knowledge gap among annotators.
PDF catalogs contain substantial unannotated data, necessitating extensive manual labeling efforts. To address this issue, we introduce PDFChatAnnotator, a human-LLM collaborative tool for collecting multi-modal data from PDF catalogs. PDFChatAnnotator first automatically applies our proposed multi-modal binding rules to link related data across modalities and harnesses the information extraction capabilities of large language models (LLMs) to extract specific information from text descriptions. Furthermore, the tool empowers users to guide and refine the LLM’s annotations: during the annotation process, users can influence the LLM through multiple rounds of communication and example establishment via the provided interfaces. To assess the effectiveness of PDFChatAnnotator’s techniques, we conducted a technical evaluation using three catalogs with typical layouts as experimental data. The results showed that all accuracy rates for multi-modal binding exceeded 90%, and that both the proposed "example establishment" and "interactive adjustment of requirements" contributed to higher accuracy.
Artificial Intelligence (AI) has been enhancing data analysis efficiency and accuracy in plant phenotyping, which is vital for tackling global agricultural and environmental challenges. Designing a reliable AI system to assist precise plant phenotyping begins with high-quality phenotypic feature annotation, which usually involves collaboration between plant scientists and AI specialists. However, given the high diversity of these researchers’ backgrounds, they likely have differing needs from a fine-grained plant feature annotation system. We conducted semi-structured interviews with eight experienced annotators from diverse backgrounds, and observed how they interact with their preferred annotation systems, to elucidate the challenges faced when annotating plant features and to identify user needs. We collected qualitative responses to the interview questions and conducted a quantitative evaluation of the agreement of their annotations on the given images. By analyzing the participants’ behaviors and the collected data, we identified common user needs and derived implications for the design of an AI-assisted annotation system, including providing a range of annotation options, the flexibility to adapt annotations, and functions to help address uncertainty. Our research contributes to the design of systems that make annotations efficient and reliable, benefiting not only plant phenotyping but also other interdisciplinary fields that rely on user-driven annotations.
Large-scale 3D point clouds are often used as training data for 3D semantic segmentation, but the labor-intensive nature of the annotation process makes it hard to acquire sufficient labeled data. Meanwhile, limited research has examined how to support novice annotators in producing labeled data by enhancing their annotation performance and user experience. Therefore, in this study, we explored solutions along two dimensions: the presence of AI assistance and the number of classes visualized simultaneously in the model’s segmentation results in human-in-the-loop (HITL) workflows. We conducted a user study with 16 novice annotators who had no prior experience in 3D semantic segmentation, asking them to perform annotation tasks. The results revealed an interaction effect between the two dimensions on annotation accuracy and labeling efficiency. We also found that displaying multiple classes at once reduced the time taken for annotation. Moreover, visualizing multiple classes at once or the absence of AI assistance led to a greater increase in model accuracy compared to our baselines. The best user experience was observed when the visualization showed a single class at a time with AI assistance. Based on these findings, we discuss which environments can enhance novice annotators’ annotation performance and user experience in 3D semantic segmentation tasks within HITL contexts.
Learning qualitative analysis requires personalized feedback and in-depth discussion that educators cannot provide in a large course, resulting in many students obtaining only a shallow exposure to qualitative user research and interpretative skills. To overcome this challenge, we introduce a learnersourcing method that builds on the Dawid-Skene expectation maximization (EM) algorithm to generate peer-based AI hints that support students in one aspect of qualitative analysis: determining which sentences are relevant to the research question. After one annotation round, class-wide annotations are used to predict relevant sentences and to generate hints prompting students to revisit missed or incorrectly annotated sentences. An in-the-wild deployment within a large course (N=122) showed that our algorithm converged to comparatively high accuracy despite noisy student labels, after only ~20 students had annotated. An analysis of student interviews found that peer-based AI hints helped improve understanding of research questions, led to more careful examination of transcript annotations, and improved students’ understanding of when they were over-annotating or under-annotating.
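The Dawid-Skene EM algorithm named above aggregates noisy annotator labels by jointly estimating each item's true label and each annotator's confusion matrix. Below is a minimal sketch for binary relevance labels; the paper's actual implementation is not specified here, so all names and the fully-observed label matrix are illustrative assumptions.

```python
import numpy as np

def dawid_skene(labels, n_iter=50):
    """Dawid-Skene EM for binary labels.
    labels: (n_items, n_annotators) array of 0/1 labels.
    Returns the posterior probability that each item's true label is 1."""
    n_items, n_annot = labels.shape
    q = labels.mean(axis=1)  # initialize posteriors with per-item vote share
    for _ in range(n_iter):
        # M-step: class prior and each annotator's 2x2 confusion matrix.
        prior = q.mean()
        conf = np.full((n_annot, 2, 2), 1e-6)  # conf[a, true, reported]
        for a in range(n_annot):
            for rep in (0, 1):
                hit = labels[:, a] == rep
                conf[a, 1, rep] += q[hit].sum()
                conf[a, 0, rep] += (1 - q[hit]).sum()
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over each item's true label given all annotators.
        log1 = np.log(prior) + sum(np.log(conf[a, 1, labels[:, a]])
                                   for a in range(n_annot))
        log0 = np.log(1 - prior) + sum(np.log(conf[a, 0, labels[:, a]])
                                       for a in range(n_annot))
        q = 1.0 / (1.0 + np.exp(log0 - log1))
    return q
```

Because annotator reliabilities are re-estimated each round, a reliable minority can outvote a noisy majority, which is what lets accuracy stay high despite noisy student labels.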
Object detection tasks are central to the development of datasets and algorithms in computer vision and machine learning. Despite this centrality, object detection remains tedious and time-consuming because of the fine-grained interactions required to draw precise annotations. In this paper, we introduce Snapper, an interactive and intelligent annotation tool that intercepts bounding box annotations as they’re drawn and “snaps” them to nearby object edges in real time. Through a mixed-design user study with 18 full-time annotators, we compare Snapper’s annotation mode to alternative modes of annotation and find that Snapper enables participants to complete object detection tasks 39% more quickly without diminishing annotation quality. Further, we find that participants perceive Snapper as intuitive, trustworthy, and helpful. We conclude by discussing the implications of our findings as they relate to augmenting annotators’ conventions for drawing annotations in practice.
Human-centered AI aims to bridge the gap between machine decision-making and human understanding. However, even for classification tasks where deep neural networks have achieved superb performance, there are currently few methods that link humans and AI well, especially on domain-specific tasks. In this paper, we propose SpaceEditing, a 2D spatial layout tool that enables human users to interact with the latent space of deep neural networks. During the interaction process, the tool’s algorithm automatically processes user actions, providing feedback to the network and leveraging triplet loss to effectively learn from user-modified information. We evaluate SpaceEditing with three case studies: (1) an archaeology researcher uses a bronze dataset; (2) a deep learning researcher uses a garbage classification dataset; (3) six deep learning beginners use a head pose dataset. The experimental results demonstrate the effectiveness of our tool in integrating human knowledge and improving network performance.
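The triplet loss mentioned above is a standard metric-learning objective: an anchor embedding is pulled toward a positive example and pushed away from a negative example by at least a margin. A minimal NumPy sketch follows; SpaceEditing's exact formulation and distance function may differ, so treat this as an illustrative stand-in.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: encourage d(anchor, positive) + margin <= d(anchor, negative),
    using squared Euclidean distance over embedding vectors."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)
```

In a layout-editing setting like SpaceEditing's, a triplet could pair a user-moved point (anchor) with a sample the user placed nearby (positive) and one they dragged away (negative); a zero loss means that constraint is already satisfied in the latent space.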
In a world driven by data visualization, ensuring the inclusive accessibility of charts for Blind and Visually Impaired (BVI) individuals remains a significant challenge. Charts are usually presented as raster graphics without the textual and visual metadata needed for an equivalent exploration experience for BVI people. Additionally, converting these charts into accessible formats requires considerable effort from sighted individuals. Digitizing charts with metadata extraction is just one aspect of the issue; transforming them into accessible modalities, such as tactile graphics, presents another difficulty. To address these disparities, we propose Chart4Blind, an intelligent user interface that converts bitmap image representations of line charts into universally accessible formats. Chart4Blind achieves this transformation by generating Scalable Vector Graphics (SVG), Comma-Separated Values (CSV), and alternative text exports, all complying with established accessibility standards. Through interviews and a formal user study, we demonstrate that even inexperienced sighted users can make charts accessible in an average of 4 minutes using Chart4Blind, achieving a System Usability Scale rating of 90%. In comparison to existing approaches, Chart4Blind provides a comprehensive solution, generating end-to-end accessible SVGs suitable for assistive technologies such as embossed prints (paper and laser-cut), 2D tactile displays, and screen readers. For additional information, including open-source code and demos, please visit our project page https://moured.github.io/chart4blind/.
Informational videos serve as a crucial source for explaining conceptual and procedural knowledge to novices and experts alike. When producing informational videos, editors overlay text/images or trim footage to enhance the video quality and make it more engaging. However, video editing can be difficult and time-consuming, especially for novice video editors who often struggle with expressing and implementing their editing ideas. To address this challenge, we first explored how multimodality—natural language (NL) and sketching, which are natural modalities humans use for expression—can support video editors in expressing video editing ideas. We gathered 176 multimodal expressions of editing commands from 10 video editors, which revealed patterns in the use of NL and sketching to describe edit intents. Based on the findings, we present ExpressEdit, a system that enables editing videos via NL text and sketching on the video frame. Powered by LLMs and vision models, the system interprets (1) temporal, (2) spatial, and (3) operational references in an NL command, as well as spatial references from sketching. The system implements the interpreted edits, which the user can then iterate on. An observational study (N=10) showed that ExpressEdit enhanced the ability of novice video editors to express and implement their edit ideas. The system allowed participants to perform edits more efficiently and generate more ideas, as it produced edits based on users’ multimodal edit commands and supported iteration on those commands. This work offers insights into the design of future multimodal interfaces and AI-based pipelines for video editing.
Display issues, often arising from design inconsistencies or software problems, can have a significant impact on both user experience and system functionality. This study focuses on three primary challenges in the field of display issues: the absence of a standardized classification system, the limitations of existing detection tools, and the inadequacy of available data. To systematically address these challenges, we introduce a comprehensive Display Issue Analysis Framework (DIS). Utilizing this framework, we construct a comprehensive and industry-validated taxonomy for display issues. When evaluating the capabilities of existing detection tools and the completeness of available data against this taxonomy, we find that current mainstream tools can identify only 77% of the cataloged display issues. This finding suggests that, although the field has received some attention from the industry, there is still room for further improvement and research. This study not only deepens our understanding of the classification of display issues and the capabilities of detection tools, but also provides valuable insights for future research and applications in this domain.
When individuals talk while performing multiple tasks at the same time, it is easy to miss parts of a conversation, misinterpret subsequent statements, or have difficulty following the conversation. In this work, we aim to identify statements that, if missed, may lead to misinterpretation of the subsequent statement, and to prevent communication discrepancies. Although there have been several attempts to present images and text that provide topics to support conversation, there is currently no system that supports conversation by taking interpretability into account. We propose SCAINs Presenter, a conversation support system that presents Statements Crucial for Awareness of Interpretive Nonsense (SCAINs): statements that are important for interpreting other sentences, extracted by reproducing the interpretations of those who missed part of the conversation and those who did not. The unique point of the SCAINs Presenter is that it displays extracted sentences that influence the context of the subsequent dialogue, taking their interpretability into account. In particular, since SCAINs are sentences whose absence may cause misinterpretation of the subsequent dialogue, the SCAINs Presenter helps users become aware of possible conversation gaps arising from such misinterpretation. Our experiments show that when SCAINs are omitted, the intention of the following statements often becomes unclear, and the meaning of the following statements changes. We also found that SCAINs capture an aspect distinct from merely important statements. Moreover, the results of case studies in a realistic setting suggest that looking at SCAINs encourages conversation participants to switch their focus from a subtask chat to the ongoing conversation that is their primary task. Our research clarifies the linguistic processing underlying the identification of high-context utterances and demonstrates the effectiveness of using them to support real person-to-person interactions.
Writing and editing mathematical expressions with complicated structures in computer systems is difficult and time-consuming. To address this, we propose MathAssist, a mathematical expression autocomplete technique that recommends complete formulas in real time based on the user’s input strokes. Our technique identifies the user’s input intent by matching the structure of the current input to the structures of formulas in a database. To facilitate this process, we propose a novel tree-based formalization to represent formulas. Compared with a mathematical expression recognition algorithm (SRD) and the commercial Microsoft Ink Equation (InkEqu), our approach outperformed both on task completion time (reduced by 37.14% and 37.58%, respectively) and accuracy (32.78% and 10.55% higher, respectively). We also discuss our findings on using autocomplete to assist formula editing.
Exploratory search is characterized by open-ended search tasks and uncertainty with respect to the clarity of users’ information needs. In the context of image retrieval, generative adversarial networks (GANs) present numerous opportunities for satisfying the information needs of users engaged in exploratory search compared to a collection of images. In this article, we present a novel approach for performing exploratory search on a GAN’s image space using interpretable GAN controls that can be summarized as sample, nudge, and rank. At each search iteration, we sample images from the GAN’s latent space. We implement faceted search by nudging the sampled images towards regions of the latent space containing the attributes associated with selected facets. Lastly, we rank the nudged images using reinforcement learning with relevance feedback. We present a comprehensive evaluation of the proposed approach, incorporating results from simulations and a user study. In simulation, we show that our approach efficiently adapts to user preferences, while preserving a high level of image diversity. In the user study (N=30), a majority of participants (23/30) preferred our system to the baseline. Concordant with simulation results, users reported both higher perceived search efficiency and image diversity compared to the baseline. Indeed, due to the baseline system’s dependence on a warm-start procedure, users of our system examined significantly fewer images while achieving task outcomes of similar subjective quality.
The adoption of recommender systems (RSs) in various domains has become increasingly popular, but concerns have been raised about their lack of transparency and interpretability. While significant advancements have been made in creating explainable RSs, there is still a shortage of automated approaches that can deliver meaningful and contextual human-centered explanations. To address this gap, numerous researchers have evaluated explanations using human-generated recommendations and explanations. However, such approaches do not scale for real-world systems. Building on recent research that exploits Large Language Models (LLMs) for RSs, we propose leveraging the conversational capabilities of ChatGPT to provide users with personalized, human-like, and meaningful explanations for recommended items. Our paper presents one of the first user studies measuring users’ perceptions of explanations generated by ChatGPT acting as an RS. Regarding recommendations, we assess whether users prefer ChatGPT over random (but popular) recommendations. Concerning explanations, we assess users’ perceptions of personalization, effectiveness, and persuasiveness. Our findings reveal that users tend to prefer ChatGPT-generated recommendations over popular ones. Additionally, personalized rather than generic explanations prove to be more effective when the recommended item is unfamiliar.
Complementing human decision-making with AI advice offers substantial advantages. However, humans do not always trust AI advice appropriately and are overly sensitive to incidental AI errors, even when overall performance is good. Research has yet to uncover how trust declines and recovers over time in repeated human-AI interactions. Our work investigates the consequences of incidental AI errors on (self-reported) trust and participants’ reliance on AI advice. Results from our experiment, in which 208 participants evaluated 14 legal cases before and after receiving algorithmic advice, showed that trust significantly decreased after early and late errors but was rapidly restored in both scenarios. Reliance significantly dropped only for early errors, not for late errors, and was eventually restored in both scenarios. The results suggest that late (compared to early) errors cause less drastic trust loss and allow quicker recovery. These findings align with an interpretation in which humans build up trust over time if a system performs well, making them more tolerant of incidental AI errors.
Statistical statements that refer to data to support narratives or claims are commonly used to inform readers about the magnitude of social issues. While contextualizing statistical statements with relevant data supports readers in building their own interpretation of statements, the complexity of finding contextual information on the web and linking statistical statements to it impedes readers’ efforts to do so. We present DataDive, an interactive tool for contextualizing statistical statements for readers of online texts. Based on users’ selections of statistical statements, our tool uses an LLM-powered pipeline to generate candidate relevant contexts and poses them as guiding questions to the user as potential contexts for exploration. When the user selects a question, DataDive employs visualizations to further help the user compare and explore contextually relevant data. A technical evaluation shows that DataDive generates important and diverse questions that facilitate exploration around statistical statements and retrieves relevant data for comparison. Moreover, a user study with 21 participants suggests that DataDive helps users explore diverse contexts and become more aware of how statistical data could relate to the text.
Research continuously shows that, despite the wide range of skills developed for Intelligent Personal Assistants (IPAs), users tend to engage with only a small number of them. One reason for this discrepancy is the issue of skill discoverability, which is commonly addressed through conversational recommendations. Current recommendation strategies, however, are limited due to information asymmetry, lack of interactivity, and an underdeveloped understanding of appropriate grouping of available skills. In this paper, we explore opportunities for interactive faceted skill recommendations using voice interfaces. Through an open card sort user study and semi-structured interviews, we identify and describe five facets driving users’ natural grouping of IPA skills (Thematic, Procedural, Cross-system, Environmental, and Recipient), and demonstrate the need for simultaneous support of these facets. We then discuss the implications of these findings for advancing the discoverability of IPA skills through the design of interactive conversational recommendations.
Creating humorous content has been shown to improve an individual’s emotional well-being by decreasing stress, overcoming anxiety, and enhancing interpersonal relationships. However, it is common knowledge that a good sense of humor is not common. In this paper, we propose a natural language processing (NLP) driven collaborative tool based on appropriate incongruity theory to assist novices in writing humorous content. We use cartoon-caption writing as the use case since it is a popular way for people to engage in creating humorous content. The paper describes the design of our co-authoring tool and findings from a two-part user study where (1) 20 participants used our tool to co-author cartoon captions and (2) 66 participants evaluated those captions. Our findings show that the tool helped participants identify incongruous visual elements in the cartoon, supported ideation, and expanded the narrative. As a result, co-authored captions were more frequently rated funnier than those written without the tool. This approach can be adapted to other humor-generation applications, including creative writing, memes, sketch comedy, and advertising.
Adaptive user interfaces (AUIs) can improve user experience by automatically adapting how information and functionality are presented in a user interface. However, the dynamic nature and potentially numerous variations of AUIs make them challenging to author. In this paper, we present a generalized framework for defining adaptation as interpolations between UIs and introduce a computational approach for intelligently generating new variations of a UI from a small set of designs. Based on this approach, we develop FrameKit, an authoring tool with a programming-by-example interface that retains flexibility and control afforded by manual authoring while reducing effort through automatic generation. We demonstrate that FrameKit can support adaptations that typically require domain-specific toolkits, such as those found in context-aware applications, responsive UIs, and ability-based adaptation. We evaluated FrameKit with ten front-end developers, who successfully authored AUIs after a short tutorial session and suggested that FrameKit provides an effective mental model for AUI authoring.
Generating preferred images using generative adversarial networks (GANs) is challenging owing to the high-dimensional nature of latent space. In this study, we propose a novel approach that uses simple user-swipe interactions to generate preferred images for users. To effectively explore the latent space with only swipe interactions, we apply principal component analysis to the latent space of the StyleGAN, creating meaningful subspaces. We use a multi-armed bandit algorithm to decide the dimensions to explore, focusing on the preferences of the user. Experiments show that our method is more efficient in generating preferred images than the baseline methods. Furthermore, changes in preferred images during image generation or the display of entirely different image styles were observed to provide new inspirations, subsequently altering user preferences. This highlights the dynamic nature of user preferences, which our proposed approach recognizes and enhances.
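The pipeline described above, PCA over latent samples to obtain interpretable directions, then a bandit that chooses which direction to explore from swipe feedback, can be illustrated with a minimal sketch. Random vectors stand in for real StyleGAN latents, an epsilon-greedy bandit stands in for the paper's (unspecified) bandit algorithm, and a simulated user replaces real swipes; every name below is an assumption, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in latents: PCA (via SVD of the centered sample) yields candidate
# directions. A real system would use actual StyleGAN latent vectors.
latents = rng.normal(size=(500, 32))
centered = latents - latents.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)

n_arms = 8                    # explore the top-8 principal directions
counts = np.zeros(n_arms)
values = np.zeros(n_arms)     # running mean swipe reward per direction

def choose_arm(eps=0.2):
    """Epsilon-greedy bandit: mostly exploit the best-rated direction."""
    if rng.random() < eps:
        return int(rng.integers(n_arms))
    return int(np.argmax(values))

def update(arm, reward):
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Pull every arm once, then run simulated swipe feedback: the simulated
# user happens to like movement along direction 3 (an arbitrary choice).
liked = lambda arm: 1.0 if arm == 3 else 0.0
for arm in range(n_arms):
    update(arm, liked(arm))

z = latents.mean(axis=0)
for _ in range(100):
    arm = choose_arm()
    candidate = z + 0.5 * components[arm]   # step along that direction
    reward = liked(arm)                     # swipe right / swipe left
    update(arm, reward)
    if reward:
        z = candidate                       # keep the preferred image's latent
```

The bandit concentrates pulls on the direction the user rewards while occasional exploration surfaces other styles, which mirrors the observation that unexpected images can shift user preferences mid-session.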
Despite significant historical progress, discrimination and social stigma continue to impact the lives of LGBTQIA+ individuals. The use of AI-generated virtual characters offers a unique opportunity to facilitate advocacy by engaging individuals in simulated conversations that can foster understanding, education, and empathy. This paper explores the potential of AI simulations to help individuals practice LGBTQIA+ advocacy, while also acknowledging the need for ethical considerations and addressing concerns about oversimplification or perpetuation of stereotypes. By combining technological innovation with a commitment to inclusivity, we aim to contribute to the ongoing struggle for equality in both the legal framework and the hearts and minds of the community. We present a study evaluating virtual characters driven by generative conversational AI simulating the social interactions surrounding “coming out of the closet”, a rite of passage associated with LGBTQIA+ communities. In our study, virtual characters embodied as queer individuals engage with users in a text-based conversation simulation paired with visual representations. We investigate how the interactions between the virtual characters and a user influence the user’s comfort, confidence, empathy and sympathy. The AI simulation includes distinct visual personas deployed in a series of conditions. We present findings from our deployments involving 307 users. Finally, we discuss the design implications of our work on the potential future of embodied, self-actuated and openly LGBTQIA+ intelligent agents.
Video creation has become increasingly popular, yet the expertise and effort required for editing often pose barriers to beginners. In this paper, we explore the integration of large language models (LLMs) into the video editing workflow to reduce these barriers. Our design vision is embodied in LAVE, a novel system that provides LLM-powered agent assistance and language-augmented editing features. LAVE automatically generates language descriptions for the user’s footage, serving as the foundation for enabling the LLM to process videos and assist in editing tasks. When the user provides editing objectives, the agent plans and executes relevant actions to fulfill them. Moreover, LAVE allows users to edit videos through either the agent or direct UI manipulation, providing flexibility and enabling manual refinement of agent actions. Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE’s effectiveness. The results also shed light on user perceptions of the proposed LLM-assisted editing paradigm and its impact on users’ creativity and sense of co-creation. Based on these findings, we propose design implications to inform the future development of agent-assisted content editing.
Pattern-recognition-based arm prostheses rely on recognizing muscle activation to trigger movements. The effectiveness of this approach depends not only on the performance of the machine learner but also on the user’s understanding of its recognition capabilities, allowing them to adapt and work around recognition failures. We investigate how different model training strategies for selecting gesture classes and recording the respective muscle contractions impact model accuracy and user comprehension. We report on a lab experiment where participants performed hand gestures to train a classifier under three conditions: (1) the system cues gesture classes randomly (control), (2) the user selects gesture classes (teacher-led), (3) the system queries gesture classes based on their separability (learner-led). After training, we compare the models’ accuracy and test participants’ predictive understanding of the prosthesis’ behavior. We found that the teacher-led and learner-led strategies yield faster and greater performance increases, respectively. Combining two evaluation methods, we found that participants developed a more accurate mental model when the system queried the least separable gesture class (learner-led). We conclude that, in the context of machine-learning-based myoelectric prosthesis control, guiding the user to focus on class separability during training can improve recognition performance and support users’ mental models of the system’s behavior. We discuss our results in light of several research fields: myoelectric prosthesis control, motor learning, human-robot interaction, and interactive machine teaching.
Nonspeaking autistic individuals ("nonspeakers") represent about one-third of the autistic population, and most are never provided with an effective alternative to speech, hindering their educational, employment, and social opportunities. Some individuals have learned to spell words and sentences by pointing to letters on a physical letterboard held vertically in their field of view by a trained human assistant. While this method is effective, nonspeakers have expressed to us a desire to transition towards a more independent communication method that relies less on a human assistant, which would provide them with more autonomy and privacy. Augmented Reality (AR) based communication systems have the potential to address this objective. For example, an AR-based communication system can lessen the reliance on a human assistant by employing a virtual letterboard that is automatically and adaptively placed in a personalized manner that considers a given user’s unique motor skills and movement patterns. In this paper, we explore the use of Behavioural Cloning (BC) to derive such a personalized placement policy. Specifically, we observe finger, palm, head, and physical letterboard poses during real-life interactions between a nonspeaker and their assistant. These observations are then used to train a BC Machine Learning (ML) model that can adapt the placement of a virtual letterboard for that user within an AR environment. Results show that our approach can accurately replicate the actions of the human assistant of any given user, outperforming a non-ML baseline personalized placement policy in both positional and rotational accuracies. This work represents a foundational step toward enabling more autonomous and private communications for nonspeakers, thereby opening up new opportunities for them.
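At its core, behavioural cloning is supervised imitation: learn a mapping from an observed state to the action the demonstrator took in that state. The paper trains an ML model on finger, palm, head, and letterboard poses; the sketch below substitutes a non-parametric nearest-neighbor policy over hypothetical, low-dimensional demonstration tuples purely to illustrate the state-to-action structure.

```python
# Behavioural-cloning sketch: imitate the assistant by returning the action
# (letterboard pose) recorded for the most similar previously observed state.
# States and poses are hypothetical stand-ins for the pose features in the paper.

class NearestNeighborPolicy:
    def __init__(self, demonstrations):
        # demonstrations: list of (state, action) pairs from assistant sessions
        self.demonstrations = demonstrations

    def act(self, state):
        def sq_dist(s):
            return sum((a - b) ** 2 for a, b in zip(s, state))
        _, action = min(self.demonstrations, key=lambda sa: sq_dist(sa[0]))
        return action

demos = [
    ((0.0, 0.0, 0.0), ("board_low", 0.0)),    # head down -> board held low
    ((1.0, 1.0, 1.0), ("board_high", 15.0)),  # head up -> board raised, tilted
]
policy = NearestNeighborPolicy(demos)
print(policy.act((0.1, 0.0, 0.1)))
```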
Currently, doctors rely on tools such as the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) and the Scale for the Assessment and Rating of Ataxia (SARA) to diagnose movement disorders based on clinical observations of a patient’s motor movement. Observation-based assessments, however, are inherently subjective and can differ from one rater to another. Moreover, different movement disorders show overlapping symptoms, challenging neurologists to make a correct diagnosis based on visual observation alone. In this work, we create an intelligent interface that highlights, for observing doctors, movements and gestures that are indicative of a movement disorder. First, we analyzed the walking patterns of 43 participants with Parkinson’s Disease (PD), 60 participants with ataxia, and 52 participants with no movement disorder to find ten metrics that can be used to distinguish PD from ataxia. Next, we designed an interface that provides context for the gestures that are relevant to a movement disorder diagnosis. Finally, we surveyed two neurologists (one who specializes in PD and the other who specializes in ataxia) on how useful this interface is for making a diagnosis. Our results not only showcase additional metrics that can be used to evaluate movement disorders quantitatively but also outline steps to be taken when designing an interface for these kinds of diagnostic tasks.
To improve the accessibility of visual figures, the auto-generation of text descriptions for individual images has been studied. However, this cannot be directly applied to comics, as the descriptions become redundant when similar scenes appear in a row. To address this issue, we propose generating descriptions per group of related images and demonstrate how a dense captioning technique for videos can be utilized for this purpose, along with ways to improve its performance. To assess the effectiveness of our approach and to identify factors affecting the quality of text descriptions of comics, we conducted a preliminary study with 3 sighted evaluators and a main user study with 12 participants with visual impairments. The results show that text descriptions generated per group of images are perceived by sighted groups to be better than those generated per image in terms of accuracy, clarity, understandability, length, informativeness, and preference when the annotator is human. Under the same conditions, when the annotator is an AI, the group-based approach performed better in terms of length. Also, people with visual impairments prefer group descriptions because of their conciseness, the smooth connectivity of sentences, and their non-repetitive nature. Based on the findings, we provide design recommendations for generating accessible comic descriptions at scale for blind users.
While 33.6% of college students suffer from mental health problems, only 24.6% of those with symptoms seek professional help, due to personal attitudes or the costs associated with therapy. Psychotherapy chatbots may offer a solution, as they are always available, anonymous, and cost-effective. Research has shown that these chatbots can significantly reduce symptoms of anxiety and depression. However, there is a lack of understanding about the personalization preferences of users and the effects of personalization on health outcomes. To investigate this, we developed a personalizable psychotherapy chatbot designed to provide personalized help. In a randomized controlled trial (n = 54), participants were assigned either to a personalizable condition or to a non-personalizable control condition. After 1 week of usage, participants had a significantly higher therapeutic bond with the personalized version compared to the baseline. In fact, the therapeutic bond was similar to that between a psychologist and their client. This is a promising result, as a high therapeutic bond has been linked to therapeutic success in psychotherapy. Participants reported that the therapy style, personality, and avatar were the most important personalizable aspects of the chatbot. Participants also liked the chatbot’s use of their name and its transparency about what it had learned about them. These features are likely important for establishing a strong therapeutic bond with users. However, the ability to personalize the chatbot had no impact on participants’ usage intentions. This can be explained by the fact that users in both conditions equally reported that the chatbot was able to help them with their mental health. 53 of the participants also indicated that they would be willing to use a psychotherapy chatbot integrated with a human therapist.
These findings indicate the potential of psychotherapy chatbots and the need for further research on their integration with traditional psychotherapy.
Active learning (AL) systems have become increasingly popular for various applications in machine learning (ML), including medical imaging, environmental monitoring, and geospatial analysis. These systems rely on inputs dynamically queried from people to enhance classification. Ensuring appropriate analyst trust in these systems presents a significant obstacle as analyst over- or underreliance may adversely affect a given application. Common AL strategies enhance classification models by asking analysts to provide labels for data points with the highest degree of uncertainty. However, such model-centric policies do not consider potential priming effects on the analyst and how they will affect people’s trust in the system post-training. We present an empirical study assessing how AL query policies and visualizations that enhance transparency in a classifier’s decisions influence trust in automated image classifiers. We found that query policy may significantly influence an analyst’s perception of system capabilities, while the level of visual transparency into classifier certainty may influence an analyst’s ability to perform a classification task. Our study informs the design of interactive labeling systems to help mitigate the effects of overreliance and calibrate appropriate trust in automated systems.
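The model-centric query policy the abstract refers to, uncertainty sampling, selects the unlabeled item whose predicted class probabilities have the highest entropy. A minimal sketch with made-up classifier outputs:

```python
import math

# Uncertainty sampling sketch: query the unlabeled item whose predicted
# class-probability vector has maximum entropy (i.e., is most uncertain).
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def query_most_uncertain(pool_probs):
    """pool_probs: list of per-item class-probability vectors."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))

pool = [[0.95, 0.05], [0.55, 0.45], [0.99, 0.01]]  # hypothetical outputs
print(query_most_uncertain(pool))
```

The second item, with near-even probabilities, is the one the analyst would be asked to label; as the abstract notes, repeatedly surfacing such ambiguous cases may also prime the analyst’s perception of the system.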
The recent explosion in popularity of large language models (LLMs) has inspired learning engineers to incorporate them into adaptive educational tools that automatically score summary writing. Understanding and evaluating LLMs is vital before deploying them in critical learning environments, yet their unprecedented size and expanding number of parameters inhibit transparency and impede trust when they underperform. Through a collaborative user-centered design process with several learning engineers building and deploying summary scoring LLMs, we characterized fundamental design challenges and goals around interpreting their models, including aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods. To address their concerns, we developed iScore, an interactive visual analytics tool for learning engineers to upload, score, and compare multiple summaries simultaneously. Tightly integrated views allow users to iteratively revise the language in summaries, track changes in the resulting LLM scores, and visualize model weights at multiple levels of abstraction. To validate our approach, we deployed iScore with three learning engineers over the course of a month. We present a case study where interacting with iScore led a learning engineer to improve their LLM’s score accuracy by three percentage points. Finally, we conducted qualitative interviews with the learning engineers that revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.
In traffic engineering, cities rely on large detector datasets to manage traffic. Visualizing these big, multi-dimensional datasets poses challenges such as overplotting and dimension reduction, often rendering traditional visualization techniques inadequate. To address this, we added two machine learning (ML) algorithms (the Local Outlier Factor algorithm and K-Prototypes clustering) to an interactive time series visualization to improve exploration by both domain experts and non-experts. We used an original detector dataset from a mid-sized German city. Our findings reveal that the ML algorithms greatly enhanced data exploration in these interactive visualizations, particularly for users with limited domain knowledge. This research directly contributes to the design of traffic data analysis tools, not only offering a foundation for improvements to traffic detection hardware and software but also advancing complex dataset visualization in general. It can ultimately lead to more informed decisions and improved traffic management, and it has the potential to reduce air pollutants, thus counteracting climate change.
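One of the two algorithms named above, the Local Outlier Factor, scores each point by comparing its local density to that of its neighbors. The following is a compact pure-Python rendition on one-dimensional toy data; the actual system would run a library implementation on multi-dimensional detector features, and the readings below are hypothetical.

```python
# Compact Local Outlier Factor (LOF) on 1-D toy data. Scores near 1 mean
# inlier; scores well above 1 mean the point lies in a sparser region than
# its neighbors. Detector readings are hypothetical.

def lof_scores(points, k=2):
    n = len(points)
    dist = lambda a, b: abs(a - b)
    neighbors, k_dist = {}, {}
    for i in range(n):
        ranked = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        k_dist[i] = ranked[k - 1][0]           # distance to the k-th neighbor
        neighbors[i] = [j for _, j in ranked[:k]]

    def reach(i, j):  # reachability distance of point i from neighbor j
        return max(k_dist[j], dist(points[i], points[j]))

    # local reachability density, then the LOF score per point
    lrd = {i: k / sum(reach(i, j) for j in neighbors[i]) for i in range(n)}
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

readings = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0]  # last reading is anomalous
print(lof_scores(readings))
```

In an interactive visualization, high-scoring time windows can then be highlighted so that non-expert users are guided toward anomalous detector behavior.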
Natural language and search interfaces intuitively facilitate data exploration and provide visualization responses to diverse analytical queries based on the underlying datasets. However, these interfaces often fail to interpret more complex analytical intents, such as discerning subtleties and quantifiable differences between terms like “bump” and “spike” in the context of COVID cases, for example. We address this gap by extending the capabilities of a data exploration search interface for interpreting semantic concepts in time series trends. We first create a comprehensive dataset of semantic concepts by mapping quantifiable univariate data trends such as slope and angle to crowdsourced, semantically meaningful trend labels. The dataset contains quantifiable properties that capture the slope-scalar effect of semantic modifiers like “sharply” and “gradually,” as well as multi-line trends (e.g., “peak,” “valley”). We demonstrate the utility of this dataset in SlopeSeeker, a tool that supports natural language querying of quantifiable trends, such as “show me stocks that tanked in 2010.” The tool incorporates novel scoring and ranking techniques based on semantic relevance and visual prominence to present relevant trend chart responses containing these semantic trend concepts. In addition, SlopeSeeker provides a faceted search interface for users to navigate a semantic hierarchy of concepts from general trends (e.g., “increase”) to more specific ones (e.g., “sharp increase”). A preliminary user evaluation of the tool demonstrates that the search interface supports greater expressivity of queries containing concepts that describe data trends. We identify potential future directions for leveraging our publicly available quantitative semantics dataset in other data domains and for novel visual analytics interfaces.
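The slope-to-label mapping described above can be sketched as follows: fit a least-squares slope to a series, convert it to an angle, and bucket the angle into a semantic label. The thresholds and label vocabulary below are hypothetical; the paper derives its labels from crowdsourced annotations, and a real system would normalize the axes before computing an angle.

```python
import math

# Sketch of mapping a univariate trend to a semantic label via its
# least-squares slope. Angle thresholds and labels are hypothetical.

def fit_slope(ys):
    n = len(ys)
    mean_x, mean_y = (n - 1) / 2, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def semantic_label(ys):
    angle = math.degrees(math.atan(fit_slope(ys)))
    if angle > 60:
        return "spiked"
    if angle > 15:
        return "increased sharply"
    if angle > 2:
        return "increased gradually"
    if angle >= -2:
        return "stayed flat"
    if angle >= -15:
        return "decreased gradually"
    if angle >= -60:
        return "decreased sharply"
    return "tanked"

print(semantic_label([10.0, 5.0, 0.0]))
```

A query like “stocks that tanked in 2010” can then be matched against series whose computed label falls in the steep-decrease bucket.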
Low-code programming allows citizen developers to create programs with minimal coding effort, typically via visual (e.g. drag-and-drop) interfaces. In parallel, recent AI-powered tools such as Copilot and ChatGPT generate programs from natural language instructions. We argue that these modalities are complementary: tools like ChatGPT greatly reduce the need to memorize large APIs but still require their users to read (and modify) textual programs, whereas visual tools abstract away most or all program text but struggle to provide easy access to large APIs. At their intersection, we propose LowCoder, the first low-code tool for developing AI pipelines that supports both a visual programming interface (LowCoderVP) and an AI-powered natural language interface (LowCoderNL). We leverage this tool to provide some of the first insights into whether and how these two modalities help programmers by conducting a user study. We task 20 developers with varying levels of AI expertise with implementing four ML pipelines using LowCoder, replacing the LowCoderNL component with a simple keyword search in half the tasks. Overall, we find that LowCoder is especially useful for (i) Discoverability: using LowCoderNL, participants discovered new operators in 75% of the tasks, compared to just 32.5% and 27.5% using web search or scrolling through options respectively in the keyword-search condition, and (ii) Iterative Composition: 82.5% of tasks were successfully completed and many initial pipelines were further successfully improved. Qualitative analysis shows that AI helps users discover how to implement constructs when they know what to do, but still fails to support novices when they lack clarity on what they want to accomplish. Overall, our work highlights the benefits of combining the power of AI with low-code programming.
Large language model (LLM) prompting is a promising new approach for users to create and customize their own chatbots. However, current methods for steering a chatbot’s outputs, such as prompt engineering and fine-tuning, do not support users in converting their natural feedback on the model’s outputs to changes in the prompt or model. In this work, we explore how to enable users to interactively refine model outputs through their feedback, by helping them convert their feedback into a set of principles (i.e. a constitution) that dictate the model’s behavior. From a formative study, we (1) found that users needed support converting their feedback into principles for the chatbot and (2) classified the different principle types desired by users. Inspired by these findings, we developed ConstitutionMaker, an interactive tool for converting user feedback into principles, to steer LLM-based chatbots. With ConstitutionMaker, users can provide either positive or negative feedback in natural language, select auto-generated feedback, or rewrite the chatbot’s response; each mode of feedback automatically generates a principle that is inserted into the chatbot’s prompt. In a user study with 14 participants, we compare ConstitutionMaker to an ablated version, where users write their own principles. With ConstitutionMaker, participants felt that their principles could better guide the chatbot, that they could more easily convert their feedback into principles, and that they could write principles more efficiently, with less mental demand. ConstitutionMaker helped users identify ways to improve the chatbot, formulate their intuitive responses to the model into feedback, and convert this feedback into specific and clear principles. Together, these findings inform future tools that support the interactive critiquing of LLM outputs.
This research explores the integration of a Large Language Model (LLM) fine-tuned to conversationally control the user interface (UI) of a Semantic Automation Layer (SAL). We condense SAL capabilities from prior work and prioritize them with business analysts and data engineers via a Kano model before implementing a prototypical UI. We augment the UI with our conversational engine and propose In-situ Prompt Engineering and Learning from Human Feedback to smooth the interaction with and manipulation of the UI through natural language commands. To evaluate the efficacy and usability of conversational control in various use-case scenarios, we conduct and report on an empirical interaction design user study. Our findings provide evidence of enhanced user engagement and satisfaction. We also observe a significant increase in trust in AI after participants worked with our conversational UI. This work identifies areas for further refinement and research towards more intelligent, highly integrated conversational UIs, even beyond our application within Semantic Automation. We discuss our findings and point out next steps, paving the way for future research and development in creating more intuitive and adaptive user interfaces.
High-quality alt text is crucial for making scientific figures accessible to blind and low-vision readers. Crafting complete, accurate alt text is challenging even for domain experts, as published figures often depict complex visual information and readers have varied informational needs. These challenges, along with high diversity in figure types and domain-specific details, also limit the usefulness of fully automated approaches. Consequently, the prevalence of high-quality alt text is very low in scientific papers today. We investigate whether and how human-AI collaborative editing systems can help address the difficulty of writing high-quality alt text for complex scientific figures. We present FigurA11y, an interactive system that generates draft alt text and provides suggestions for author revisions using a pipeline driven by extracted figure and paper metadata. We test two versions, motivated by prior work on visual accessibility and writing support. The base Draft+Revise version provides authors with an automatically generated draft description to revise, along with extracted figure metadata and figure-specific alt text guidelines to support the revision process. The full Interactive Assistance version further adds contextualized suggestions: text snippets to iteratively produce descriptions, and hypothetical user questions with possible answers to reveal potential ambiguities and resolutions. In a study of authors (N=14), we found the system assisted them in efficiently producing descriptive alt text. Generated drafts and interface elements enabled authors to quickly initiate and edit detailed descriptions. Additionally, interactive suggestions from the full system prompted more iteration and highlighted aspects for authors to consider, resulting in greater deviation from the drafts without increased average cognitive load or manual effort.
Advances in AI, particularly large language models (LLMs), can transform creative work. When developing a new idea, LLMs can help designers gather information, find competitors, and generate alternatives. However, LLM responses tend to be long-winded or contain inaccuracies, placing a burden on users to carefully synthesize information. In our formative studies with 52 students and five instructors, we find that novice designers typically lack guidance on how to compose prompts, reflect critically on LLM responses, and extract key information to help shape an idea. Building on these insights, we explore an alternative approach for interacting with LLMs, not via chat, but rather through structured templates. Collaborative design templates are a well-established strategy for helping novices think, organize information, and reflect on creative work. Developed as a digital whiteboard plugin, Jamplate integrates LLM capabilities into design templates, streamlining the collection and organization of user-generated content and LLM responses within the template structure. In a preliminary study with 8 novice designers, participants expressed that Jamplate’s reflective questions and in-situ guidance improved their ability to think critically and improve ideas more effectively. We discuss the potential of designing LLM-enhanced templates to instigate critical reflection.
Community organizations face challenges in harnessing the power of qualitative data analysis, or sensemaking, to understand the diverse perspectives and needs brought up by their constituents. One of the most time-consuming and tedious parts of sensemaking is qualitative coding, or the process of identifying themes across a large and unstructured corpus of community input. A challenge in qualitative coding is attaining high intercoder reliability, especially between expert and beginner sensemakers. In this work, we present SenseMate, a novel human-AI system designed to help with qualitative coding. SenseMate leverages rationale extraction models, a new machine learning strategy to semi-automate sensemaking, which produces theme recommendations and human-interpretable explanations. The models were trained on a dataset of people’s experiences living in Boston, which was annotated for themes by expert sensemakers. We integrated rationale extraction models into SenseMate through an iterative, human-centered design process revolving around four key design principles derived from an extensive literature review. The design process consisted of three iterations with continuous feedback from seven people associated with community organizations. Through an online experiment involving 180 novice sensemakers, we aimed to determine whether AI-generated recommendations and rationales would decrease coding time, increase intercoder reliability (i.e. Cohen’s kappa), and minimize differences between novice and expert coding decisions (i.e. F-score of participant answers compared to expert gold labels). We found that though the model recommendations and explanations increased coding time by 49 seconds per unit of analysis, they raised intercoder reliability by 29% and coding F-score by 10%. Regarding the effectiveness of SenseMate’s design, participants reported that the platform was generally easy to use. 
In summary, SenseMate (1) is built for beginner sensemakers without a technical background, a user group that prior work has not focused on; (2) implements rationale extraction models to recommend themes and generate explanations, which has advantages over large language models in terms of user privacy and control; and (3) contains original and intuitive features, derived from user feedback, that can be applied to future QDA systems.