publications
Publications, grouped by year in reverse chronological order.
2024
- [THRI] Enabling Social Robots to Perceive and Join Socially Interacting Groups using F-formation: A Comprehensive Overview. Hrishav Bakul Barua, Theint Haythi Mg, Pradip Pramanick, and Chayan Sarkar. ACM Transactions on Human-Robot Interaction, 2024. (Just Accepted)
Social robots in our daily surroundings, like personal guides, waiter robots, home helpers, assistive robots, telepresence/teleoperation robots, etc., are increasing day by day. Their usability and acceptability largely depend on their explicit and implicit interaction capability with fellow human beings. As a result, social behavior is one of the most sought-after qualities that a robot can possess. However, there is no specific aspect and/or feature that defines socially acceptable behavior, and it largely depends on the situation, application, and society. In this article, we investigate one such social behavior for collocated robots. Imagine a group of people interacting with each other, and we want to join the group. We as human beings do it in a socially acceptable manner, i.e., we position ourselves within the group in such a way that we can participate in the group activity without disturbing/obstructing anybody. To possess such a quality, a robot first needs to determine the formation of the group and then determine a position for itself, which we humans do implicitly. There are many theories that study group formations and proxemics; one such theory is F-formation, which could be utilized for this purpose. As the types of formations can be very diverse, detecting social groups is not a trivial task. In this article, we provide a comprehensive survey of the existing work on social interaction and group detection using F-formation for robotics and other applications. We also put forward a novel holistic survey framework combining some of the more important concerns and modules relevant to this problem. We define taxonomies based on methods, camera views, datasets, detection capabilities and scale, evaluation approaches, and application areas. We discuss certain open challenges and limitations in the current literature along with possible future research directions based on this framework. In particular, we discuss the existing methods/techniques, their relative merits and demerits, and their applications, and provide a set of unsolved but relevant problems in this domain.
- [JIRS] Teledrive: An Embodied AI Based Telepresence System. Snehasis Banerjee, Sayan Paul, Ruddradev Roychoudhury, Abhijan Bhattacharya, Chayan Sarkar, Ashis Sau, Pradip Pramanick, and Brojeshwar Bhowmick. Journal of Intelligent & Robotic Systems, 2024.
This article presents ‘Teledrive’, a telepresence robotic system with embodied AI features that empowers an operator to navigate the telerobot in any unknown remote place with minimal human intervention. We conceive Teledrive in the context of democratizing remote ‘care-giving’ for elderly citizens as well as for isolated patients affected by contagious diseases. In particular, this paper focuses on the problem of navigating to a rough target area (like ‘bedroom’ or ‘kitchen’) rather than to pre-specified point destinations. This ushers in a unique ‘AreaGoal’ based navigation feature, which has not been explored in depth in contemporary solutions. Further, we describe an edge computing-based software system built on a WebRTC-based communication framework to realize the aforementioned scheme through easy-to-use speech-based human-robot interaction. Moreover, to enhance the ease of operation for the remote caregiver, we incorporate a ‘person following’ feature, whereby the robot follows a person on the move in its premises as directed by the operator. Additionally, the presented system is loosely coupled with specific robot hardware, unlike existing solutions. We have evaluated the efficacy of the proposed system through baseline experiments, a user study, and real-life deployment.
2023
- [EMNLP] tagE: Enabling an Embodied Agent to Understand Human Instructions. Chayan Sarkar, Avik Mitra, Pradip Pramanick, and Tapas Nayak. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
Natural language serves as the primary mode of communication when an intelligent agent with a physical presence engages with human beings. While a plethora of research focuses on natural language understanding (NLU), encompassing endeavors such as sentiment analysis, intent prediction, question answering, and summarization, the scope of NLU directed at situations necessitating tangible actions by an embodied agent remains limited. The ambiguity and incompleteness inherent in natural language present challenges for intelligent agents striving to decipher human intention. To tackle this predicament head-on, we introduce a novel system known as task and argument grounding for Embodied agents (tagE). At its core, our system employs an inventive neural network model designed to extract a series of tasks from complex task instructions expressed in natural language. Our proposed model adopts an encoder-decoder framework enriched with nested decoding to effectively extract tasks and their corresponding arguments from these intricate instructions. These extracted tasks are then mapped (or grounded) to the robot’s established collection of skills, while the arguments find grounding in objects present within the environment. To facilitate the training and evaluation of our system, we have curated a dataset featuring complex instructions. The results of our experiments underscore the prowess of our approach, as it outperforms robust baseline models.
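The grounding step described in this abstract can be pictured with a minimal sketch, which is not the tagE implementation: extracted task and argument phrases are matched to the robot's skill inventory and to visible objects by embedding similarity. The skill list, object list, model name, and the `ground` helper are illustrative assumptions.

```python
# Minimal sketch (not the tagE implementation): grounding an extracted
# task/argument pair to a robot skill and a visible object by embedding
# similarity. Skill list, object list, and model name are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

ROBOT_SKILLS = ["pick up", "place", "open", "close", "navigate to"]
VISIBLE_OBJECTS = ["coffee mug", "fridge door", "dining table"]

def ground(phrase: str, candidates: list[str]) -> str:
    """Return the candidate whose embedding is closest to the phrase."""
    scores = util.cos_sim(model.encode(phrase), model.encode(candidates))[0]
    return candidates[int(scores.argmax())]

# Example: a task/argument pair extracted from "grab the cup and put it on the table"
print(ground("grab", ROBOT_SKILLS))        # -> "pick up"
print(ground("the cup", VISIBLE_OBJECTS))  # -> "coffee mug"
```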
- [HRI] Utilizing Prior Knowledge to Improve Automatic Speech Recognition in Human-Robot Interactive Scenarios. Pradip Pramanick and Chayan Sarkar. In Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, Stockholm, Sweden, 2023.
The success of human-robot interaction not only depends on a robot’s ability to understand the intent and content of the human utterance but is also impacted by the automatic speech recognition (ASR) system. Modern ASR can provide highly accurate (grammatically and syntactically) transcription. Yet, a general-purpose ASR often misses the semantics of the transcription through incorrect word prediction due to open-vocabulary modeling. ASR inaccuracy can have significant repercussions, as it can lead to a completely different action by the robot in the real world. Can any prior knowledge be helpful in such a scenario? In this work, we explore how prior knowledge can be utilized in ASR decoding. Using our experiments, we demonstrate how our system can significantly improve ASR transcription for robotic task instruction.
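As a rough illustration of the general idea of injecting prior knowledge into ASR decoding (not the paper's actual method), one simple form is rescoring the recognizer's n-best hypotheses with a bonus for terms the robot already knows; the vocabulary, scores, and weight below are invented.

```python
# Illustrative sketch only (not the paper's method): rescoring ASR n-best
# hypotheses with a simple prior over domain terms the robot already knows
# (its skills and the objects/rooms in its map). Scores and weights are made up.
DOMAIN_TERMS = {"kitchen", "bedroom", "mug", "fetch", "charging", "dock"}

def rescore(nbest: list[tuple[str, float]], bonus: float = 2.0):
    """nbest: (hypothesis, acoustic/LM log-score) pairs; add a bonus per known term."""
    def prior_score(hyp: str) -> float:
        return bonus * sum(1 for w in hyp.lower().split() if w in DOMAIN_TERMS)
    return max(nbest, key=lambda h: h[1] + prior_score(h[0]))

hypotheses = [("go to the beach room", -4.1), ("go to the bedroom", -4.6)]
print(rescore(hypotheses))  # the domain prior prefers "go to the bedroom"
```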
2022
- [EMNLP] Can Visual Context Improve Automatic Speech Recognition for an Embodied Agent? Pradip Pramanick and Chayan Sarkar. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Dec 2022.
Automatic speech recognition (ASR) systems are becoming omnipresent, ranging from personal assistants and chatbots to home and industrial automation systems. Modern robots are also equipped with ASR capabilities for interacting with humans, as speech is the most natural interaction modality. However, ASR in robots faces additional challenges compared to a personal assistant. Being an embodied agent, a robot must recognize the physical entities around it and therefore reliably recognize speech containing descriptions of such entities. However, current ASR systems are often unable to do so due to limitations in ASR training, such as generic datasets and open-vocabulary modeling. Also, adverse conditions during inference, such as noise, accented speech, and far-field speech, make the transcription inaccurate. In this work, we present a method to incorporate a robot’s visual information into an ASR system and improve the recognition of a spoken utterance containing a visible entity. Specifically, we propose a new decoder biasing technique to incorporate the visual context while ensuring the ASR output does not degrade for incorrect context. We achieve a 59% relative reduction in WER over an unmodified ASR system.
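A toy sketch of what decoder biasing with visual context can look like (not the paper's model): candidate words that name a currently visible entity receive a small additive bonus in log-probability space, kept modest so that an incorrect visual context is unlikely to override the acoustic evidence. Entity names and scores here are made up.

```python
# Toy decoder-biasing sketch (not the paper's model): boost next-word
# candidates that name a visible entity during beam search scoring.
import math

VISIBLE_ENTITIES = ["red mug", "microwave"]
ENTITY_TOKENS = {t for e in VISIBLE_ENTITIES for t in e.split()}
BIAS = math.log(1.5)  # modest additive bonus in log-probability space

def biased_scores(candidates: dict[str, float]) -> dict[str, float]:
    """candidates: next-word hypotheses with ASR log-probabilities.
    Words that name a visible entity get a small boost; all other scores are unchanged."""
    return {w: lp + (BIAS if w.lower() in ENTITY_TOKENS else 0.0)
            for w, lp in candidates.items()}

print(biased_scores({"mug": -2.3, "mud": -2.1, "bug": -2.5}))  # "mug" now outranks "mud"
```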
- [RAS] Talk-to-Resolve: Combining scene understanding and spatial dialogue to resolve granular task ambiguity for a collocated robot. Pradip Pramanick, Chayan Sarkar, Snehasis Banerjee, and Brojeshwar Bhowmick. Robotics and Autonomous Systems, Dec 2022.
The utility of collocated robots largely depends on an easy and intuitive interaction mechanism with the human. If a robot accepts task instructions in natural language, it first has to understand the user’s intention by decoding the instruction. However, while executing the task, the robot may face unforeseeable circumstances due to variations in the observed scene and therefore requires further user intervention. In this article, we present a system called Talk-to-Resolve (TTR) that enables a robot to initiate a coherent dialogue exchange with the instructor by observing the scene visually to resolve the impasse. Through dialogue, it either finds a cue to move forward in the original plan, an acceptable alternative to the original plan, or affirmation to abort the task altogether. To detect a possible stalemate, we utilize the dense captions of the observed scene and the given instruction jointly to compute the robot’s next action. We evaluate our system on a dataset of initial instruction and situational scene pairs. Our system can identify stalemates and resolve them with appropriate dialogue exchanges with 82% accuracy. Additionally, a user study reveals that the questions from our system are more natural (4.02 on average on a scale of 1 to 5) compared to a state-of-the-art system (3.08 on average).
- [RA-L] DoRO: Disambiguation of Referred Object for Embodied Agents. Pradip Pramanick, Chayan Sarkar, Sayan Paul, Ruddra dev Roychoudhury, and Brojeshwar Bhowmick. IEEE Robotics and Automation Letters, Dec 2022.
Robotic task instructions often involve a referred object that the robot must locate (ground) within the environment. While task intent understanding is an essential part of natural language understanding, less effort has been made to resolve ambiguity that may arise while grounding the task. Existing works use vision-based task grounding and ambiguity detection, suitable for a fixed view and a static robot. However, the problem magnifies for a mobile robot, where the ideal view is not known beforehand. Moreover, a single view may not be sufficient to locate all the object instances in the given area, which leads to inaccurate ambiguity detection. Human intervention is helpful only if the robot can convey the kind of ambiguity it is facing. In this article, we present DoRO (Disambiguation of Referred Object), a system that can help an embodied agent disambiguate the referred object by raising a suitable query whenever required. Given an area where the intended object is, DoRO finds all the instances of the object by aggregating observations from multiple views while exploring and scanning the area. It then raises a suitable query using the information from the grounded object instances. Experiments conducted with the AI2Thor simulator show that DoRO not only detects ambiguity more accurately but also raises verbose queries with more accurate information from the visual-language grounding.
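The multi-view aggregation idea can be sketched as follows; this is a simplification, not DoRO's code. Detections from different viewpoints are merged into world-frame instances, and a disambiguation query is raised only if more than one instance of the referred object survives the merge. The labels, positions, and merge radius are illustrative.

```python
# Simplified sketch of multi-view aggregation and ambiguity querying
# (not DoRO's implementation); distances and labels are invented.
import math

def merge_instances(detections, radius=0.5):
    """detections: list of (label, (x, y)) in a common world frame.
    Greedily merge detections that fall within `radius` of an existing instance."""
    instances = []
    for label, pos in detections:
        for inst in instances:
            if inst["label"] == label and math.dist(inst["pos"], pos) < radius:
                break  # same physical instance seen from another view
        else:
            instances.append({"label": label, "pos": pos})
    return instances

views = [("cup", (1.0, 2.0)), ("cup", (1.1, 2.1)), ("cup", (4.0, 0.5))]
cups = merge_instances(views)
if len(cups) > 1:
    print(f"I can see {len(cups)} cups. Which one do you mean?")
```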
- [COMSNETS] Teledrive: An Intelligent Telepresence Solution for “Collaborative Multi-presence” through a Telerobot. Abhijan Bhattacharyya, Ashis Sau, Ruddra Dev Roychoudhury, Snehasis Banerjee, Chayan Sarkar, Pradip Pramanick, Madhurima Ganguly, Brojeshwar Bhowmick, and B Purushothaman. In 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), Jan 2022.
This paper presents an edge-centric architecture along with a novel communication topology for a practical robotic telepresence solution. The system has been evaluated through real-life experiments. The subjective user experience is quantified through a simple yet effective technique. The efficacy of the protocol is also demonstrated through experiments in a practical deployment.
2020
- [IROS] DeComplex: Task planning from complex natural instructions by a collocating robot. Pradip Pramanick, Hrishav Bakul Barua, and Chayan Sarkar. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2020.
As the number of robots in our daily surroundings, like homes, offices, restaurants, factory floors, etc., increases rapidly, the development of natural human-robot interaction mechanisms becomes more vital, as it dictates the usability and acceptability of the robots. One of the valued features of such a cohabitant robot is that it performs tasks that are instructed in natural language. However, it is not trivial to execute the human-intended tasks, as natural language expressions can have large linguistic variations. Existing works assume that either a single task instruction is given to the robot at a time or an instruction contains multiple independent tasks. However, complex task instructions composed of multiple inter-dependent tasks are not handled efficiently in the literature. There can be an ordering dependency among the tasks, i.e., the tasks have to be executed in a certain order, or there can be an execution dependency, i.e., the input parameter or execution of a task depends on the outcome of another task. Understanding such dependencies in a complex instruction is not trivial if unconstrained natural language is allowed. In this work, we propose a method to find the intended order of execution of multiple inter-dependent tasks given in a natural language instruction. Based on our experiments, we show that our system is very accurate in generating a viable execution plan from a complex instruction.
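A minimal sketch of the ordering idea (not the DeComplex model itself): once ordering and execution dependencies have been extracted from the instruction, a viable execution plan is a topological sort of the resulting task graph. The example instruction and task names below are invented.

```python
# Minimal sketch: ordering inter-dependent sub-tasks via topological sort
# (illustration only, not the DeComplex model).
from graphlib import TopologicalSorter

# "Take the mug from the shelf, wash it, and then bring it to me."
dependencies = {
    "wash(mug)": {"take(mug, shelf)"},   # execution dependency: needs the mug in hand
    "bring(mug, user)": {"wash(mug)"},   # ordering dependency: wash before bringing
}
print(list(TopologicalSorter(dependencies).static_order()))
# -> ['take(mug, shelf)', 'wash(mug)', 'bring(mug, user)']
```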
- [RO-MAN] Let me join you! Real-time F-formation recognition by a socially aware robot. Hrishav Bakul Barua, Pradip Pramanick, Chayan Sarkar, and Theint Haythi Mg. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Aug 2020.
This paper presents a novel architecture to detect social groups in real time from a continuous image stream of an ego-vision camera. F-formation defines social orientations in space where two or more persons tend to communicate in a social place. Thus, essentially, we detect F-formations in social gatherings such as meetings, discussions, etc., and predict the robot’s approach angle if it wants to join the social group. Additionally, we also detect outliers, i.e., persons who are not part of the group under consideration. Our proposed pipeline consists of: a) a skeletal key-point estimator (17 key points in total) for each detected human in the scene, b) a learning model using a CRF (with a feature vector based on the skeletal points) to detect groups of people and outlier persons in a scene, and c) a separate learning model using a multi-class Support Vector Machine (SVM) to predict the exact F-formation of the group of people in the current scene and the angle of approach for the viewing robot. The system is evaluated using two datasets. The results show that group and outlier detection in a scene using our method achieves an accuracy of 91%. We have made rigorous comparisons of our system with a state-of-the-art F-formation detection system and found that it outperforms the state-of-the-art by 29% for formation detection and 55% for combined detection of the formation and approach angle.
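The last stage of this pipeline can be illustrated with a small sketch (not the authors' code): a multi-class SVM maps a skeleton-derived feature vector to an F-formation label. The feature dimensionality, formation labels, and training data below are placeholders.

```python
# Illustrative sketch of the SVM stage of an F-formation pipeline
# (not the authors' code); features and labels are placeholders.
import numpy as np
from sklearn.svm import SVC

# Each row: a flattened, normalized key-point-based feature vector for one scene.
X_train = np.random.rand(200, 34)  # e.g., 17 key points x 2 coordinates per person
y_train = np.random.choice(["circle", "side-by-side", "L-shape", "face-to-face"], 200)

clf = SVC(kernel="rbf", decision_function_shape="ovr")  # one-vs-rest multi-class SVM
clf.fit(X_train, y_train)
print(clf.predict(np.random.rand(1, 34)))  # predicted formation for a new scene
```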
2019
- [RO-MAN] Your instruction may be crisp, but not clear to me! Pradip Pramanick, Chayan Sarkar, and Indrajit Bhattacharya. In 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Oct 2019.
The number of robots deployed in our daily surroundings is ever-increasing. Even in industrial setups, the use of coworker robots is increasing rapidly. These cohabitant robots perform various tasks as instructed by co-located human beings. Thus, a natural interaction mechanism plays a big role in the usability and acceptability of the robot, especially by a non-expert user. Recent developments in natural language processing (NLP) have paved the way for chatbots that generate automatic responses to users’ queries. A robot can be equipped with such a dialogue system. However, the goal of human-robot interaction is not limited to generating responses to queries; it often involves performing some tasks in the physical world. Thus, a system is required that can detect the user-intended task from a natural instruction along with the set of pre- and post-conditions. In this work, we develop a dialogue engine for a robot that can classify and map a task instruction to the robot’s capabilities. If there is some ambiguity in the instruction or some required information is missing, which is often the case in natural conversation, it asks an appropriate question(s) to resolve it. The goal is to generate minimal and pin-pointed queries for the user to resolve an ambiguity. We evaluate our system in a telepresence scenario where a remote user instructs the robot to perform various tasks. Our study with 12 individuals shows that the proposed dialogue strategy can help a novice user to effectively interact with a robot, leading to a satisfactory user experience.
- [IROS] Enabling Human-Like Task Identification From Natural Conversation. Pradip Pramanick, Chayan Sarkar, P Balamuralidhar, Ajay Kattepur, Indrajit Bhattacharya, and Arpan Pal. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nov 2019.
A robot as a coworker or a cohabitant is becoming mainstream day by day with the development of low-cost sophisticated hardware. However, an accompanying software stack that can aid the usability of the robotic hardware remains the bottleneck of the process, especially if the robot is not dedicated to a single job. Programming a multi-purpose robot requires an on-the-fly mission scheduling capability that involves task identification and plan generation. The problem dimension increases if the robot accepts tasks from a human in natural language. Though recent advances in NLP and planner development can solve a variety of complex problems, their amalgamation for a dynamic robotic task handler has been used only in a limited scope. Specifically, the problem of formulating a planning problem from natural language instructions has not been studied in detail. In this work, we provide a non-trivial method to combine an NLP engine and a planner such that a robot can successfully identify tasks and all the relevant parameters and generate an accurate plan for the task. Additionally, some mechanism is required to resolve ambiguity or missing pieces of information in a natural language instruction. Thus, we also develop a dialogue strategy that aims to gather additional information with minimal question-answer iterations and only when necessary. This work makes a significant stride towards enabling a human-like task understanding capability in a robot.
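One common way to hand an NLP-extracted task to a planner, shown here only as a hedged sketch rather than the paper's pipeline, is to render the extracted task frame as a small PDDL problem that a downstream planner can consume; the domain name, predicates, and object names are illustrative.

```python
# Hedged sketch (not the paper's pipeline): turning an extracted task frame
# into a tiny PDDL problem string for a downstream planner. All names are invented.
task = {"action": "deliver", "object": "mug", "source": "kitchen", "goal": "desk"}

pddl_problem = f"""(define (problem deliver-object)
  (:domain service-robot)
  (:objects {task['object']} - item {task['source']} {task['goal']} - location)
  (:init (at {task['object']} {task['source']}) (robot-at {task['source']}))
  (:goal (at {task['object']} {task['goal']})))"""
print(pddl_problem)
```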
2018
- [RO-MAN] DeFatigue: Online Non-Intrusive Fatigue Detection by a Robot Co-Worker. Pradip Pramanick and Chayan Sarkar. In 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Aug 2018.
A robot as a companion or co-worker is not an emerging concept anymore, but a reality. However, one of the major barriers to this realization is seamless interaction with robots, which includes both explicit and implicit interaction. In this work, we assume a use-case where a human and a robot together carry a heavy object in a co-habitat (home or workplace/factory). Two human beings doing such work understand each other without explicit (vocal) interaction. To realize such behavior, the robot must understand the fatigue state of the human co-worker to enable a seamless work experience and ensure safety. In this article, we present DeFatigue, a non-intrusive fatigue state detection mechanism. We assume that the robot’s hand is equipped with a force sensor. Based on the change of force from the human side while carrying the object, DeFatigue is able to determine the fatigue state without instrumenting the human being with an additional sensor (internally or externally). Moreover, it detects the fatigue state on the fly (online) and does not require any (user-specific) training. Based on our experiments with 18 test subjects, fatigue state detection by DeFatigue overlaps with the ground truth in 85.18% of the cases, whereas it deviates by 4.09 s (on average) for the remaining cases.
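The underlying signal idea can be sketched in a few lines; this is a toy monitor, not DeFatigue itself. It assumes the force measured at the robot's hand drifts upward as the human tires and supports less of the load, and it flags a sustained relative increase over a moving-average baseline without per-user training. The window size and threshold are invented.

```python
# Toy fatigue monitor (not DeFatigue): flag a sustained rise of the measured
# force above a moving-average baseline, with no per-user training.
from collections import deque

class FatigueMonitor:
    def __init__(self, window=50, rel_increase=0.25):
        self.baseline = deque(maxlen=window)  # recent force samples (N)
        self.rel_increase = rel_increase      # fraction above baseline that triggers

    def update(self, force: float) -> bool:
        """Feed one force sample; return True when fatigue is suspected."""
        if len(self.baseline) == self.baseline.maxlen:
            mean = sum(self.baseline) / len(self.baseline)
            if force > mean * (1 + self.rel_increase):
                return True
        self.baseline.append(force)
        return False
```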
2017
- [ANTS] NoiseSense: Crowdsourced context aware sensing for real time noise pollution monitoring of the city. Joy Dutta, Pradip Pramanick, and Sarbani Roy. In 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), Dec 2017.
Noise pollution in urban areas is a subject of grave concern, and it is being recognized globally across countries and cities. People face many health-related problems because of it. Therefore, in the proposed work, we aim to tackle the challenge of acquiring real-time and spatially fine-grained noise pollution data with a community-driven sensing infrastructure. Mobile crowdsourcing over smartphones presents a new paradigm for collecting context-aware sensing data over a vast area like a city. Thus, the proposed system exploits the power of mobile crowdsourcing. The proposed system monitors the present noise level in the surroundings of the user and also generates the city’s noise pollution footprint. The noise map reflects the real-time pollution scenario of the city, which changes with time. The prototype of the system has been evaluated through extensive experiments based on crowdsourced sensing data collected by volunteers in the city of Kolkata.
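The basic sensing primitive behind such a system can be sketched as below (this is not the NoiseSense app): a buffer of microphone samples is converted to a decibel value relative to full scale (dBFS); calibrating to dB SPL would require a device-specific offset, which is omitted here.

```python
# Sketch of the sensing primitive (not the NoiseSense app): convert a buffer
# of normalized microphone samples to dBFS via the RMS level.
import math

def buffer_to_dbfs(samples: list[float]) -> float:
    """samples: PCM samples normalized to [-1.0, 1.0]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))  # guard against log(0) in silence

print(round(buffer_to_dbfs([0.02, -0.015, 0.03, -0.025]), 1))
```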