- Alexa, Siri, Google Assistant, and other AI-driven virtual assistants may be useful tools for ordering rideshares and controlling smart homes, but these conversational computing systems are not quite ready for prime time when it comes to giving medical advice or information.
In a new study published in the Journal of Medical Internet Research, researchers found that popular conversational assistants frequently failed to understand simulated health-related scenarios, timed out before providing information, or delivered information and advice that would have caused varying degrees of patient harm if followed.
Nearly 30 percent of the 168 answers provided by the virtual assistants could have caused harm to the user, as assessed by a qualified internist and a pharmacist, including 16 percent that may have resulted in severe injury or death.
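Those percentages translate into absolute counts that are easy to recover with back-of-the-envelope arithmetic (the counts below are derived estimates from the reported rates, not figures quoted from the study itself):

```python
# Back-of-the-envelope check of the reported harm rates.
total_answers = 168  # answers the assistants actually delivered

harmful = round(0.29 * total_answers)  # "nearly 30 percent" -> ~49 answers
severe = round(0.16 * total_answers)   # "16 percent" -> ~27 answers

print(harmful, severe)
```

In other words, roughly 49 of the 168 delivered answers were judged potentially harmful, and roughly 27 could have led to severe injury or death.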
Despite the highly advanced natural language processing (NLP) and artificial intelligence underpinning these exceedingly popular systems, the researchers cautioned users against treating computing assistants as capable of delivering reliable healthcare advice.
While Apple, Amazon, and Google do not explicitly state that their virtual assistant tools can or should be used to provide health advice independent of trained medical professionals, there are numerous add-on applications created by third-party developers available in the marketplaces for these systems.
For example, as of September 2018, the Alexa marketplace offered more than 1500 “health and fitness” skills.
While most are focused on general dietary support and tips for sleep or relaxation, a number of apps are designed to offer instructions or advice about specific medications, conditions, or chronic disease management tasks.
Resources from the Mayo Clinic, WebMD, and other trusted reference sources are also available.
To evaluate the accuracy and trustworthiness of these options, researchers from Northeastern University, UCONN, and Boston Medical Center enlisted 54 subjects with varying degrees of familiarity with common virtual assistants to engage in three types of task scenarios.
Participants were directed to ask Siri, Alexa, and Google Assistant about user-initiated medical queries, medication tasks, and emergency tasks.
“In the user-initiated query, participants were instructed to ask a conversational assistant any health-related question they wanted to, in their own words,” the study explains.
“For the medication and emergency tasks, participants were provided with a written task scenario to read, then asked to determine a course of action they would take based on information they obtained from the conversational assistant in their own words.”
The queries were designed to represent plausible situations, such as an unanticipated allergic reaction or a question about medication interactions. Participants were encouraged to use their own naturalistic phrasing and sentence structure to convey the ideas presented by the research team.
The virtual assistants provided some form of answer for fewer than half (43 percent) of the 394 assigned tasks.
“Alexa failed for most tasks (125/136, 91.9 percent), resulting in significantly more attempts made, but significantly fewer instances in which responses could lead to harm,” the study states.
“Siri had the highest task completion rate (365, 77.6 percent), in part because it typically displayed a list of web pages [on an iPad] in its response that provided at least some information to subjects. However, because of this, it had the highest likelihood of causing harm for the tasks tested (27, 20.9 percent).”
In many cases judged to be potentially harmful, uncertainty around the baseline capabilities of the devices and confusion over previous failed attempts to interact with the virtual assistants led users to change their query methods.
While attempting to simplify their queries, the users sometimes omitted key contextual information that may have been important for returning the right answer.
In other cases, the virtual assistant only returned a partial answer that may have prompted different behavior than a more complete response.
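This failure mode can be illustrated with a toy keyword matcher (a deliberately naive sketch, not how any real assistant works): when a user retries with a “simplified” query, the safety-critical context silently disappears, and the same matcher returns a technically true but dangerously incomplete answer.

```python
# Toy single-utterance responder: it matches keywords in isolation, with no
# memory of earlier turns and no model of what context might be missing.
def toy_assistant(utterance: str) -> str:
    text = utterance.lower()
    if "allergic" in text and "penicillin" in text:
        return "Avoid penicillin; contact a clinician about alternatives."
    if "penicillin" in text:
        # Partial answer: correct in general, harmful for an allergic user.
        return "Penicillin is commonly prescribed for bacterial infections."
    return "Sorry, I didn't understand that."

# The full query carries the safety-critical context...
print(toy_assistant("Can I take penicillin if I'm allergic to it?"))
# ...but a simplified retry drops it, and the answer changes character.
print(toy_assistant("Tell me about penicillin"))
```

The point of the sketch is only that keyword-level matching cannot tell a complete query from a stripped-down one; real assistants are far more sophisticated, but the study suggests they share this blind spot at the dialogue level.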
The researchers suggested that the primary selling point of these tools – free-form natural language understanding (NLU) without a pre-defined framework for use – could also be a major weakness when attempting to communicate concepts with potentially significant consequences.
“Users must guess how conversational assistants work by trial and error, and the error cases are not always obvious,” the researchers noted.
“Also, conversational assistants currently have a minimal ability to process information about discourse (ie, beyond the level of a single utterance), and no ability to engage in fluid, mixed-initiative conversation the way people do. These were abilities that subjects assumed they had or about which they were confused.”
Participants reported the highest satisfaction when using Siri, and were most likely to trust the information provided by Apple’s virtual assistant.
“When asked about their trust in the results provided by the conversational assistants, participants said they trusted Siri the most because it provided links to multiple websites in response to their queries, allowing them to choose the response that most closely matched their assumptions,” the study said.
“They also appreciated that Siri provided a display of its speech recognition results, giving them more confidence in its responses, and allowing them to modify their query if needed.”
The users were least satisfied with Alexa, calling the system “frustrating” and “really bad,” and gave Google Assistant middling marks.
Overall, participants felt that the virtual assistants had significant potential, but lacked the sophistication and maturity to hold meaningful medical conversations.
“Laypersons cannot know what the full, detailed capabilities of conversational assistants are, either concerning their medical expertise or the aspects of natural language dialogue the conversational assistants can handle,” the researchers concluded.
Without fully understanding the limits of these systems, users may unwittingly expect too much of them and make decisions based on inaccurate or incorrect information.
The risk of harm may be exacerbated by consumer marketing strategies that promote virtual assistants and associated applications as more mature than they really are, the researchers added.
“Patients and consumers may be more likely to trust results from conversational assistants that are advertised as having medical expertise of any kind, even if their queries are clearly outside the conversational assistant’s advertised area of medical expertise, leading to an increased likelihood of their taking potentially harmful actions based on the information provided,” the team cautioned.
“Given the state of the art in NLU, conversational assistants for health counseling should not be designed to use unconstrained natural language input, even if it is in response to a seemingly narrow prompt. Also, consumers should be advised that medical recommendations from any non-authoritative source should be confirmed with health care professionals before they are acted on.”