AI-powered job interview software may be just as bullshit as you suspect, according to tests run by the MIT Technology Review’s “In Machines We Trust” podcast that found two companies’ software gave good marks to someone responding to an English-language interview in German.
Companies that advertise software tools powered by machine learning for screening job applicants promise efficiency, effectiveness, fairness, and the elimination of shoddy decision-making by humans. In some cases, all the software does is read resumes or cover letters to quickly determine if an applicant’s work experience appears right for the job. But a growing number of tools require job-seekers to navigate a hellish series of tasks before they even come close to a phone interview. These can range from having conversations with a chatbot to submitting to voice/face recognition and predictive analytics algorithms that judge them based on their behavior, tone, and appearance. While the systems might save human resources staff time, there’s considerable skepticism that AI tools are anywhere near as good (or unbiased) at screening applicants as their developers claim.
The Technology Review’s tests add more weight to those concerns. The podcast’s reporters tested two AI recruiting tools: MyInterview and Curious Thing. MyInterview ranks applicants based on observed traits associated with the Big Five Personality Test: openness, conscientiousness, extroversion, agreeableness, and emotional stability. (While the Big Five is widely used in psychology, Scientific American reported that experts say its use in commercial applications is iffy at best and often flirts with pseudoscience.) Curious Thing also measures other personality traits such as “humility and resilience.” Both tools then offer assessments, with MyInterview comparing those scores to the characteristics hiring managers say they prefer.
To test these systems, the Technology Review created fake job postings for an office administrator/researcher on both apps and constructed fake candidates they believed would fit the role. The site wrote:
On MyInterview, we selected characteristics like attention to detail and ranked them by level of importance. We also selected interview questions, which are displayed on the screen while the candidate records video responses. On Curious Thing, we selected characteristics like humility, adaptability, and resilience.
One of us, [Hilke Schellmann], then applied for the position and completed interviews for the role on both MyInterview and Curious Thing.
On Curious Thing, Schellmann completed one video interview and received an 8.5 out of 9 for English competency. But when she retook the test, reading answers straight off the German-language Wikipedia page on psychometrics, it returned a 6 out of 9 score. According to the Technology Review, she then retook the test with the same approach and got a 6 out of 9 again. MyInterview performed similarly, ranking Schellmann’s German-language video interview at a 73% match for the job (putting her in the upper half of applicants recommended by the site).
MyInterview also transcribed Schellmann’s answers from the video interview, and the Technology Review wrote that the result was pure gibberish:
So humidity is desk a beat-up. Sociology, does it iron? Mined material nematode adapt. Secure location, mesons the first half gamma their Fortunes in for IMD and fact long on for pass along to Eurasia and Z this particular location mesons.
While HR staff might catch the garbled transcript, this is concerning for obvious reasons. If an AI can’t even tell that a job applicant isn’t speaking English, then one can only speculate as to how it might handle an applicant speaking English with a heavy accent, or just how it is deriving personality traits from the responses. Other systems that rely on even more dubious metrics, like facial expression analysis, may be even less trustworthy. (One of the firms that used expression analysis to determine cognitive ability, HireVue, stopped doing so in the last year after a complaint filed with the Federal Trade Commission accused it of “deceptive or unfair” business practices.) As the Technology Review noted, most companies that build such tools treat the technical details of how they work as trade secrets, meaning they’re extremely difficult to externally vet.
Even text-based systems are prone to bias and questionable results. LinkedIn was forced to overhaul its algorithm that matched job candidates with opportunities, and Amazon reportedly ditched internally developed resume-reviewing software, after finding in both cases that the systems continued discriminating against women. In the case of Amazon, the software allegedly sometimes recommended unqualified applicants at random.
Clayton Donnelly, an industrial and organizational psychologist who works with MyInterview, told the Technology Review the site scored Schellmann’s personality results on the intonation of her voice. Rice University professor of industrial-organizational psychology Fred Oswald told the site that was a BS metric: “We really can’t use intonation as data for hiring. That just doesn’t seem fair or reliable or valid.”
Oswald added that “personality is hard to ferret out in this open-ended sense,” referring to the loosely structured video interview, whereas psychological testing mandates “the way the questions are asked to be more structured and standardized.” He also told the Technology Review he didn’t believe current systems had gathered the data to make those decisions accurately, or even that they had a reliable method for collecting it in the first place.
Sarah Myers West, who works on the social implications of AI at New York University’s AI Now Institute, told the Chicago Tribune earlier this year, “I don’t think the science really supports the idea that speech patterns would be a meaningful assessment of someone’s personality.” One example, she said, is that historically AIs have performed worse when trying to understand women’s voices.
Han Xu, the co-founder and chief technology officer of Curious Thing, told the Technology Review this was actually a great result as it “is the very first time that our system is being tested in German, therefore an extremely valuable data point for us to research into and see if it unveils anything in our system.”