IBM spent the past six years building an AI that can hold a structured discussion with a human at a nearly human level. IBM demonstrated its Project Debater AI at its Think event on February 11 in a live on-stage debate with an expert human debater.
Project Debater lost the demonstration debate. However, as with games such as chess and Go, it is only a matter of time before the AI starts winning arguments with the best of humans.
At the February debate, I believe IBM’s Project Debater stepped into the “uncanny valley” of natural language processing (NLP). What will that mean for humanity, and more importantly, for commerce?
What is the Uncanny Valley for Natural Language Processing?
The closer a robot or simulation gets to looking and behaving like a human, the less tolerant humans become of visual or behavioral mistakes. This is called the uncanny valley.
The uncanny valley effect is related to suspension of disbelief in entertainment. We easily tolerate major violations of physics and physiology when watching cartoons, but as simulated actors become more realistic, we expect their physics and physiology to behave correctly.
As a robot or simulation more accurately depicts human looks and physiology, it can pass through the uncanny valley and be mistaken for human. The word “deepfake” was recently coined to describe such hyper-realistic simulations of humans; the term also captures the sense of betrayal people feel upon discovering they can no longer tell the difference between real and simulated humans.
For NLP, the uncanny valley is the point at which a human listener realizes that they had started to anthropomorphize the AI, but then something went wrong with the AI’s response and betrayed its non-humanness. The listener may not be able to pinpoint what is wrong, but they perceive that the conversation went sideways and the AI is no longer responding appropriately. In a blind test, a human listener might even conclude that the AI is an erratic or unstable human.
Setting the Scene
Project Debater ran on on-premises servers and storage at IBM Research’s lab in Haifa, Israel, mirrored by a redundant deployment of the system running in IBM Cloud.
Project Debater’s opponent was Harish Natarajan, a debate champion. Natarajan was a grand finalist at the 2016 World Debating Championships and winner of the European Debating Championship in 2012. He holds the world record for the most debate competition victories.
1. Resolution & Position Notification. The opponents are notified of the debate topic, or “resolution,” shortly before the debate starts, and both sides have only fifteen minutes to prepare. In a typical debate, opponents would also be notified of which position they will be taking on the resolution. For consistency, Project Debater always argues for the resolution, though it can argue either side. Neither side could use the internet to prepare for the debate.
The resolution for this debate was “We should subsidize preschools”. Project Debater was not trained on this topic specifically and was arguing for the resolution. Project Debater’s creators gave it a very large, general-purpose body of knowledge to draw from—400 million articles from newspapers, magazines, and journals—so it was up to the AI to argue using information it already had access to. It did not have access to the open internet during the debate (nor, apparently, at any time during its development, probably a good thing for a developing AI mind).
While the opponents crafted their opening arguments, the debate moderator polled the audience to find out how many of us agreed and disagreed with the resolution. This set an objective baseline for judging the winner of the debate.
2. Opening Arguments, four minutes per opponent. Each opponent presented their side of the topic with prepared remarks. The opponent speaking second did not respond to the first speaker’s positions. Project Debater spoke first and did well in this phase of the debate, assembling an impressive body of knowledge to support and defend the resolution.
3. Moderator Summary. The moderator then took almost five minutes to summarize the two opinions while the opponents framed their first rebuttals. However, Project Debater only needed about two minutes to prepare its rebuttal.
4. Rebuttal, four minutes per opponent. Each opponent responded directly to the other’s opening arguments. Intelligence Squared’s normal debate process is to hold a discussion panel with the two opponents directly conversing with each other, followed by an audience question-and-answer period. However, Project Debater cannot yet hold a real-time conversation with a person, so the debate format was adjusted to give the AI time to think, so to speak, and to clarify its position through this additional round of rebuttals.
This is when Project Debater noticeably stepped into the NLP uncanny valley. Project Debater did not understand several nuances of Mr. Natarajan’s arguments and continued to barrage Mr. Natarajan with a bewildering array of direct quotes and numeric statistics. If a human had delivered Project Debater’s rebuttal, I would have questioned the veracity of the information, because no human could recall such a deep, encyclopedic knowledge base with Project Debater’s accuracy, precision, and speed.
Note that after the debate, the audience was asked “which debater enriched your knowledge?” Project Debater won nearly 60% to Mr. Natarajan’s 20%. This may be due to a combination of human audience members’ cognitive bias that computers cannot lie and the copious amounts of detailed and official-sounding data supplied by Project Debater.
5. Final Rebuttal Preparation. Both opponents were given another four-and-a-half minutes to frame their final rebuttals and closing arguments. Again, Project Debater only needed two minutes to prepare its final rebuttal and closing remarks.
6. Summary / Closing Remarks, two minutes per opponent. Each opponent reacted to the other’s arguments and then framed their final position on the debate resolution. It was here that Mr. Natarajan’s more “common sense” approach won the debate in the face of Project Debater’s very dry and extremely precise summary of the data it provided (anthropomorphizing a bit, it seemed proud of the copious data points it had supplied), capped by a tenuously related and heavy-handed quote by Benjamin Disraeli.
7. Outcome. After the debate, the moderator polled the audience again to measure how many audience members changed their position on the resolution. More of the audience agreed with Mr. Natarajan’s position after the debate than at the beginning, so he was declared the winner. Seventeen percent of the audience shifted their positions, though I can’t help but think there was a little selection bias from the human audience in favor of the flagrantly human opponent.
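The scoring rule the moderator applied is simple enough to sketch in a few lines of code. This is an illustrative toy, not anything IBM ran; the percentages in the example are hypothetical, chosen only to be consistent with the seventeen-point swing described above.

```python
# Toy sketch of the audience-poll scoring rule (not IBM's code):
# the winner is whichever side gained audience support between the
# pre-debate and post-debate polls on the resolution.

def debate_winner(pre_for: float, post_for: float) -> str:
    """pre_for/post_for: percent of the audience agreeing with the resolution."""
    shift = post_for - pre_for
    if shift > 0:
        return "for"       # the side arguing for the resolution persuaded people
    if shift < 0:
        return "against"   # the opposing side persuaded people
    return "tie"

# Hypothetical numbers: agreement with the resolution drops 17 points,
# so the opposing (human) side is declared the winner.
print(debate_winner(pre_for=79.0, post_for=62.0))  # -> against
```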
I spoke with Noam Slonim, IBM Research’s Principal Investigator for Project Debater, a few days after the debate.
The proposal for Project Debater was a single PowerPoint slide delivered in 2011. In 2012 IBM Research established an investigation team, and “intense” work began on Project Debater in 2014.
Mr. Slonim assesses Project Debater’s progress over the past few years using grade levels:
- Its first debate was in early 2016, and it had not yet reached an elementary school level of argument.
- In 2017 it achieved a high school level of debate.
- In 2018, it achieved a college level of debate.
- And in early 2019, at this debate, it showed competent university-level debate skills.
Into the Valley
My assessment of Project Debater’s potential is that in early 2019, IBM’s Project Debater reached the uncanny valley for Natural Language Processing. But it won’t stop there.
My first thought was that Project Debater might be the ultimate anti-troll countermeasure. With a huge body of verifiable facts at its disposal and a practically unlimited amount of time, it should be able to easily counter every fake “fact” presented by a troll. And it will never tire of adding one more response to an argument. Automated systems are well suited to getting the last word in a never-ending argument.
If someone were to combine Project Debater’s ability to understand the context of a human discussion and respond appropriately with OpenAI’s recently announced ability to tell a coherent but completely fake story, that would lead AI down a dark and completely sociopathic path. Note that IBM is at the forefront of AI ethics and fairness research, but some governments and other sovereign actors might see an opportunity here.
Given Project Debater’s improvements over the past three years, it’s also possible that it might evolve its debate skills to become the world’s best-informed, most polite, and most persistent salesperson. That could happen within just a few years, which is a somewhat scary thought. If you have trouble saying ‘no’ to salespeople now, it could become much harder.
The Real Story
After I wrote all of the above, I asked IBM about its future plans for Project Debater. The public debate phase of Project Debater is over. Its performance in San Francisco at IBM Think was credible, and I don’t think IBM has much to gain in trying to attain a Jeopardy-style win over humans.
IBM will commercialize this technology. The first incarnation is “Project Debater – Speech by Crowd”. This product will use Project Debater to crowd-source arguments around a specific topic. Then it will weave those arguments into two polar positions: one for and one against the topic. You can read more about the process here.
The challenge with this approach is that the Speech by Crowd system does not appear to validate the positions it receives during the crowd-sourcing phase. IBM has already used controversial topics as examples of the system’s potential, such as “flu vaccination should be mandatory” and “we should ban the sale of violent video games to minors”. However, many people with entrenched positions on these controversial topics have internalized disinformation and fake data they’ve received through sources they trust.
Not all opinions are valid, and therefore not all opinions deserve the same consideration. Some opinions are scientifically provable as wrong. Some opinions are simply unsupported innuendo. Project Debater could make a significant contribution to the quality of public discourse and debate by matching opinionated arguments (pro and con) with verifiable and unbiased research and sources. But IBM needs to build in that capability. Simply repeating the arguments that show up most often (or however Project Debater – Speech by Crowd weighs its crowd-sourced input) will make speech writing easier, but it won’t do anything to improve the quality of the facts in the speech.
This is a unique point in history for IBM to deliver a solution that helps cut through the clutter to enable better public policy setting and decision making.
You can view the recorded IBM Think debate for yourself here.
This article was written for Forbes.com; to view the original article, click here.