Thanks to social media, spoilers are everywhere. The past few years have been particularly bad for them: Game of Thrones, the MCU films, whatever the latest Netflix original is – they’re all loaded with juicy details that some fans just love to spoil for the rest of us. At this point, spoilers are becoming almost impossible to avoid.
Thankfully, someone has finally found a way to warn us about them.
This year, a team from UC San Diego developed a neural network that can detect likely spoilers in text and alert readers to the potential danger ahead.
“Spoilers are everywhere on the internet, and are very common on social media. As internet users, we understand the pain of spoilers, and how they can ruin one’s experience,” said Ndapa Nakashole, a computer science professor at UC San Diego.
Dubbed ‘SpoilerNet’, the artificial intelligence tool was trained to spot unwanted reveals using review data from Goodreads. The popular book review platform made an excellent data source because it lets users mark spoilers in their posts with tags. The team analysed the linguistic patterns in more than 1.3 million tagged reviews.
The dataset revealed that spoiler sentences tended to be grouped together in the latter part of reviews, but this was not a universal rule. As could probably be expected, different users had different ideas of what counted as a spoiler (and probably always will) – so it is difficult to train a neural network to be flawless in its detection skills.
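To make the "latter part of reviews" observation concrete, here is a minimal Python sketch of how one might measure where tagged sentences fall within a review. The toy reviews and the `spoiler_positions` helper are hypothetical illustrations, not the team's data or code.

```python
from statistics import mean

# Hypothetical toy data: each review is a list of (sentence, is_spoiler) pairs.
# These are invented examples, not records from the Goodreads dataset.
REVIEWS = [
    [
        ("Loved the writing.", False),
        ("The pacing drags early on.", False),
        ("Then the mentor turns out to be the villain.", True),
        ("And he dies in the final chapter.", True),
    ],
    [
        ("The twin was alive all along.", True),
        ("Great worldbuilding.", False),
        ("Would recommend.", False),
    ],
]


def spoiler_positions(reviews):
    """Return the relative position of every tagged sentence.

    0.0 means the first sentence of a review, 1.0 means the last.
    """
    positions = []
    for review in reviews:
        last = len(review) - 1
        for index, (_, is_spoiler) in enumerate(review):
            if is_spoiler:
                # A one-sentence review has no meaningful position; use 0.0.
                positions.append(index / last if last > 0 else 0.0)
    return positions


positions = spoiler_positions(REVIEWS)
print(f"mean relative position of spoiler sentences: {mean(positions):.2f}")
```

In this toy sample, the first review clusters its spoilers at the end while the second opens with one, which mirrors the point that the pattern is a tendency rather than a rule.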
What’s more, the actual semantics of the words used differed greatly between reviews. In a report from TechCrunch, it’s noted that “the model occasionally mistakes a sentence as having spoilers if other spoiler-ish sentences are adjacent; and its understanding of individual sentences is not quite good enough to understand when certain words really indicate spoilers or not. You and I know that ‘this kills Darth Vader’ is a spoiler, while ‘this kills the suspense’ isn’t, but a computer model may have trouble telling the difference.”
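The Darth Vader example is easy to reproduce with a deliberately naive keyword check. The Python sketch below is not how SpoilerNet works; the word list and the `naive_flag` function are assumptions made up for illustration, and they show why matching loaded words alone cannot separate the two sentences.

```python
import re

# Hypothetical list of "loaded" words; not taken from the SpoilerNet research.
LOADED_WORDS = {"kill", "kills", "killed", "dies", "died", "death"}


def naive_flag(sentence):
    """Flag a sentence as a spoiler if it contains any loaded word."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(token in LOADED_WORDS for token in tokens)


for sentence in ["This kills Darth Vader.", "This kills the suspense."]:
    print(sentence, "->", naive_flag(sentence))
```

Both sentences are flagged, because the check only sees the word ‘kills’ and knows nothing about what is being killed.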
The specific context of language is also difficult for a machine to understand. In one book, for example, a character dying might be a major plot twist – but in another, it could be the starting point for the story. Without an understanding of the minutiae of what’s being discussed, the data is much harder to interpret.
Still, the vastness of the information available was invaluable in helping to develop the tool.
“To our knowledge, this is the first dataset with spoiler annotations at this scale and at such a fine-grained granularity,” said Mengting Wan, a Ph.D. student and lead author on the research paper.
Once the training was done, the machine’s capabilities were tested using different datasets – some from Goodreads, and others from TV Tropes. Astoundingly, the AI was able to label chunks of text as either ‘spoiler’ or ‘non-spoiler’ with 92% accuracy.
“Such a model design indeed benefits from the new large-scale review dataset we collected for this work, which includes complete review documents, sentence-level spoiler tags, and other meta-data,” Wan said.
He continued: “To our knowledge, the public dataset (released in 2013) before this work only involves a few thousand single-sentence comments rather than complete review documents. For research communities, such a dataset also facilitates the possibility of analyzing real-world review spoilers in details as well as developing modern ‘data-hungry’ deep learning models in this domain.”
With smaller chunks of information, however, the machine struggled. When given a dataset of over 16,000 summaries of just under 900 different TV shows, the accuracy with which it detected spoilers dropped to between 74% and 80%. Again, most of the errors were down to the machine not understanding the semantics or context of often-loaded terms such as ‘kill’ or ‘dies’.
Once the kinks are ironed out, it is possible that SpoilerNet could become a programme that runs in real-time – perhaps as a browser plugin or app. The network would then be able to read ahead of unwitting fans, and hide anything it thinks could give away unwanted details. Right now, however, there are no plans to commercialise the software.
If you want to read more about the UC San Diego team’s research, you can check it out here.