PhD Student, Intelligent Systems Program, University of Pittsburgh
13 Oct 2020 - Arun Balajiee
In this talk, Dr. Bamman discussed constructing a dataset that could improve the performance of NLP models on novels. Current systems that use BERT or other models in their NLP pipelines are typically trained on the ACE datasets (a collection of news articles from various sources). These models perform very well when evaluated on such datasets, but perform poorly on text that is not sourced from mainstream or social media, such as novels. A similar drop is noticeable when these models are evaluated on text outside the language they were built for, as in German. Applied NLP work on fiction commonly draws on datasets such as BookCorpus, NarrativeQA, Commonsense stories (a crowdsourced dataset of very short stories) and Google Books (for the historical development of language over the years). Modelling literary phenomena is the other side of the problem targeted by NLP models built on datasets sourced from fiction.
The crux of the talk was building a dataset annotated to handle a broader set of entities (Nested Entity Recognition) that are not commonly seen in other datasets, along with coreference resolution across character arcs, propagation of information among the characters in a story, quotation attribution and similar use cases in this direction. For this, Dr. Bamman et al. constructed the LitBank dataset, which is annotated with entity tags, coreferences, events and quotation attribution.
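To make the idea of nested entities concrete for myself, here is a minimal sketch of my own (it is not from the talk and does not reflect LitBank's actual file format). The `Span` type, the `nested_pairs` helper, the example phrase and the `PER`/`FAC` tags are all illustrative assumptions; the point is only that one entity mention can sit entirely inside another.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)
    label: str  # illustrative entity tag, assumed for this sketch


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies entirely within outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            if outer != inner and outer.start <= inner.start and inner.end <= outer.end:
                pairs.append((outer, inner))
    return pairs


# A made-up phrase: the person mention contains a facility mention.
tokens = ["the", "housekeeper", "of", "the", "old", "manor"]
spans = [
    Span(0, 5, "PER"),  # "the housekeeper of the old manor"
    Span(3, 5, "FAC"),  # "the old manor"
]

for outer, inner in nested_pairs(spans):
    inner_text = " ".join(tokens[inner.start:inner.end + 1])
    outer_text = " ".join(tokens[outer.start:outer.end + 1])
    print(f"'{inner_text}' ({inner.label}) is nested inside '{outer_text}' ({outer.label})")
```

A flat tagging scheme that allows only one label per token would have to drop one of these two mentions, which is why nested annotation matters for this kind of text.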
Using these annotations with a BERT model, Dr. Bamman discussed aspects of characterization and showed performance on par with or better than a system trained on the ACE corpus and evaluated on a literary text. Further, both visually and through the performance evaluation of his model, Dr. Bamman showed how information propagates from one character to another and touched on the notion of identifying metaphors. He then briefly touched upon the gender bias and stereotypes found in novels, and on mitigating them by identifying these instances.
Finally, he added a link to the dataset in his presentation.
Some things that I plan to understand better after this presentation are affect modelling (the emotions of different characters based on information propagation), how affect is distributed in novels with more modern texts, and the uses of these ideas in educational technologies.