\documentclass[sigplan]{acmart}
\settopmatter{printacmref=false} % Removes citation information below abstract
\renewcommand\footnotetextcopyrightpermission[1]{} % removes footnote with conference information in first column
\pagestyle{plain} % removes running headers
\usepackage[english]{babel}
\usepackage{url}
\begin{document}
\title{Abstractive Scientific Text Summarization using Generative Adversarial Networks}
\author{Maria Dobko}
\affiliation{%
\institution{Ukrainian Catholic University\\
Faculty of Applied Sciences}
\city{Lviv}
\state{Ukraine}
}
\email{dobko_m@ucu.edu.ua}
\author{Oleksandr Zaytsev}
\affiliation{%
\institution{Ukrainian Catholic University\\
Faculty of Applied Sciences}
\city{Lviv}
\state{Ukraine}
}
\email{oleks@ucu.edu.ua}
\author{Yuriy Pryima}
\affiliation{%
\institution{Ukrainian Catholic University\\
Faculty of Applied Sciences}
\city{Lviv}
\state{Ukraine}
}
\email{y.pryima@ucu.edu.ua}
\begin{abstract}
Generative adversarial networks (GANs) have shown a lot of success in image generation. Until recently, however, they were considered inapplicable to the discrete problems of natural language processing (NLP). The latest papers introduce novel approaches that overcome these issues by combining GANs with reinforcement learning models, laying the foundation for a whole new field of research in adversarial language processing.
In our research we will apply GANs to the task of scientific text summarization: given the full text of a paper, we will produce a shorter text that contains a concise summary of the research. Scientific texts have a sophisticated structure and come with many additional sources of contextual information (such as the referenced papers and the topics of other papers written by the same authors), which makes them an interesting target for summarization. We will compare the performance of several discrete GANs that have performed well on similar text generation problems, and try to improve upon recent state-of-the-art approaches to scientific text summarization.
\end{abstract}
\keywords{text summarization, NLP, GAN, reinforcement learning}
\maketitle
\section{Introduction}
Recent studies have shown that neural networks can be used to solve NLP problems. However, the models most commonly considered for these tasks are convolutional and recurrent neural networks.
Applying generative adversarial networks to NLP problems is considered a complicated task because GANs are only defined for real-valued data, while NLP operates on discrete values such as words, characters, or bytes.
\begin{quote}
For example, if you output an image with a pixel value of 1.0, you can change that pixel value to 1.0001 on the next step. If you output the word "penguin", you can't change that to "penguin + .001" on the next step, because there is no such word as "penguin + .001". You have to go all the way from "penguin" to "ostrich".\footnote{Ian Goodfellow's answer to the related question on Reddit: \url{https://www.reddit.com/r/MachineLearning/comments/40ldq6/generative_adversarial_networks_for_text/}}
\end{quote}
However, in a recent paper Fedus, Goodfellow, and Dai \cite{fedus-18} overcome this problem by training the generator with reinforcement learning, while the discriminator is still trained via maximum likelihood and stochastic gradient descent, and apply the resulting model to filling in missing words in text.
This approach, however, has not yet been applied to abstractive summarization of scientific papers.
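The core idea of training a generator over discrete tokens with a reward signal can be illustrated with a toy sketch (our own illustration, not the model from the cited paper): a categorical generator over a tiny vocabulary is updated with the REINFORCE policy gradient, using a fixed stand-in "discriminator" score as the reward. The vocabulary and reward function here are hypothetical.

```python
# Toy illustration of policy-gradient training over discrete tokens.
# The generator is a softmax over logits; the "discriminator" is a
# fixed stand-in reward that prefers one token (an assumption for
# the sake of the example, not a trained network).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["penguin", "ostrich", "eagle"]
logits = np.zeros(len(vocab))  # generator parameters


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def discriminator_reward(token_id):
    # Hypothetical reward: 1 for the "realistic" token, 0 otherwise.
    return 1.0 if vocab[token_id] == "ostrich" else 0.0


lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    tok = rng.choice(len(vocab), p=probs)  # sample a discrete token
    r = discriminator_reward(tok)
    # REINFORCE: gradient of log p(tok) w.r.t. logits is e_tok - probs
    grad = -probs
    grad[tok] += 1.0
    logits += lr * r * grad
```

After training, the generator concentrates its probability mass on the rewarded token, even though no gradient ever flowed "through" the discrete sampling step, which is exactly the property that makes this family of methods attractive for text.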
\section{Problem statement}
The main hypothesis is that generative adversarial networks with a reinforcement-learning-based generator, when applied to the problem of abstractive scientific text summarization, can outperform recent state-of-the-art approaches.
\section{Related work}
Allahyari et al. \cite{allahyari-17} survey the most successful text summarization techniques as of July 2017.
More recently, Li et al. \cite{li-cohn-17} described their submission to the sentiment analysis sub-task of ``Build It, Break It: The Language Edition (BIBI)'', where they successfully apply a generative approach to the problem of sentiment analysis.
In their paper \textit{Generative Adversarial Network for Abstractive Text Summarization}, Liu et al. \cite{liu-17} built an adversarial model that achieved ROUGE scores competitive with state-of-the-art methods on the CNN/Daily Mail dataset. They compare their approach with three methods: an abstractive model, pointer-generator coverage networks, and an abstractive deep reinforced model.
In contrast, Zhang et al. \cite{zhang-17} do not use reinforcement learning; instead they introduce TextGAN, with an LSTM-based generator and a kernelized discrepancy metric.
\section{Research design and methods}
We will apply existing discrete GANs to data collected from arXiv. One of the latest successful models for scientific text summarization will be selected as our baseline.
\subsection{Data collection}
arXiv provides a RESTful API\footnote{\url{https://arxiv.org/help/api/index}} that allows us to search for papers from a specific category within a specific time range. The results are returned as an Atom XML feed, which can be easily parsed. By making requests to arXiv and parsing the responses we acquire all the necessary metadata about each paper (id, title, authors, date, subjects, abstract, etc.)\footnote{Our first attempts at scraping data from arXiv:\\ \url{https://github.com/MachineLearningUCU/arXiv-parsing}}. We then use the collected list of paper ids to download the PDF files, extract text from them, and store it in a table together with the other variables acquired from arXiv.
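The collection step described above can be sketched as follows, using only the Python standard library. The query string parameters follow the public arXiv API; the particular category (\texttt{cs.CL}) and the dictionary field names are our own choices for illustration.

```python
# Sketch of querying the arXiv API and parsing its Atom feed.
# Only standard-library modules are used; field names in the
# returned dicts are illustrative choices, not part of the API.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def build_query_url(query="cat:cs.CL", start=0, max_results=5):
    """Construct an arXiv API search URL for the given query."""
    return ("http://export.arxiv.org/api/query?"
            f"search_query={query}&start={start}&max_results={max_results}")


def parse_feed(xml_bytes):
    """Extract id, title, and abstract from each entry in an Atom feed."""
    feed = ET.fromstring(xml_bytes)
    return [{
        "id": entry.findtext(ATOM + "id"),
        "title": entry.findtext(ATOM + "title"),
        "abstract": entry.findtext(ATOM + "summary"),
    } for entry in feed.iter(ATOM + "entry")]


def fetch(query="cat:cs.CL", max_results=5):
    """Fetch and parse one page of search results from arXiv."""
    with urllib.request.urlopen(build_query_url(query, 0, max_results)) as resp:
        return parse_feed(resp.read())
```

Paginating with the \texttt{start} parameter and feeding the collected ids into a PDF downloader would complete the pipeline.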
\subsection{Timeframes}
\paragraph{Deliverables for the $1^{st}$ evaluation}
\begin{itemize}
\item Dataset of papers collected from arXiv
\item Results of feature extraction
\item Implementations of the baseline state-of-the-art model and its application to our dataset
\end{itemize}
\paragraph{Deliverables for the $2^{nd}$ (final) evaluation}
\begin{itemize}
\item Implementation of several discrete GAN models
\item Evaluation of the created models on our dataset
\item Paper describing the results of our research
\end{itemize}
\section{Strengths and weaknesses of the study}
GANs have proved most successful at generating images, but their application to text generation problems is not well studied. We therefore expect our research to introduce novel approaches and original ideas that may advance the field of natural language processing.
However, there is a high risk that this approach may not yield good results at all, since many open issues remain in applying neural networks to language data. Another possible weakness is the difficulty of comparing our results with other papers, as little research exists on this or closely related subjects.
We are also currently looking for supervisors who might be interested in this topic, so that we can have mentorship during the research.
\bibliographystyle{alpha}
\bibliography{proposal}
\end{document}