key: cord-027316-echxuw74 authors: Modarresi, Kourosh title: Detecting the Most Insightful Parts of Documents Using a Regularized Attention-Based Model date: 2020-05-22 journal: Computational Science - ICCS 2020 DOI: 10.1007/978-3-030-50420-5_20 sha: doc_id: 27316 cord_uid: echxuw74

Every individual text or document is generated for a specific purpose (or purposes). Sometimes the text is deployed to convey a specific message about an event or a product. On other occasions, it may be communicating a scientific breakthrough, a development, a new model, and so on. Given any specific objective, the creators and the users of documents may want to know which part(s) of the documents are more influential in conveying their specific messages or achieving their objectives. Understanding which parts of a document have more impact on the viewer's perception would allow content creators to design more effective content. Detecting the more impactful parts of content would help content users, such as advertisers, to concentrate their efforts on those parts and thus avoid spending resources on the rest of the document. This work uses a regularized attention-based method to detect the most influential part(s) of any given document or text. The model uses an encoder-decoder architecture with an attention-based decoder, with regularization applied to the corresponding weights.

The main purpose of NLP (Natural Language Processing) and NLU (Natural Language Understanding) is to understand language. More specifically, they focus not just on the context of text but also on how humans use language in daily life. Thus, among other ways of utilizing this, we could provide an optimal online experience that addresses the needs of users' digital experience. Language processing and understanding is much more complex than many other applications in machine learning, such as image classification, since NLP and NLU involve deeper context analysis than other machine learning applications.

This paper is written as a short paper and focuses on explaining only the parts that are the contribution of this paper to the state of the art. Thus, it does not describe the state-of-the-art works in detail and instead uses those works [2, 4, 5, 8, 53, 60, 66, 70, 74, 84] to build its model as a modification and extension of the state of the art. A comprehensive set of reference works has been added for anyone interested in learning more details of the previous state-of-the-art research [3, 5, 10, 17, 33, 48, 49, 61-63, 67-73, 76, 77, 90, 91, 93].

Deep learning has become a main model in natural language processing applications [6, 7, 11, 22, 38, 55, 64, 71, 75, 78-81, 85, 88, 94]. Among deep learning models, RNN-based models like LSTM and GRU have often been deployed for text analysis [9, 13, 16, 23, 32, 39-42, 50, 51, 58, 59]. Although modified versions of RNNs (recurrent neural networks) such as LSTM and GRU are an improvement over plain RNNs in dealing with vanishing gradients and long-term memory loss, they still suffer from many deficiencies. As a specific example, an RNN-based encoder-decoder architecture uses the encoded vector (feature vector), computed at the end of the encoder, as the input to the decoder, and treats this single vector as a compressed representation of all the data and information from the encoder (input).
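To make the bottleneck argument concrete, the following is a minimal NumPy sketch, not the paper's implementation: the plain tanh RNN cell, all dimensions, and the function names are illustrative assumptions. It shows a vanilla encoder-decoder in which the decoder receives only the final encoder state, so earlier encoder states cannot be revisited.

```python
import numpy as np

def rnn_encoder_decoder(x_seq, m, d=64):
    """Vanilla RNN encoder-decoder (illustrative sketch only).

    x_seq : list of n input vectors (the encoder sequence)
    m     : number of decoder steps to generate
    d     : hidden size (arbitrary choice for this sketch)
    """
    rng = np.random.default_rng(0)
    # Randomly initialized parameters stand in for trained weights.
    W_enc = rng.standard_normal((d, d)) * 0.1
    U_enc = rng.standard_normal((d, x_seq[0].shape[0])) * 0.1
    W_dec = rng.standard_normal((d, d)) * 0.1
    V_out = rng.standard_normal((x_seq[0].shape[0], d)) * 0.1

    # --- Encoder: compress the whole input into one feature vector ---
    h = np.zeros(d)
    for x_i in x_seq:
        h = np.tanh(W_enc @ h + U_enc @ x_i)
    context = h  # the single encoded vector handed to the decoder

    # --- Decoder: sees only `context`, never the per-step encoder states ---
    s = context
    outputs = []
    for _ in range(m):
        s = np.tanh(W_dec @ s)      # no access to earlier encoder states
        outputs.append(V_out @ s)   # unnormalized output at this step
    return outputs

# Usage: 10 input tokens of dimension 32, decode 5 steps.
demo_inputs = [np.random.randn(32) for _ in range(10)]
print(len(rnn_encoder_decoder(demo_inputs, m=5)))
```

Everything the decoder knows about the input is squeezed into the single `context` vector, which is exactly the limitation the attention mechanism described next is meant to remove.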
This ignores the possibility of looking at all previous sequences of the encoder and thus suffers from an information bottleneck, leading to low precision, especially for texts of medium or long sequences. To address this problem, a global attention-based model [2, 5] is used, in which each decoder sequence attends to all of the encoder sequences. Figure 1 shows an attention-based model, where i = 1, ..., n indexes the encoder sequences and t = 1, ..., m indexes the decoder sequences. Each decoder state looks into the data from all the encoder sequences with specific attention measured by the weights. Each weight, w_ti, indicates the attention decoder network t pays to encoder network i. These weights depend on the previous decoder state, the previous output, and the present encoder state, as shown in Fig. 2. Given the complexity of these dependencies, a neural network model is used to compute these weights: two fully connected layers (of 1024 units each) with ReLU activation. Here H is the state of the encoder networks, s_{t-1} is the previous state of the decoder, and v_{t-1} is the previous decoder output. Also, W_t = (w_{t1}, ..., w_{tn}) is the vector of weights over the encoder states at decoder step t. Since W_t is the output of a softmax function,

\sum_{i=1}^{n} w_{ti} = 1, \qquad 0 \le w_{ti} \le 1.

This section gives an overview of the contribution of this paper and explains the extension made to the state-of-the-art model. A major point of attention in many text-related analyses is to determine which part(s) of the input text have had more impact in determining the output. The input text could be very long, comprising potentially hundreds or thousands of words or sequences, i.e., n could be a very large number. Thus, there are many weights (w_ti) involved in determining any part of the output v_t, and since many of these weights are correlated, it is difficult to determine the significance of any input sequence in computing any output sequence v_t. To make these dependencies clearer and to recognize the most significant input sequences for any output sequence, we apply a zero-norm penalty so that the corresponding weight vector becomes sparse. To achieve the desired sparsity, the zero-norm (L_0) is applied to make any corresponding W_t vector very sparse, as the penalty leads to minimization of the number of non-zero entries in W_t. The process is implemented by imposing the constraint

\min \|W_t\|_0,

where \|W_t\|_0 is the number of non-zero entries of W_t. Since L_0 is computationally intractable, we could use surrogate norms such as the L_1 norm or the Euclidean norm, L_2. To impose sparsity, the L_1 norm, LASSO [8, 14, 15, 18, 21], is used in this work,

\|W_t\|_1 = \sum_{i=1}^{n} |w_{ti}|,

as the penalty function to enforce sparsity on the weight vectors. This penalty, \beta \|W_t\|_1, is the first extension to the attention model [2, 5]. Here, \beta is the regularization parameter, a hyperparameter whose value is set before learning. A larger value of the regularization parameter leads to higher sparsity with higher added regularization bias, while a lower value leads to lower sparsity and less regularization bias.

The main goal of this work is to find out which parts of the encoder sequences are most critical in determining and computing any output. The output could be a word, a sentence, or any other subsequence. This goal is especially important in applications such as machine translation, image captioning, sentiment analysis, topic modeling, and predictive modeling such as time series analysis and prediction. To add another layer of regularization, this work imposes an embedding-error penalty on the objective function (usually cross-entropy).
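The sketch below, again purely illustrative, shows how the attention vector W_t could be scored from (H, s_{t-1}, v_{t-1}) with two 1024-unit ReLU layers, normalized by a softmax so its entries sum to one, and penalized with the lasso term \beta \|W_t\|_1. The concatenated scorer input, the per-state scalar score, the random parameters, and the value of \beta are assumptions, not the paper's code.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_weights(H, s_prev, v_prev, params, beta=0.01):
    """Score every encoder state against the current decoder context.

    H      : (n, d) matrix of encoder states h_1..h_n
    s_prev : (d,) previous decoder state s_{t-1}
    v_prev : (d,) previous decoder output v_{t-1}
    params : weights of the two fully connected layers (illustrative sizes)
    beta   : lasso regularization parameter for this sketch
    Returns the attention vector W_t and the penalty beta * ||W_t||_1.
    """
    n = H.shape[0]
    scores = np.empty(n)
    for i in range(n):
        # Concatenate the i-th encoder state with the decoder context.
        z = np.concatenate([H[i], s_prev, v_prev])
        a1 = relu(params["W1"] @ z + params["b1"])   # first 1024-unit layer
        a2 = relu(params["W2"] @ a1 + params["b2"])  # second 1024-unit layer
        scores[i] = params["w_out"] @ a2             # scalar score for state i
    w_t = softmax(scores)                            # entries sum to 1
    l1_penalty = beta * np.abs(w_t).sum()            # beta * ||W_t||_1
    return w_t, l1_penalty

# Usage: 10 encoder states of size 64.
d, n = 64, 10
rng = np.random.default_rng(1)
params = {
    "W1": rng.standard_normal((1024, 3 * d)) * 0.02, "b1": np.zeros(1024),
    "W2": rng.standard_normal((1024, 1024)) * 0.02,  "b2": np.zeros(1024),
    "w_out": rng.standard_normal(1024) * 0.02,
}
w_t, pen = attention_weights(rng.standard_normal((n, d)),
                             rng.standard_normal(d), rng.standard_normal(d),
                             params)
print(w_t.sum(), pen)   # the attention weights sum to 1.0
```

In training, the l1_penalty term would simply be added to the main objective, so that sparser attention vectors are preferred when they explain the output equally well.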
This added penalty also helps to address the "coverage problem" (the frequently observed phenomenon of the network dropping or repeating words, or any other subsequence). The embedding regularization is

\alpha \, \|\text{Embedding Error}\|_2   (6)

The input to any model has to be numeric, and hence the raw input of words or text sequences needs to be transformed into continuous numbers. This is done by using a one-hot encoding of the words and then an embedding, as shown in Fig. 3, where the raw input text of the i-th word is first mapped to its one-hot encoding representation and u_i is the embedding of the i-th input or sequence. Also, \alpha is the regularization parameter. The idea of embedding is that it should preserve word similarities, i.e., words that are synonyms before embedding should remain synonyms after embedding. Using this concept, the scaled embedding error can be written, using the regularization parameter \alpha, as a penalty on the discrepancy between word similarities before and after embedding, where L is the measure or metric of similarity of the word representations. Here, for the similarity measure, both the Euclidean norm and cosine similarity (dissimilarity) have been used; in this work, the embedding error based on the Euclidean norm is used, as in Eq. (10). Alternatively, we could include the embedding error of the output sequence in Eq. (10). When the input sequence (or the dictionary) is too long, to avoid the high computational complexity of computing the similarity of each specific word with all other words, we choose a random (uniform) sample of the input sequences to compute the embedding error. The regularization parameter, \alpha, is computed using cross-validation [26-31]. Alternatively, adaptive regularization parameters [82, 83] could be used.

This model was applied to Wikipedia datasets for English-to-German translation (one-way translation) with 1000 sentences. The idea was to determine which specific input word (in English) is the most important one for the corresponding German translation. The results were often an almost diagonal weight matrix, with few non-zero off-diagonal entries, indicating the significance of the corresponding word(s) in the original language (English). Since the model is an unsupervised approach, it is hard to evaluate its performance without using domain knowledge. The next step in this work would be to develop a unified and interpretable metric for automatic testing and evaluation of the model without using any domain knowledge, and also to apply the model to other applications such as sentiment analysis.
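The exact forms of Eqs. (7)-(10) are not recoverable from this extraction, so the sketch below shows only one plausible reading of the embedding-error penalty under stated assumptions: pairwise Euclidean distances among a uniform random sample of words are compared before and after embedding, the squared discrepancies are accumulated, and the result is scaled by \alpha. The reference representation, the discrepancy form, and all names are assumptions for illustration.

```python
import numpy as np

def embedding_error(ref_vecs, emb_vecs, alpha=0.1, sample_size=32, seed=0):
    """Illustrative embedding-error penalty (one plausible reading of Eq. (6)).

    ref_vecs : (V, p) reference word representations whose similarity
               structure the embedding is supposed to preserve
    emb_vecs : (V, d) learned embeddings u_i for the same V words
    alpha    : regularization parameter (set by cross-validation in the paper)
    A uniform random sample of words keeps the cost well below the full
    V x V comparison mentioned in the text.
    """
    rng = np.random.default_rng(seed)
    V = ref_vecs.shape[0]
    idx = rng.choice(V, size=min(sample_size, V), replace=False)

    err = 0.0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            d_ref = np.linalg.norm(ref_vecs[i] - ref_vecs[j])  # similarity before
            d_emb = np.linalg.norm(emb_vecs[i] - emb_vecs[j])  # similarity after
            err += (d_ref - d_emb) ** 2                        # preservation gap
    return alpha * err

# Usage: 1000-word vocabulary, 50-dim reference space, 64-dim embeddings.
rng = np.random.default_rng(2)
penalty = embedding_error(rng.standard_normal((1000, 50)),
                          rng.standard_normal((1000, 64)))
print(penalty)
```

In training, this scalar would be added, together with the lasso term on the attention weights, to the cross-entropy objective.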
References:
Inverse Problems: Principles and Applications in Geophysics, Technology, and Medicine
Polosukhin: Attention is all you need
Domain adaptation via pseudo in-domain data selection
Multiple object recognition with visual attention
Neural machine translation by jointly learning to align and translate
The dropout learning algorithm
Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop
NESTA: a fast and accurate first-order method for sparse recovery
Learning long-term dependencies with gradient descent is difficult
A neural probabilistic language model
Theano: a CPU and GPU math expression compiler
Audio chord recognition with recurrent neural networks
A singular value thresholding algorithm for matrix completion
Exact matrix completion via convex optimization
Compressive sampling
Long short-term memory-networks for machine reading
Learning phrase representations using RNN encoder-decoder for statistical machine translation
Framewise phoneme classification with bidirectional LSTM and other neural network architectures
Generating sequences with recurrent neural networks
The Elements of Statistical Learning: Data Mining, Inference and Prediction
Handwritten digit recognition via deformable prototypes
'Gene Shaving' as a method for identifying distinct sets of genes with similar expression patterns
Matrix Completion via Iterative Soft-Thresholded SVD
Package 'impute'. CRAN
Multilingual distributed representations without word alignment
Advances in natural language processing
Long short-term memory
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies
Regularization for Applied Inverse and Ill-Posed Problems
Compositional attention networks for machine reasoning
Two case studies in the application of principal component
Principal Component Analysis
Rotation of principal components: choice of normalization constraints
A modified principal component technique based on the LASSO
Recurrent continuous translation models
Statistical Machine Translation
Structured attention networks
Statistical phrase-based translation
Learning phrase representations using RNN encoder-decoder for statistical machine translation
Conditional random fields: probabilistic models for segmenting and labeling sequence data
Neural Networks: Tricks of the Trade
A structured self-attentive sentence embedding
Effective approaches to attention-based neural machine translation
Learning to recognize features of valid textual entailments
Natural logic for textual inference
Encyclopedia of Language & Linguistics
Introduction to information retrieval
The Stanford CoreNLP Natural Language Processing Toolkit. Computer Science, ACL
Computational linguistics and deep learning
Differentiating language usage through topic models
Effective approaches to attention-based neural machine translation
Application of DNN for Modern Data with two Examples: Recommender Systems & User Recognition. Deep Learning Summit
Standardization of featureless variables for machine learning models using natural language processing
Generalized variable conversion using k-means clustering and web scraping
An efficient deep learning model for recommender systems
Effectiveness of Representation Learning for the Analysis of Human Behavior
An evaluation metric for content providing models, recommendation systems, and online campaigns
Combined Loss Function for Deep Convolutional Neural Networks
A Randomized Algorithm for the Selection of Regularization Parameter. Inverse Problem Symposium
A local regularization method using multiple regularization levels
A decomposable attention model
On the difficulty of training recurrent neural networks. In: ICML
On the difficulty of training recurrent neural networks
How to construct deep recurrent neural networks
Fast curvature matrix-vector products for second-order gradient descent
Bidirectional recurrent neural networks
Continuous space translation models for phrase-based statistical machine translation
Continuous space language models for statistical machine translation
Sequence to sequence learning with neural networks
Google's neural machine translation system: bridging the gap between human and machine translation
ADADELTA: an adaptive learning rate method