Monday at a glance

Tuesday-Thursday at a glance

Friday at a glance

Tuesday Papers by Session

Session: Best student ML paper
Location: Hogan
Time: Tuesday, 10:00

Hyperparameter Learning for Conditional Mean Embeddings with Rademacher Complexity Bounds
Kelvin Hsu (University of Sydney); Richard Nock (Data61, CSIRO); Fabio Ramos (U Sydney)
Time: Tuesday, 10.00

Conditional mean embeddings are nonparametric models that encode conditional expectations in a reproducing kernel Hilbert space. While they provide a flexible and powerful framework for probabilistic inference, their performance is highly dependent on the choice of kernel and regularization hyperparameters. Nevertheless, current hyperparameter tuning methods predominantly rely on expensive cross validation or heuristics that is not optimized for the inference task. For conditional mean embeddings with categorical targets and arbitrary inputs, we propose a hyperparameter learning framework based on Rademacher complexity bounds to prevent overfitting by balancing data fit against model complexity. Our approach only requires batch updates, allowing scalable kernel hyperparameter tuning without invoking kernel approximations. Experiments demonstrate that our learning framework outperforms competing methods, and can be further extended to incorporate and learn deep neural network weights to improve generalization.

Session: Learning 1
Location: Hogan
Time: Tuesday, 11:00

High-dimensional Penalty Selection via Minimum Description Length Principle
Kohei Miyaguchi (University of Tokyo), Kenji Yamanishi (University of Tokyo)
Time: Tuesday, 11.00

We tackle the problem of penalty selection of regularization on the basis of the minimum description length~(MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood~(LNML)-minimization approach is favorable, because LNML quantifies the goodness of regularized models with any forms of penalty functions in view of the minimum description length principle, and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: 1) the computation of the normalizing factor of LNML and 2) its minimization in high-dimensional spaces. In this paper, we present a novel regularization selection method~(MDL-RS), in which a tight upper bound of LNML~(uLNML) is minimized with local convergence guarantee. Our main contribution is the derivation of uLNML, which is a uniform-gap upper bound of LNML in an analytic expression. This solves the above challenges in an approximate manner because it allows us to accurately approximate LNML and then efficiently minimize it. The experimental results show that MDL-RS improves the generalization performance of regularized estimates specifically when the model has redundant parameters.

Similarity encoding for learning with dirty categorical variables
Patricio Cerda (Inria, France and Linear Accelerator Laboratory CNRS, France), Gaël Varoquaux (Inria, France and Linear Accelerator Laboratory CNRS, France), Balázs Kégl (Inria, France and Linear Accelerator Laboratory CNRS, France)
Time: Tuesday, 11.20

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. ‘Dirty’ non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.

Learning from Noisy k-ary Preferences
Yuangang Pan (Centre for Artificial Intelligence, University of Technology, Sydney), Bo Han (Centre for Artificial Intelligence, University of Technology, Sydney), Ivor W. Tsan (Centre for Artificial Intelligence, University of Technology, Sydney)
Time: Tuesday, 11.40

The aggregation of k-ary preferences is a novel ranking problem that plays an important role in several aspects of daily life, such as ordinal peer grading, online image-rating, meta-search and online product recommendation. Meanwhile, crowdsourcing is increasingly emerging as a way to provide a plethora of k-ary preferences for these types of ranking problems, due to the convenience of the platforms and the lower costs. However, preferences from crowd workers are often noisy, which inevitably degenerates the reliability of conventional aggregation models. In addition, traditional inferences usually lead to massive computational costs, which limits the scalability of aggregation models. To address both of these challenges, we propose a reliable CrowdsOUrced Plackett-LucE (COUPLE) model combined with an efficient Bayesian learning technique. To ensure reliability, we introduce an uncertainty vector for each crowd worker in COUPLE, which recovers the ground truth of the noisy preferences with a certain probability. Furthermore, we propose an Online Generalized Bayesian Moment Matching (OnlineGBMM) algorithm, which ensures that COUPLE is scalable to large-scale datasets. Comprehensive experiments on four large-scale synthetic datasets and three real-world datasets show that, COUPLE with OnlineGBMM achieves substantial improvements in reliability and noisy worker detection over the well-known approaches.

Axiomatic Characterization of AdaBoost and the Multiplicative Weight Update Procedure
Ibrahim M Alabdulmohsin (Saudi Aramco)
Time: Tuesday, 12.00

AdaBoost was introduced for binary classification tasks by Freund and Schapire in 1995. Ever since its publication, numerous results have been produced, which revealed surprising links between AdaBoost and related fields, such as information geometry, game theory, and convex optimization. This remarkably comprehensive set of connections suggests that adaBoost is a unique approach that may, in fact, arise out of axiomatic principles. In this paper, we prove that this is indeed the case. We show that three natural axioms on adaptive re-weighting and combining algorithms, also called arcing, suffice to construct adaBoost and, more generally, the multiplicative weight update procedure as the unique family of algorithms that meet those axioms. Informally speaking, our three axioms only require that the arcing algorithm satisfies some elementary notions of additivity, objectivity, and utility. We prove that any method that satisfies these axioms must be minimizing the composition of an exponential loss with an additive function, and that the weights must be updated according to the multiplicative weight update procedure. This conclusion holds in the general setting of learning, which encompasses regression, classification, ranking, and clustering.

Hierarchical Active Learning with Proportion Feedback on Regions
Zhipeng Luo (University of Pittsburgh); Milos Hauskrecht (University of Pittsburgh)
Time: Tuesday, 12.20

Learning of classification models in practice often relies on human annotation effort in which humans assign class labels to data instances. As this process can be very time-consuming and costly, finding effective ways to reduce the annotation cost becomes critical for building such models. To solve this problem, instead of soliciting instance-based annotation we explore region-based annotation as the feedback. A region is defined as a hyper-cubic subspace of the input feature space and it covers a subpopulation of data instances that fall into this region. Each region is labeled with a number in [0,1] (in binary classification setting), representing a human estimate of the positive (or negative) class proportion in the subpopulation. To learn a classifier from region-based feedback we develop an active learning framework that hierarchically divides the input space into smaller and smaller regions. In each iteration we split the region with the highest potential to improve the classification models. This iterative process allows us to gradually learn more refined classification models from more specific regions with more accurate proportions. Through experiments on numerous datasets we demonstrate that our approach offers a new and promising active learning direction that can outperform existing active learning approaches especially in situations when labeling budget is limited and small.

Session: Kernel Methods
Location: Hogan Mezz 1
Time: Tuesday, 11:00

Fast and Provably Effective Multi-view Classification with Landmark-based SVM
Valentina Zantedeschi (Jean Monnet University); Emonet Remi (Université de Saint Etienne); Marc Sebban (Jean Monnet University)
Time: Tuesday, 11.00

We introduce a fast and theoretically founded method for learning landmark-based SVMs in a multi-view classification setting which leverages the complementary information of the different views and linearly scales with the size of the dataset. The proposed method — called MVL-SVM — applies a non-linear projection to the dataset through multi-view similarity estimates w.r.t. a small set of randomly selected landmarks, before learning a linear SVM in this latent space joining all the views. Using the uniform stability framework, we prove that our algorithm is robust to slight changes in the training set leading to a generalization bound depending on the number of views and landmarks. We also show that our method can be easily adapted to a missing-view scenario by only reconstructing the similarities to the landmarks. Empirical results, both in complete and missing view settings, highlight the superior performances of our method, in terms of accuracy and execution time, w.r.t. state of the art techniques.

Nyström-SGD: Fast Learning of Kernel-Classifiers with Conditioned Stochastic Gradient Descent
Lukas Pfahler (TU Dortmund University); Katharina Morik (TU Dortmund)
Time: Tuesday, 11.20

Kernel methods are a popular choice for classification problems, but when solving large-scale learning tasks computing the quadratic kernel matrix quickly becomes infeasible. To circumvent this problem, the Nyström method that approximates the kernel matrix using only a smaller sample of the kernel matrix has been proposed. Other techniques to speed up kernel learning include stochastic first order optimization and conditioning. We introduce Nyström-SGD, a learning algorithm that trains kernel classifiers by minimizing a convex loss function with conditioned stochastic gradient descent while exploiting the low-rank structure of a Nyström kernel approximation. Our experiments suggest that the Nyström-SGD enables us to rapidly train high-accuracy classifiers for large-scale classification tasks.

Large-scale Nonlinear Variable Selection via Kernel Random Features
Magda Gregorova (HES-SO); Jason Ramapuram (HES-SO); Alexandros Kalousis (AU Geneva); Stephane Marchand-Maillet (University of Geneva)
Time: Tuesday, 11.40

We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets.

[Add paper to your calendar]  [Download paper] Reproducible Research

Scalable and Interpretable One-class SVMs with Deep Learning and Random Fourier Features
Minh Nghia Nguyen (Queen’s University Belfast); Vien Ngo (Queen’s University Belfast, UK)
Time: Tuesday, 12.00

One-class support vector machine (OC-SVM) for a long time has been one of the most effective anomaly detection methods and extensively adopted in both research as well as industrial applications. The biggest issue for OC-SVM is yet the capability to operate with large and high-dimensional datasets due to optimization complexity. Those problems might be mitigated via dimensionality reduction techniques such as manifold learning or autoencoder. However, previous work often treats representation learning and anomaly prediction separately. In this paper, we propose autoencoder based one-class support vector machine (AE-1SVM) that brings OC-SVM, with the aid of random Fourier features to approximate the radial basis kernel, into deep learning context by combining it with a representation learning architecture and jointly exploit stochastic gradient descent to obtain end-to-end training. Interestingly, this also opens up the possible use of gradient-based attribution methods to explain the decision making for anomaly detection, which has ever been challenging as a result of the implicit mappings between the input space and the kernel space. To the best of our knowledge, this is the first work to study the interpretability of deep learning in anomaly detection. We evaluate our method on a wide range of unsupervised anomaly detection tasks in which our end-to-end training architecture achieves a performance significantly better than the previous work using separate training.

Frame-based Optimal Design
Sebastian Mair (Leuphana University); Yannick Rudolph (Leuphana University); Vanessa Closius (Leuphana University); Ulf Brefeld (Leuphana)
Time: Tuesday, 12.20

Optimal experimental design (OED) addresses the problem of selecting an optimal subset of the training data for learning tasks. In this paper, we propose to efficiently compute OED by leveraging the geometry of data: We restrict computations to the set of instances lying on the border of the convex hull of all data points. This set is called the frame. We (i) provide the theoretical basis for our approach and (ii) show how to compute the frame in kernel-induced feature spaces. The latter allows us to sample optimal designs for non-linear hypothesis functions without knowing the explicit feature mapping. We present empirical results showing that the performance of frame-based OED is often on par or better than traditional OED approaches, but its solution can be computed up to twenty times faster.

Session: Graphs 1
Location: Hogan Mezz 2
Time: Tuesday, 11:00

Risk-Averse Matchings over Uncertain Graph Databases
Charalampos Tsourakakis (Boston University); Shreyas Sekar (Harvard University); Johnson Lam (Boston University); Liu Yang (Yale University)
Time: Tuesday, 11.00

A large number of applications such as querying sensor networks, and analyzing protein-protein interaction (PPI) networks, rely on mining uncertain graph and hypergraph databases. In this work we study the following problem: Given an uncertain, weighted (hyper)graph, how can we efficiently find a (hyper)matching with high expected reward, and low risk? This problem naturally arises in the context of several important applications, such as online dating, kidney exchanges, and team formation. We introduce a novel formulation for finding matchings with maximum expected reward and bounded risk under a general model of uncertain weighted (hyper)graphs that we introduce in this work. Our model generalizes probabilistic models used in prior work, and captures both continuous and discrete probability distributions, thus allowing to handle privacy-type applications that inject appropriately distributed noise to (hyper)edge weights. Given that our optimization problem is NP-hard, we turn our attention to designing efficient approximation algorithms. For the case of uncertain weighted graphs, we provide a 1/3-approximation algorithm, and a 1/5-approximation algorithm with near optimal runtime. For the case of uncertain weighted hypergraphs, we provide a Omega(1/k)-approximation algorithm, where k is the rank of the hypergraph (i.e., any hyperedge includes at most k nodes), that runs in almost (modulo log factors) linear time. We complement our theoretical results by testing our approximation algorithms on a wide variety of synthetic experiments, where we observe in a controlled setting interesting findings on the trade-off between reward, and risk. We also provide an application of our formulation for providing recommendations of teams that are likely to collaborate, and have high impact.

[Add paper to your calendar]  [Download paper] Reproducible Research

ONE-M: Modeling the Co-evolution of Opinions and Network Connections
Aastha Nigam (University of Notre Dame); Kijung Shin (Carnegie Mellon University); Ashwin Bahulkar (Rensselaer Polytechnic Institute); Bryan Hooi (Carnegie Mellon University); David Hachen (University of Notre Dame); Boleslaw Szymanski (Rensselaer Polytechnic Institute); Christos Faloutsos (CMU, Pittsburgh); Nitesh Chawla (Notre Dame)
Time: Tuesday, 11.20

How do opinions of individuals on controversial issues such as marijuana and gay marriage and their underlying social network connections evolve over time? Do people alter their network to have more like-minded friends or do they change their own opinions? Does the society eventually develop echo chambers? In this paper, we study dynamically evolving networks and changing user opinions to answer these questions. Our contributions are as follows: (a) Discovering Evolution of Polarization in Networks: We present evidence of growing divide among users based on their opinions who eventually form homophilic groups (b) Studying Opinion and Network Co-Evolution: We present observations of how individuals change opinions and position themselves in dynamically changing networks (c) Forecasting Persistence and Change in Opinions and Network: We propose ONE-M to forecast individual beliefs and persistence or dissolution of social ties. Using a unique real-world network dataset including periodic user surveys, we show that ONE-M performs with high accuracy, while outperforming the baseline approaches.

[Add paper to your calendar]  [Download paper] Reproducible Research

Efficiently Summarizing Attributed Diffusion Networks
Sorour E. Amiri (Virginia Tech), Liangzhe Chen (Virginia Tech), B. Aditya Prakash (Virginia Tech)
Time: Tuesday, 11.40

Given a large attributed social network, can we find a compact, diffusion-equivalent representation while keeping the attribute properties? Diffusion networks with user attributes such as friendship, email communication, and people contact networks are increasingly common-place in the real-world. However, analyzing them is challenging due to their large size. In this paper, we first formally formulate a novel problem of summarizing an attributed diffusion graph to preserve its attributes and influence- based properties. Next, we propose ANeTS, an effective sub-quadratic parallelizable algorithm to solve this problem: it finds the best set of candidate nodes and merges them to construct a smaller network of `super-nodes’ preserving the desired properties.Extensive experiments on diverse real-world datasets show that ANeTS outperforms all state-of-the-art baselines (some of which do not even finish in 14 days). Finally, we show how ANeTS helps in multiple applications such as Topic-Aware viral marketing and sense-making of diverse graphs from different domains.

Dynamic Graph Summarization: A Tensor Decomposition Approach
Sofia Fernandes (University of Porto), Hadi Fanaee-T (University of Oslo), João Gama (University of Porto)
Time: Tuesday, 12.00

Due to the scale and complexity of todays’ social networks, it becomes infeasible to mine them with traditional approaches. A possible solution to reduce such scale and complexity is to produce a compact (lossy) version of the network that represents its major properties. This task is known as graph summarization, which is the subject of this research. Our focus is on time-evolving graphs, a more complex scenario where the dynamics of the network also should be taken into account. We address this problem using tensor decomposition, which enables us to capture the multi-way structure of the time-evolving network. This property is unique and is impossible to obtain with other approaches such as matrix factorization. Experimental evaluation on five real world networks implies promising results demonstrating that tensor decomposition is quite useful for summarizing dynamic networks.

Social Centrality using Network Hierarchy and Community Structure
Rakhi Saxena (University of Delhi), Sharanjit Kaur (University of Delhi), Vasudha Bhatnagar (University of Delhi)
Time: Tuesday, 12.20

Several centrality measures have been formulated to quantify the notion of ‘importance’ of actors in social networks. Current measures scrutinize either local or global connectivity of the nodes and have been found to be inadequate for social networks. Ignoring hierarchy and community structure, which are inherent in all human social networks, is the primary cause of this inadequacy. Positional hierarchy and embeddedness of an actor in the community are intuitively crucial determinants of his importance.The theory of social capital asserts that an actor’s importance is derived from his position in network hierarchy as well as from the potential to mobilize resources through intra-community (bonding) and inter-community (bridging) ties. Inspired by this idea, we propose a novel centrality measure SC (Social Centrality) for actors in social networks. Our measure accounts for – i) an individual’s propensity to socialize, and ii) his connections within and outside the community. These two factors are suitably aggregated to produce social centrality score.Comparative analysis of SC measure with classical and recent centrality measures using large public networks shows that it consistently produces more realistic ranking of nodes. The inference is based on the available ground truth for each tested networks. Extensive analysis of rankings delivered by SC measure and mapping with known facts in well-studied networks justifies its effectiveness in diverse social networks. Scalability evaluation of SC measure justifies its efficacy for real-world large networks.

Session: Deep Learning 1
Location: Davin
Time: Tuesday, 11:00

Deep Gaussian Process Autoencoders for Novelty Detection
Rémi Domingues (EURECOM France), Pietro Michiardi (EURECOM France), Jihane Zouaoui (Amadeus France), Maurizio Filippone (EURECOM France)
Time: Tuesday, 11.00

Novelty detection is one of the classic problems in Machine Learning that has applications across several domains. This paper proposes a novel autoencoder based on Deep Gaussian Processes for novelty detection tasks. The learning of the proposed model is made tractable and scalable through the use of random feature approximations and stochastic variational inference. The result is a flexible model that is easy to implement and train, and can be applied to general novelty detection tasks, including large-scale problems and data with mixed-type features. The experiments indicate that the proposed model achieves competitive results with state-of-the-art novelty detection methods.

Efficient Decentralized Deep Learning by Dynamic Model Averaging
Michael Kamp (Fraunhofer IAIS); Linara Adilova (Fraun); Joachim Sicking (Fraunhofer IAIS); Fabian Hüger (Volkswagen Group Research); Peter Schlicht (Volkswagen Group Research); Tim Wirtz (Fraunhofer IAIS); Stefan Wrobel (Fraunhofer IAIS)
Time: Tuesday, 11.20

We propose an efficient protocol for decentralized training of deep neural networks from distributed data sources. The proposed protocol allows to handle different phases of model training equally well and to quickly adapt to concept drifts. This leads to a reduction of communication by an order of magnitude compared to periodically communicating state-of-the-art approaches. Moreover, we derive a communication bound that scales well with the hardness of the serialized learning problem. The reduction in communication comes at almost no cost, as the predictive performance remains virtually unchanged. Indeed, the proposed protocol retains loss bounds of periodically averaging schemes. An extensive empirical evaluation validates major improvement of the trade-off between model performance and communication which could be beneficial for numerous decentralized learning applications, such as autonomous driving, or voice recognition and image classification on mobile phones.

[Add paper to your calendar]  [Download paper] Reproducible Research

Cooperative Multi-Agent Policy Gradient
Guillaume Bono (INSA Lyon, INRIA); Jilles Dibangoye (INSA Lyon, INRIA); Laetitia Matignon (INSA Lyon, INRIA, Univ’ Lyon 1, CNRS); Florian Pereyron (Volvo Group); Olivier Simonin (INSA Lyon)
Time: Tuesday, 11.40

Reinforcement Learning (RL) for Cooperative Multi-Agent Systems is lagging behind the spectacular breakthroughs of single-agent RL, specifically deep-RL. That is mainly due to the fact that assumptions that hold in single-agent settings are often obsolete in cooperative multi-agent systems. To tackle this issue, we investigate cooperative multi-agent policy gradient algorithms. Methods of this family hold in a rather new paradigm in which learning can be accomplished in a centralized manner while execution can still be independent. In other words, while preserving the decentralized control, one can still estimate gradients from collective experiences guided by a centralized action-value function. Using this insight, we introduce cooperative multi-agent actor-critic methods and exhibit sufficient conditions for the convergence to non-trivial local optima. We evaluate our proposition of decentralized actor-critic compared to other algorithms borrowed from more classical paradigms, including fully centralized RL and independent learners.

Domain Adaption in One-shot Learning
Nanqing Dong (Cornell University); Eric Xing (Petuum Inc.)
Time: Tuesday, 12.00

Recent advances in deep learning lead to breakthroughs in many machine learning tasks. Due to the data-driven nature of deep learning, the training procedure often requires large amounts of manually annotated data, which is often unavailable. One-shot learning aims to categorize the new classes unseen in the training set, given only one example of each new class. Can we transfer knowledge learned by one-shot learning from one domain to another? In this paper, we formulate the problem of domain adaption in one-shot image classification, where the training data and test data come from similar but different distribution. We propose a domain adaption framework based on adversarial networks. This framework is generalized for situations where the source and target domain have different labels. We use a policy network, inspired by human learning behaviors, to effectively select samples from the source domain in the training process. This sampling strategy can further improve the domain adaption performance. We investigate our approach in one-shot image classification tasks on different settings and achieve better results than previous methods.

Towards Efficient Forward Propagation on Resource-Constrained Systems
Günther Schindler (Heidelberg University); Matthias Zöhrer (TU Graz); Franz Pernkopf (Graz University of Technology); Holger Froening (University of Heidelberg)
Time: Tuesday, 12.20

In this work we present key elements of DeepChip, a framework that bridges recent trends in machine learning with applicable forward propagation on resource-constrained devices. Main objective of this work is to reduce computational and memory requirements by removing redundancy from neural networks. DeepChip features a flexible quantizer to reduce the bit width of activations to 8-bit fixed-point and weights to an asymmetric ternary representation. In combination with novel algorithms and data compression we leverage reduced precision and sparsity for efficient forward propagation on a wide range of processor architectures. We validate our approach on a set of different convolutional neural networks and datasets: ConvNet on SVHN, ResNet-44 on CIFAR10 and AlexNet on ImageNet. Compared to single-precision floating point, memory requirements can be compressed by a factor of 43, 22 and 10 and computations accelerated by a factor of 5.2, 2.8 and 2.0 on a mobile processor without a loss in classification accuracy. DeepChip allows trading accuracy for efficiency, and for instance tolerating about 2% loss in classification accuracy further reduces memory requirements by a factor of 88, 29 and 13, and speeds up computations by a factor of 6.0, 4.3 and 5.0.

Session: ADS Financial / Security
Location: Nally
Time: Tuesday, 11:00

Using Reinforcement Learning to Conceal Honeypot Functionality
Seamus Dowling (Galway Mayo Institute of Technology)
Time: Tuesday, 11.00

Automated malware employ honeypot detecting mechanisms within their code. Once honeypot functionality has been exposed, malware such as botnets will cease the attempted compromise. Subsequent malware variants employ techniques to evade detection by known honeypots. This reduces the potential size of a captured dataset, which is used for malware behavioural analysis. This paper presents findings on the deployment of a honeypot using reinforcement learning, to conceal functionality. The adaptive honeypot learns the best responses to overcome initial detection attempts, by implementing a reward function with the goal of maximising the number of commands used in an attack. The paper demonstrates that the honeypot quickly identifies the best responses to overcome initial detection and subsequently increases attack command transitions. It also examines the structure of a captured botnet and charts the learning evolution of the honeypot, for repetitive automated malware. Finally it suggests changes to an existing taxonomy governing honeypot development, based on the learning evolution of the adaptive honeypot.

[Add paper to your calendar]  [Download paper] Reproducible Research

Flexible Inference for Cyberbully Incident Detection
Haoti Zhong (Pennsylvania State University); David Miller (Penn State University); Anna Cinzia Squicciarini (Pennsylvania State University)
Time: Tuesday, 11.20

We study detection of cyberbully and cyberaggression incidents in online social networks, focusing on session level analysis. We propose several variants of a customized convolutional neural net (CNN) approach, which processes users’ comments largely independently in the front-end layers, while also accounting for possible conversational patterns. The front-end layer’s outputs are then combined by one of our designed output layers — namely by either a max layer or by a novel sorting layer, proposed by us. Our CNN models outperform existing baselines and are able to achieve classification accuracy of up to 84.29% for cyberbullying and 83.08% for cyberaggression.

Solving the “false positives” problem in fraud prediction using automated feature engineering
James M Kanter (Feature Labs, Inc); Kalyan Veeramachaneni (MIT); Roy Wedge (Feature Labs); Santiago Mora; (BBVA); Sergio Iglesias Perez (BBVA)
Time: Tuesday, 11.40

In this paper, we present an automated feature engineering based approach to dramatically reduce false positives in fraud prediction. False positives plague the fraud prediction industry. It is estimated that only 1 in 5 declared as fraud are actually fraud and roughly 1 in every 6 customers have had a valid transaction declined in the past year. To address this problem, we use the Deep Feature Synthesis algorithm to automatically derive behavioral features based on the historical data of the card associated with a transaction. We generate 237 features (>100 behavioral patterns) for each transaction, and use a random forest to learn a classifier. We tested our machine learning model on data from a large multinational bank and compared it to their existing solution. On an unseen data of 1.852 million transactions, we were able to reduce the false positives by 54% and provide a savings of 190K euros. We also assess how to deploy this solution, and whether it necessitates streaming computation for real time scoring. We found that our solution can maintain similar benefits even when historical features are computed once every 7 days.

Learning Tensor-based Representations from Brain-Computer Interface Data for Cybersecurity
Md Lutfor Rahman (UNIVERSITY OF CALIFORNIA RIVERSIDE); Sharmistha Bardhan (University of California, Riverside); Ajaya Neupane (UNIVERSITY OF CALIFORNIA RIVERSIDE); Evangelos Papalexakis (UC Riverside); Chengyu Song (UNIVERSITY OF CALIFORNIA RIVERSIDE)
Time: Tuesday, 12.00

Understanding, modeling, and explaining neural data is a challenging task. In this paper, we learn tensor-based representations of electroencephalography (EEG) data to classify and analyze the underlying neural patterns related to phishing detection tasks. Specifically, we conduct a phishing detection experiment to collect the data, and apply tensor factorization to it for feature extraction and interpretation. Traditional feature extraction techniques, like power spectral density, autoregressive models, and Fast Fourier transform, can only represent data either in spatial or temporal dimension; however, our tensor modeling leverages both spatial and temporal traits in the input data. We perform a comprehensive analysis of the neural data and show the practicality of multi-way neural data analysis. We demonstrate that using tensor-based representations, we can classify real and phishing websites with accuracy as high as 97%, which outperforms state-of-the-art approaches in the same task by 21%. Furthermore, the extracted latent factors are interpretable, and provide insights with respect to the brain’s response to real and phishing websites.

Uncertainty Modelling in Deep Networks: Forecasting Short and Noisy Series
Axel Brando (BBVA Data & Analytics and Universitat de Barcelona)
Time: Tuesday, 12.20

Deep Learning is a consolidated, state-of-the-art Machine Learning tool to fit a function when provided with large data sets of examples. However, in regression tasks, the straightforward application of Deep Learning models provides a point estimate of the target. In addition, the model does not take into account the uncertainty of a prediction. This represents a great limitation for tasks where communicating an erroneous prediction carries a risk. In this paper we tackle a real-world problem of forecasting impending financial expenses and incomings of customers, while displaying predictable monetary amounts on a mobile app. In this context, we investigate if we would obtain an advantage by applying Deep Learning models with a Heteroscedastic model of the variance of a network’s output. Experimentally, we achieve a higher accuracy than non-trivial baselines. More importantly, we introduce a mechanism to discard low-confidence predictions, which means that they will not be visible to users. This should help enhance the user experience of our product.

Session: Learning 2
Location: Hogan
Time: Tuesday, 14:00

On the Effectiveness of Heuristics for Learning Nested Dichotomies
Vitalik Melnikov (Paderborn University), Eyke Hüllermeier (Paderborn University)
Time: Tuesday, 14.00

In machine learning, so-called nested dichotomies are utilized as a reduction technique, i.e., to decompose a multi-class classification problem into a set of binary problems, which are solved using a simple binary classifier as a base learner. The performance of the (multi-class) classifier thus produced strongly depends on the structure of the decomposition. In this paper, we conduct an empirical study, in which we compare existing heuristics for selecting a suitable structure in the form of a nested dichotomy. Moreover, we propose two additional heuristics as natural completions. One of them is the Best-of-K heuristic, which picks the (presumably) best among K randomly generated nested dichotomies. Surprisingly, and in spite of its simplicity, it turns out to outperform the state of the art.

Local Contrast as An Effective Means to Robust Clustering Against Varying Densities
Bo Chen (Monash University), Kai Ming Ting (Monash University and Federation University Australia), Takashi Washio (Osaka University), Ye Zhu (Deakin University)
Time: Tuesday, 14.20

Most density-based clustering methods have difficulties detecting clusters of hugely different densities in a dataset. A recent density-based clustering CFSFDP appears to have mitigated the issue. However, through formalising the condition under which it fails, we reveal that CFSFDP still has the same issue. To address this issue, we propose a new measure called Local Contrast, as an alternative to density, to find cluster centers and detect clusters. We then apply Local Contrast to CFSFDP, and create a new clustering method called LC-CFSFDP which is robust in the presence of varying densities. Our empirical evaluation shows that LC-CFSFDP outperforms CFSFDP and three other state-of-the-art variants of CFSFDP.

AWX: An Integrated Approach to Hierarchical-Multilabel Classification
Luca Masera (University of Trento); Enrico Blanzieri (University of Trento)
Time: Tuesday, 14.40

The recent outbreak of works on artificial neural networks (ANNs) has reshaped the machine learning scenario. Despite the vast literature, there is still a lack of methods able to tackle the hierarchical multilabel classification (HMC) task exploiting entirely ANNs. Here we propose AWX, a novel approach that aims to fill this gap. AWX is a versatile component that can be used as output layer of any ANN, whenever a fixed structured output is required, as in the case of HMC. AWX exploits the prior knowledge on the output domain embedding the hierarchical structure directly in the network topology. The information flows from the leaf terms to the inner ones allowing a jointly optimization of the predictions. Different options to combine the signals received from the leaves are proposed and discussed. Moreover, we propose a generalization of the true path rule to the continuous domain and we demonstrate that AWX’s predictions are guaranteed to be consistent with respect to it. Finally, the proposed method is evaluated on 10 benchmark datasets and shows a significant increase in the performance over plain ANN, HMC-LMLP, and the state-of-the-method Clus-HMC.

[Add paper to your calendar]  [Download paper] Reproducible Research

Modular Dimensionality Reduction
Henry W J Reeve (University of Birmingham); Tingting Mu (University of Manchester); Gavin Brown (The University Of Manchester)
Time: Tuesday, 15.00

We introduce an approach to modular dimensionality reduction, allowing efficient learning of multiple complementary representations of the same object. Modules are trained by optimising an unsupervised cost function which balances two competing goals: Maintaining the inner product structure within the original space, and encouraging structural diversity between complementary representations. We derive an efficient learning algorithm which outperforms gradient based approaches without the need to choose a learning rate. We also demonstrate an intriguing connection with Dropout. Empirical results demonstrate the efficacy of the method for image retrieval and classification.

Robust Super-Level Set Estimation using Gaussian Processes
Andrea Zanette (Stanford University); Junzi Zhang (Stanford University); Mykel Kochenderfer (Stanford University)
Time: Tuesday, 15.20

This paper focuses on the problem of determining as large a region as possible where a function exceeds a given threshold with high probability. We assume we only have access to a noise-corrupted version of the function and that function evaluations are costly. To select the next query point, we propose maximizing the expected increase in the area identified as above the threshold as predicted by a Gaussian process, robustified by a variance term. We also give asymptotic guarantees on the algorithm. Therefore, our approach can outperform existing techniques in the literature while enjoying some convergence guarantee.

Session: Classification
Location: Hogan Mezz 1
Time: Tuesday, 14:00

Multiple Instance Learning with Bag-level Randomized Trees
Tomas Komarek (Czech Technical University in Prague); Petr Somol (Czech Academy of Sciences)
Time: Tuesday, 14.00

Knowledge discovery in databases with a flexible structure poses a great challenge to machine learning community. Multiple Instance Learning (MIL) aims at learning from samples (called bags) represented by multiple feature vectors (called instances) as opposed to single feature vectors characteristic for the traditional data representation. This relaxation turns out to be useful in formulating many machine learning problems including classification of molecules, cancer detection from tissue images or identification of malicious network communications. However, despite the recent progress in this area, the current set of MIL tools still seems to be very application specific and/or burdened with many tuning parameters or processing steps. In this paper, we propose a simple, yet effective tree-based algorithm for solving MIL classification problems. Empirical evaluation against 28 classifiers on 29 publicly available benchmark datasets shows a high level performance of the proposed solution even with its default parameter settings.

One-class Quantification
Denis M. Reis (Universidade de São Paulo); André G. Maletzke (Universidade de São Paulo); Everton Cherman (Universidade de São paulo); Gustavo Batista (USP)
Time: Tuesday, 14.20

This paper proposes one-class quantification, a new Machine Learning task. Quantification estimates the class distribution of an unlabeled sample of instances. Similarly to one-class classification, we assume that only a sample of examples of a single class is available for learning, and we are interested in counting the cases of such class in a test set. We formulate, for the first time, one-class quantification methods and assess them in a comprehensible open-set evaluation. In an open-set problem, several ‘subclasses” represent the negative class, and we cannot assume to have enough observations for all of them at training time. Therefore, new classes may appear after deployment, making this a challenging setup for existing quantification methods. We show that our proposals are simple and more accurate than the state-of-the-art in quantification. Finally, the approaches are very efficient, fitting batch and stream applications.

[Add paper to your calendar]  [Download paper] Reproducible Research

Ordinal Label Proportions
Rafael Poyiadzi (University of Bristol); Raul Santos Rodriguez (University of Bristol); Tijl De Bie (Ghent University)
Time: Tuesday, 14.40

In Machine Learning, it is common to distinguish different degrees of supervision, ranging from fully supervised to completely unsupervised scenarios. However, lying in between those, the Learning from Label Proportions (LLP) setting \cite{yu2014learning} assumes the training data is provided in the form of bags, and the only supervision comes through the proportion of each class in each bag. In this paper, we present a novel version of the LLP paradigm where the relationship among the classes is ordinal. While this is a highly relevant scenario (e.g. customer surveys where the results can be divided into various degrees of satisfaction), it is as yet unexplored in the literature. We refer to this setting as Ordinal Label Proportions (OLP). We formally define the scenario and suggest an efficient algorithm to tackle it. We test our algorithm on synthetic and benchmark datasets. Additionally, we present a case study examining the data gathered from the Research Excellence Framework that assesses the quality of research in the United Kingdom.

Learning from binary labels with instance-dependent noise
Aditya Krishna Menon (Australian National University), Brendan van Rooyen (Australian National University), Nagarajan Natarajan (Microsoft Research)
Time: Tuesday, 15.00

Supervised learning has seen numerous theoretical and practical advances over the last few decades. However, its basic assumption of identical train and test distributions often fails to hold in practice. One important example of this is when the training instances are subject to label noise: that is, where the observed labels do not accurately reflect the underlying ground truth. While the impact of simple noise models has been extensively studied, relatively less attention has been paid to the practically relevant setting of instance-dependent label noise. It is thus unclear whether one can learn, both in theory and in practice, good models from data subject to such noise, with no access to clean labels.We provide a theoretical analysis of this issue, with three contributions. First, we prove that for instance-dependent noise, any algorithm that is consistent for classification on the noisy distribution is also consistent on the noise-free distribution. Second, we prove that consistency also holds for the area under the ROC curve, assuming the noise scales (in a precise sense) with the inherent difficulty of an instance. Third, we show that the Isotron algorithm can efficiently and provably learn from noisy samples when the noise-free distribution is a generalised linear model. We empirically confirm our theoretical findings, which we hope may stimulate further analysis of this important learning setting.

Output Fisher Embedding Regression
Moussab Djerrab (Télécom ParisTech), Alexandre Garcia (Télécom ParisTech), Maxime Sangnier (Sorbonne Universités), Florence d’Alche–Buc (Télécom ParisTech)
Time: Tuesday, 15.20

We investigate the use of Fisher vector representations in the output space in the context of structured and multiple output prediction. A novel, general and versatile method called output Fisher embedding regression is introduced. Based on a probabilistic modeling of training output data and the minimization of a Fisher loss, it requires to solve a pre-image problem in the prediction phase. For Gaussian Mixture Models and State-Space Models, we show that the pre-image problem enjoys a closed-form solution with an appropriate choice of the embedding. Numerical experiments on a wide variety of tasks (time series prediction, multi-output regression and multi-class classification) highlight the relevance of the approach for learning under limited supervision like learning with a handful of data per label and weakly supervised learning.

Session: Graphs 2
Location: Hogan Mezz 2
Time: Tuesday, 14:00

Using Core-Periphery Structure to Predict High Centrality Nodes in Time-Varying Networks
Soumya Sarkar (Indian Institute of Technology Kharagpur), Sandipan Sikdar (Indian Institute of Technology Kharagpur), Sanjukta Bhowmick (University of Nebraska), Animesh Mukherjee (Indian Institute of Technology Kharagpur)
Time: Tuesday, 14.00

Vertices with high betweenness and closeness centrality represent in-fluential entities in a network. An important problem for time varying networks isto know a-priori, using minimal computation, whether the influential vertices ofthe current time step will retain their high centrality, in the future time steps, asthe network evolves.In this paper, based on empirical evidences from several large real world time varying networks, we discover a certain class of networks where the highly central vertices are part of the innermost core of the network and this property is maintained over time. As a key contribution of this work, we propose novel heuristics toidentify these networks in an optimal fashion and also develop a two-step algorithm for predicting high centrality vertices. Consequently, we show for the first time that for such networks, expensive shortest path computations in each timestep as the network changes can be completely avoided; instead we can use time series models (e.g., ARIMA as used here) to predict the overlap between the high centrality vertices in the current time step to the ones in the future time steps. Moreover,once the new network is available in time, we can find the high centrality verticesin the top core simply based on their high degree.To measure the effectiveness of our framework, we perform prediction task ona large set of diverse time-varying networks. We obtain F1-scores as high as 0.81 and 0.72 in predicting the top k closeness and betweenness centrality vertices respectively for networks where the highly central vertices mostly reside in the innermost core. We validate our results by showing that the practical effects of our predicted vertices match the effects of the actual high centrality vertices. Finally, we also provide a formal sketch demonstrating why our method works.

Think before You Discard: Accurate Triangle Counting in Graph Streams with Deletions
Kijung Shin (Carnegie Mellon University); Jisu Kim (Carnegie Mellon University); Bryan Hooi (Carnegie Mellon University); Christos Faloutsos (CMU, Pittsburgh)
Time: Tuesday, 14.20

Given a stream of edge additions and deletions, how can we estimate the count of triangles in it? If we can store only a subset of the edges, how can we obtain unbiased estimates with small variances? Counting triangles (i.e., cliques of size three) in a graph is a classical problem with applications in a wide range of research areas, including social network analysis, data mining, and databases. Recently, streaming algorithms for triangle counting have been extensively studied since they can naturally be used for large dynamic graphs. However, existing algorithms cannot handle edge deletions or suer from low accuracy. Can we handle edge deletions while achieving high accuracy? We propose ThinkD, which accurately estimates the counts of global triangles (i.e., all triangles) and local triangles associated with each node in a fully dynamic graph stream with edge additions and deletions. Compared to its best competitors, ThinkD is (a) Accurate: up to 4.3X_x0002_ more accurate within the same memory budget, (b) Fast: up to 2.2X_x0002_ faster for the same accuracy requirements, and (c) Theoretically sound: always maintaining unbiased estimates with small variances.

[Add paper to your calendar]  [Download paper] Reproducible Research

Dynamic hierarchies in temporal directed networks
Nikolaj Tatti (Aalto University)
Time: Tuesday, 14.40

The outcome of interactions in many real-world systems can be often explained by a hierarchy between the participants. Discovering hierarchy from a given directed network can be formulated as follows: partition vertices into levels such that, ideally, there are only forward edges, that is, edges from upper levels to lower levels. In practice, the ideal case is impossible, so instead we minimize some penalty function on the backward edges. One practical option for such a penalty is agony, where the penalty depends on the severity of the violation. In this paper we extend the definition of agony to temporal networks. In this setup we are given a directed network with time stamped edges, and we allow the rank assignment to vary over time. We propose 2 strategies for controlling the variation of individual ranks. In our first variant, we penalize the fluctuation of the rankings over time by adding a penalty directly to the optimization function. In our second variant we allow the rank change at most once. We show that the first variant can be solved exactly in polynomial time while the second variant is NP-hard, and in fact inapproximable. However, we develop an iterative method, where we first fix the change point and optimize the ranks, and then fix the ranks and optimize the change points, and reiterate until convergence. We show empirically that the algorithms are reasonably fast in practice, and that the obtained rankings are sensible.

[Add paper to your calendar]  [Download paper] Reproducible Research

Temporally Evolving Community Detection and Prediction in Content-Centric Networks
Ana Paula Appel (IBM); Renato Cunha (IBM); Marcela Megumi Terakado (IME – USP); Charu Aggarwal (IBM)
Time: Tuesday, 15.00

In this work, we consider the problem of combining link, content and temporal analysis for community detection and prediction in evolving networks. Such temporal and content-rich networks occur in many real-life settings, such as bibliographic networks and question answering forums. Most of the work in the literature (that uses both content and structure) deals with static snapshots of networks, and they do not reflect the dynamic changes occurring over multiple snapshots. Incorporating dynamic changes in the communities into the analysis can also provide useful insights about the changes in the network such as the migration of authors across communities. In this work, we propose a shared factorization model that can simultaneously account for graph links, content, and temporal analysis. This approach works by extracting the latent semantic structure of the network in multidimensional form, but in a way that takes into account the temporal continuity of these embeddings. Such an approach simplifies temporal analysis of the under- lying network by using the embedding as a surrogate. A consequence of this simplification is that it is also possible to use this temporal sequence of embeddings to predict future communities. We present experimental results illustrating the effectiveness of the approach.

Local topological data analysis to uncover the global structure of data approaching graph-structured topologies
Robin Vandaele (Ghent University); Tijl De Bie (Ghent University); Yvan Saeys (Ghent University – VIB)
Time: Tuesday, 15.20

Gene expression data of differentiating cells, galaxies distributed in space, and earthquake locations, all share a common property: they lie close to a graph-structured topology in their respective spaces, referred to as one-dimensional stratified spaces in mathematics. Often, the uncovering of such topologies offers great insight into these data sets. However, methods for dimensionality reduction are clearly inappropriate for this purpose, and also methods from the relatively new field of Topological Data Analysis (TDA) are inappropriate, due to noise sensitivity, computational complexity, or other limitations. In this paper we introduce a new method, termed Local TDA (LTDA), which resolves the issues of pre-existing methods by unveiling (global) graph-structured topologies in data by means of robust and computationally cheap local analyses. Our method rests on a simple graph-theoretic result that enables one to identify isolated, end-, edge- and multifurcation points in the topology underlying the data. It then uses this information to piece together a graph that is homeomorphic to the unknown one-dimensional stratified space underlying the point cloud data. We evaluate our method on a number of artificial and real-life data sets, demonstrating its superior effectiveness, robustness against noise, and scalability.

Session: Deep Learning 2
Location: Davin
Time: Tuesday, 14:00

Optimizing Non-decomposable Measures with Deep Networks
Amartya Sanyal (University of Oxford and The Alan Turing Institute), Pawan Kumar (Indian Institute of Technology Kanpur), Purushottam Kar (Indian Institute of Technology Kanpur), Sanjay Chawla (Qatar Computing Research Institute), Fabrizio Sebastiani (ISTI-CNR Pisa)
Time: Tuesday, 14.00

We present a class of algorithms capable of directly training deep neural networks with respect to popular families of task-specific performance measures for binary classification such as the F-measure, QMean and the Kullback-Leibler divergence that are structured and non-decomposable. Our goal is to address tasks such as label- imbalanced learning and quantification. Our techniques present a departure from standard deep learning techniques that typically use squared or cross-entropy loss functions (that are decomposable) to train neural networks. We demonstrate that directly training with task-specific loss functions yields faster and more stable convergence across problems and datasets. Our proposed algorithms and implementations offer several advantages including (i) use of fewer training samples to achieve a desired level of convergence, (ii) a substantial reduction in training time, (iii) a seamless integration of our implementation into existing symbolic gradient frameworks, and (iv) assurance of convergence to first order stationary points. It is noteworthy that the algorithms achieve this, especially point (iv) despite being asked to optimize complex objective functions. We implement our techniques on a variety of deep architectures including multi-layer perceptrons and recurrent neural networks and show that on a variety of benchmark and real data sets, our algorithms outperform traditional approaches to training deep networks, as well as popular techniques used to handle label imbalance.

Deep F-measure Maximization in Multi-label Classification: A Comparative Study
Stijn Decubber (ML6); Thomas Mortier (Ghent University); Krzysztof J Dembczyński (Poznan University of Technology); Willem Waegeman (Universiteit Gent)
Time: Tuesday, 14.20

In recent years several novel algorithms have been developed for maximizing the instance-wise $F_\beta$-measure in multi-label classification problems. However, so far, such algorithms have only been tested in tandem with shallow base learners. In the deep learning landscape, usually simple thresholding approaches are implemented, even though it is expected that such approaches are suboptimal. In this article we introduce extensions of utility maximization and decision-theoretic methods that can optimize the $F_\beta$-measure with (convolutional) neural networks. We discuss pros and cons of the different methods and we present experimental results on several image classification datasets. The results illustrate that decision-theoretic inference algorithms are worth the investment. While being more difficult to implement compared to thresholding strategies, they lead to a superior predictive performance. Overall, a decision-theoretic inference algorithm based on proportional odds models outperforms the other methods.

Privacy Preserving Synthetic Data Release Using Deep Learning
Nazmiye C ABAY (university of texas at dallas); Murat Kantarcioglu (UT Dallas); Yan Zhou (university of texas at dallas); Bhavani Thuraisingham (university of texas at dallas); Latanya Sweeney (Data Privacy Lab at Harvard University)
Time: Tuesday, 14.40

For many critical applications ranging from health care to social sciences, releasing personal data while protecting individual privacy is paramount. Over the years, data anonymization and synthetic data generation techniques have been proposed to address this challenge. Unfortunately, data anonymization approaches do not provide rigorous privacy guarantees. Although, there are existing synthetic data generation techniques that use rigorous definitions of differential privacy, to our knowledge, these techniques have not been compared extensively using different utility metrics. In this work, we provide two novel contributions. First, we compare existing techniques on different datasets using different utility metrics. Second, we present a novel approach that utilizes deep learning techniques coupled with an efficient analysis of privacy costs to generate differentially private synthetic datasets with higher data utility. We show that we can learn deep learning models that can capture relationship among multiple features, and then use these models to generate differentially private synthetic datasets. Our extensive experimental evaluation conducted on multiple datasets indicates that our proposed approach is more robust (i.e., one of the top performing technique in almost all type of data we have experimented) compared to the state-of-the art methods in terms of various data utility measures.

On Finer Control of Information Flow in LSTMs
Hang Gao (University of Maryland Baltimore County); Tim Oates (University Of Maryland Baltimore County)
Time: Tuesday, 15.00

Since its inception in 1995, Long Short-Term Memory (LSTM) architecture for recurrent neural networks has shown promising performance, sometimes state-of-art, for various tasks. Aiming at achieving constant error flow through hidden units, LSTM introduces a complex unit called memory cell, in which gates are adopted to control the exposure/isolation of information flowing in, out and back to itself. Despite of its widely acknowledged success, in this paper, we propose a hypothesis that LSTMs may suffer from an implicit functional binding of information exposure/isolation for the output and candidate computation, i.e., the output gate at time t-1 is not only in charge of the information flowing out of a cell as the response to the external environment, but also controls the information flowing back to the cell for the candidate computation, which is often the only source of nonlinear combination of input at time t and previous cell state at time t-1 for cell memory update. We propose Untied Long Short Term Memory (ULSTM) as a solution to the above problem. We test our model on various tasks, including semantic relatedness prediction, language modeling and sentiment classification. The experiment results indicate that our proposed model is capable to at least partially solve the problem and outperform LSTM for all these tasks.

Auxiliary guided autoregressive variational autoencoders
Thomas LUCAS (INRIA); Jakob Verbeek (INRIA)
Time: Tuesday, 15.20

Generative modeling of high-dimensional data is a key problem in machine learning. Successful approaches include latent variable models and autoregressive models. The complementary strengths of these approaches, to model global and local image statistics respectively, suggest hybrid models that encode global image structure into latent variables while autoregressively modeling low level detail. Prior approaches to such hybrid models needed to restrict the capacity of the autoregressive decoder to prevent degenerate models that ignore the latent variables and only rely on autoregressive modeling. Our contribution is a training procedure relying on an auxiliary loss function that controls whatinformation is captured by the latent variables and what is left to the autoregressive decoder. Our approach can leverage arbitrarily powerfull autoregressive decoder, achieves state-of-the art quantitative performance among models with latent variables and generates qualitatively convincing samples.

Session: ADS Learning
Location: Nally
Time: Tuesday, 14:00

From Empirical Analysis to Public Policy: Evaluating Housing Systems for Homeless Youth
Hau Chan (University of Nebraska-Lincoln); Eric Rice (University of Southern California); Phebe Vayanos (University of Southern California); Milind Tambe (University of Southern California); Matthew Morton (Chapin Hall at the University of Chicago)
Time: Tuesday, 14.00

There are nearly 2 million homeless youth in the United States each year. Coordinated entry systems are being used to provide homeless youth with housing assistance across the nation. Despite these efforts, the number of youth still homeless or unstably housed remains very high. Motivated by this fact, we initiate a first study to understand and analyze the current governmental housing systems for homeless youth. In this paper, we aim to provide answers to the following questions: (1) What is the current governmental housing system for assigning homeless youth to different housing assistances? (2) Can we infer the current assignment guidelines of the local housing communities? (3) What is the result and outcome of the current assignment process? (4) Can we predict whether the youth will be homeless after receiving the housing assistances? To answer these questions, we first provide an overview of the current housing systems. Next, we use simple and interpretable machine learning tools to infer the decision rules of the local communities and evaluate the outcomes of such assignment. We then determine whether the vulnerability features/rubrics can be used to predict youth’s homelessness status after receiving housing assistance. Finally, we discuss the policy recommendations from our study for the local communities and the U.S. Housing and Urban Development (HUD).

LinNet: Probabilistic Basketball Lineup Evaluation Through Network Embedding
Konstantinos Pelechrinis (University of Pittsburgh)
Time: Tuesday, 14.20

Which of your team’s possible lineups has the best chances against each of your opponent’s possible lineups? To answer this ques- tion, we develop LinNet (which stands for LINeup NETwork). LinNet exploits the dynamics of a directed network that captures the perfor- mance of lineups during their matchups. The nodes of this network rep- resent the different lineups, while an edge from node B to node A exists if lineup λ_A has outperformed lineup λ_B. We further annotate each edge with the corresponding performance margin (point margin per minute). We then utilize this structure to learn a set of latent features for each node (i.e., lineup) using the node2vec framework. Consequently, using the latent, learned features, LinNet builds a logistic regression model for the probability of lineup λ_A outperforming lineup λ_B. We evaluate the proposed method by using NBA lineup data from the five seasons between 2007-08 and 2011-12. Our results indicate that our method has an out-of-sample accuracy of 68%. In comparison, utilizing simple network centrality metrics (i.e., PageRank) achieves an accuracy of just 53%, while using the adjusted plus-minus of the players in the lineup for the same prediction problem provides an accuracy of only 55%. We have also explored the adjusted lineups’ plus-minus as our predictors and obtained an accuracy of 59%. Furthermore, the probability output of LinNet is well-calibrated as indicated by the Brier score, while the reliability curve is approximated well by the straight-line y=x. One of the main benefits of LinNet is its generic nature that allows it to be applied in different sports since the only input required is the lineups’ matchup network, i.e., not any sport-specific features are needed.

Improving Emotion Detection with Sub-clip Boosting
Ermal Toto (Worcester Polytechnic Institute); Elke Rundensteiner (WPI); Brendan Foley (WPI)
Time: Tuesday, 14.40

With the emergence of systems such as Amazon Echo, Google Home, and Siri, voice has become a prevalent mode for humans to interact with machines. Emotion detection from voice promises to transform a wide range of applications, from adding emotional-awareness to voice assistants, to creating more sensitive robotic helpers for the elderly. Unfortunately, due to individual differences, emotion expression varies dramatically, making it a challenging problem. To tackle this challenge, we introduce the Sub-Clip Classification Boosting (SCB) Framework, a multi-step methodology for emotion detection from non-textual features of audio clips. SCB features a highly-effective sub-clip boosting methodology for classification that, unlike traditional boosting using feature subsets, instead works at the sub-instance level. Multiple sub-instance classifications increase the likelihood that an emotion cue will be found within a voice clip, even if its location varies between speakers. First, each parent voice clip is decomposed into overlapping sub-clips. Each sub-clip is then independently classified. Further, the Emotion Strength of the sub-classifications is scored to form a sub-classification and strength pair. Finally we design a FilterBoost-inspired ‘Oracle’, that utilizes sub-classification and Emotion Strength pairs to determine the parent clip classification. To tune the classification performance, we explore the relationships between sub-clip properties, such as length and overlap. Evaluation on 3 prominent benchmark datasets demonstrates that our SCB method consistently outperforms all state-of-the art-methods across diverse languages and speakers.

Machine Learning for Targeted Assimilation of Satellite Data
Yu-Ju Lee (University of Colorado Boulder); David Hall (University of Colorado Boulder); Jebb Stewart (Cooperative Institute for Research in the Atmosphere); Mark Govett (National Oceanic and Atmospheric Administration)
Time: Tuesday, 15.00

Optimizing the utilization of huge data sets is a challenging problem for weather prediction. To a significant degree, prediction accuracy is determined by the data used in model initialization, assimilated from a variety of observational platforms. At present, the volume of weather data collected in a given day greatly exceeds the ability of assimilation systems to make use of it. Typically, data is ingested uniformly at the highest fixed resolution that enables the numerical weather prediction (NWP) model to deliver its prediction in a timely fashion. In order to make more efficient use of newly available high-resolution data sources, we seek to identify regions of interest (ROI) where increased data quality or volume is likely to significantly enhance weather prediction accuracy. In particular, we wish to improve the utilization of data from the recently launched Geostationary Operation Environmental Satellite (GOES)-16, which provides orders of magnitude more data than its predecessors. To achieve this, we demonstrate a method for locating tropical cyclones using only observations of precipitable water, which is evaluated using the Global Forecast System (GFS) weather prediction model. Most state of the art hurricane detection techniques rely on multiple feature sets, including wind speed, wind direction, temperature, and IR emissions, potentially from multiple data sources. In contrast, we demonstrate that this model is able to achieve comparable performance on historical tropical cyclone data sets, using only observations of precipitable water.

Discovering Groups of Signals in In-Vehicle Network Traces for Redundancy Detection and Functional Grouping
Artur Mrowca (BMW Group); Barbara Moser (BMW Group); Stephan Günnemann (Technical University of Munich)
Time: Tuesday, 15.20

Modern vehicles exchange signals across multiple ECUs in order to run various functionalities. With increasing functional complexity the amount of distinct signals grew too large to be analyzed manually. During development of a car only subsets of such signals are relevant per analysis and functional group. Moreover, historical growth led to redundancies in signal specifications which need to be discovered. Both tasks can be solved through the discovery of groups. While the analysis of in-vehicle signals is increasingly studied, the grouping of relevant signals as a basis for those tasks was examined less. We therefore present and extensively evaluate a processing and clustering approach for semi-automated grouping of in-vehicle signals based on traces recorded from fleets of cars.

Session: Transfer Learning
Location: Hogan
Time: Tuesday, 16:00

Towards more Reliable Transfer Learning
Zirui Wang (Carnegie Mellon University); Jaime Carbonell (Carnegie Mellon University)
Time: Tuesday, 16.00

Multi-source transfer learning has been proven effective when within-target labeled data is scarce. Previous work focuses primarily on exploiting domain similarities and assumes that source domains are richly or at least comparably labeled. While this strong assumption is never true in practice, this paper relaxes it and addresses challenges related to sources with diverse labeling volume and diverse reliability. The first challenge is combining domain similarity and source reliability by proposing a new transfer learning method that utilizes both source-target similarities and inter-source relationships. The second challenge involves pool-based active learning where the oracle is only available in source domains, resulting in an integrated active transfer learning framework that incorporates distribution matching and uncertainty sampling. Extensive experiments on synthetic and two real-world datasets clearly demonstrate the superiority of our proposed methods over several baselines including state-of-the-art transfer learning methods.

Web-Induced Heterogeneous Transfer Learning with Sample Selection
Sanatan Sukhija (IIT Ropar); Narayanan C Krishnan (IIT Ropar)
Time: Tuesday, 16.20

Transfer learning algorithms utilize knowledge from a data-rich source domain to learn a model in the target domain where labeled data is scarce. This paper presents a novel solution for the challenging and interesting problem of Heterogeneous Transfer Learning (HTL) where the source and target task have heterogeneous feature and label spaces. Contrary to common space based HTL algorithms, the proposed HTL algorithm adapts source data for the target task. The correspondence required for aligning the heterogeneous features of the source and target domain is obtained through labels across two domains that are semantically aligned using web-induced knowledge. The experimental results suggest that the proposed algorithm performs significantly better than state-of-the-art transfer approaches on three diverse real-world transfer tasks.

Differentially Private Hypothesis Transfer Learning
Yang Wang (University of Virginia); Quanquan Gu (University of California, Los Angeles); Donald Brown (University of Virginia)
Time: Tuesday, 16.40

In recent years, the focus of machine learning has been shifting to the paradigm of transfer learning where the data distribution in the target domain differs from that in the source domain. This is a prevalent setting in real-world classification problems and numerous well-established theoretical results in the classical supervised learning paradigm will break down under this setting. In addition, the increasing privacy protection awareness restricts access to source domain samples and poses new challenges for the development of privacy-preserving transfer learning algorithms. In this paper, we propose a novel differentially private multiple-source hypothesis transfer learning method for logistic regression. The target learner operates on differentially private hypotheses and importance weighting information from the sources to construct informative Gaussian priors for its logistic regression model. By leveraging a publicly available auxiliary data set, the importance weighting information can be used to determine the relationship between the source domain and the target domain without leaking source data privacy. Our approach provides a robust performance boost even when high quality labeled samples are extremely scarce in the target data set. The extensive experiments on two real-world data sets confirm the performance improvement of our approach over several baselines. Keywords: Differential privacy, Transfer learning.

Information-theoretic Transfer Learning framework for Bayesian Optimisation
Anil Ramachandran (Deakin University, Australia); Sunil Gupta (Deakin University, Australia); Santu Rana (Deakin University, Australia); Svetha Venkatesh (Deakin University)
Time: Tuesday, 17.00

Transfer learning in Bayesian optimisation is a popular way to alleviate “cold start” issue. However, most of the existing transfer learning algorithms use overall function space similarity, not a more aligned similarity measure for Bayesian optimisation based on the location of the optima. That makes these algorithms fragile to noisy perturbations, and even simple scaling of function values. In this paper, we propose a robust transfer learning based approach that transfer knowledge of the optima using a consistent probabilistic framework. From the finite samples for both source and target, a distribution on the optima is computed and then divergence between these distributions are used to compute similarities. Based on the similarities a mixture distribution is constructed, which is then used to build a new information-theoretic acquisition function in a manner similar to Predictive Entropy Search (PES). The proposed approach also offers desirable “no bias” transfer learning in the limit. Experiments on both synthetic functions and a set of hyperparameter tuning tests clearly demonstrate the effectiveness of our approach compared to the existing transfer learning methods.

Feature Selection for Unsupervised Domain Adaptation using Optimal Transport
Léo Gautheron (Laboratoire Hubert Curien); Ievgen Redko (CREATIS); Carole Lartizien (CREATIS)
Time: Tuesday, 17.20

In this paper, we propose a new feature selection method for unsupervised domain adaptation based on the emerging optimal transportation theory. We build upon a recent theoretical analysis of optimal transport in domain adaptation and show that it can directly suggest a feature selection procedure leveraging the shift between the domains. Based on this, we propose a novel algorithm that aims to sort features by their similarity across the source and target domains, where the order is obtained by analyzing the coupling matrix representing the solution of the proposed optimal transportation problem. We evaluate our method on a well-known benchmark data set and illustrate its capability of selecting correlated features leading to better classification performances. Furthermore, we show that the proposed algorithm can be used as a pre-processing step for existing domain adaptation techniques ensuring an important speed-up in terms of the computational time while maintaining comparable results. Finally, we validate our algorithm on clinical imaging databases for computer-aided diagnosis task with promising results.

[Add paper to your calendar]  [Download paper] Reproducible Research

Session: Anomaly and Outlier Detection
Location: Hogan Mezz 1
Time: Tuesday, 16:00

Explaining Anomalies in Groups with Characterizing Subspace Rules
Meghanath MY (Carnegie Mellon University), Leman Akoglu (Carnegie Mellon University)
Time: Tuesday, 16.00

Anomaly detection has numerous applications and has been studied vastly. We tap into a gap in the literature and consider a complementary problem that has a much sparser literature: anomaly description. Interpretation of anomalies is crucial for practitioners for sense-making, troubleshooting, and planning actions. To this end, we present a new approach called x-PACS (for eXplaining Patterns of Anomalies with Characterizing Subspaces), which reverse-engineers the known anomalies by identifying (1) the groups (or patterns) that they form, and (2) the characterizing subspace and feature rules that separate each anomalous pattern from normal instances. Explaining anomalies in groups not only saves analyst time and gives insight into various types of anomalies, but also draws attention to potentially critical, repeating anomalies. In developing x-PACS, we first construct a desiderata for the anomaly description problem.From a descriptive data mining perspective, our method exhibits five desired properties in our desiderata. Namely, it can unearth anomalous patterns(i) of multiple different types,(ii) hidden in arbitrary subspaces of a high dimensional space,(iii) interpretable by human analysts,(iv) different from normal patterns of the data, and finally(v) succinct, providing a short data description.No existing work on anomaly description satisfies all of these properties simultaneously.Furthermore, x-PACS is highly parallelizable; it is linear on the number of data points and exponential on the (typically small) largest characterizing subspace size. The anomalous patterns that x-PACS finds constitute interpretable signatures, and while it is not our primary goal, they can be used for anomaly detection.Through extensive experiments on real-world data sets, we show the effectiveness and superiority of x- PACS in anomaly explanation over various baselines, and demonstrate its competitive detection performance as compared to the state-of-the-art.

L1-Depth Revisited: A Robust Angle-based Outlier Factor in High-dimensional Space
Ninh Pham (University of Copenhagen)
Time: Tuesday, 16.20

Angle-based outlier detection (ABOD) has been recently emerged as an effective method to detect outliers in high dimensions. Instead of examining neighborhoods as proximity-based concepts, ABOD assesses the broadness of angle spectrum of a point as an outlier factor. Despite being a \emph{parameter-free} and robust measure in high-dimensional space, the exact solution of ABOD suffers from the cubic cost $O(n^3)$ regarding the data size $n$, hence cannot be used on large-scale data sets. In this work we present a \emph{conceptual} relationship between the ABOD intuition and the L1-depth concept in statistics, one of the earliest methods used for detecting outliers. Deriving from this relationship, we propose to use L1-depth as a variant of angle-based outlier factors, since it only requires a quadratic computational time as proximity-based outlier factors. Empirically, L1-depth is competitive (often superior) to proximity-based and other proposed angle-based outlier factors on detecting high-dimensional outliers regarding both efficiency and accuracy. In order to avoid the quadratic computational time, we introduce a simple but efficient sampling method named \emph{SamDepth} for estimating L1-depth measure. We also present theoretical analysis to guarantee the reliability of SamDepth. The empirical experiments on many real-world high-dimensional data sets demonstrate that SamDepth with $\sqrt{n}$ samples often achieves very competitive accuracy and runs several orders of magnitude faster than other proximity-based and ABOD competitors.

Beyond Outlier Detection: LookOut for Pictorial Explanation
Nikhil Gupta (IIT Delhi); Dhivya Eswaran (Carnegie Mellon University); Neil Shah (Snap Inc.); Leman Akoglu (Carnegie Mellon University); Christos Faloutsos (CMU, Pittsburgh)
Time: Tuesday, 16.40

Why is a given point in a dataset marked as an outlier by an off-the-shelf detection algorithm? Which feature(s) explain it the best? What is the best way to convince a human analyst that the point is indeed an outlier? We provide succinct, interpretable, and simple pictorial explanations of outlying behavior in multi-dimensional real-valued datasets while respecting the limited attention of human analysts. Specifically, we propose to output a few pictures (focus-plots, ie., pair-wise feature plots) from a few, carefully chosen feature sub-spaces. The proposed method, called LookOut, makes four contributions: (a) problem formulation: we introduce an “analyst-centered” problem formulation for explaining outliers via focus-plots, (b) explanation algorithm: we propose a plot-selection objective and the LookOut algorithm to approximate it with optimality guarantees, (c) generality: our explanation algorithm is both domain- and detector-agnostic, and (d) scalability: LookOut scales linearly with the size of input outliers to explain and the explanation budget. Our experiments show that LookOut performs near-ideally in terms of maximizing explanation objective on several real datasets, while producing fast, visually interpretable and intuitive results in explaining groundtruth outliers.

[Add paper to your calendar]  [Download paper] Reproducible Research

Contextual Outlier Detection with Multiple Contexts: Application to Ad Fraud
Meghanath Macha (Carnegie Mellon University); Deepak Pai (Adobe); Leman Akoglu (CMU)
Time: Tuesday, 17.00

Outlier detection has numerous applications in different domains. A family of techniques, called contextual outlier detectors, are based on a single, user-specified demarcation of data attributes into indicators and contexts. In this work, we propose ConOut, a new contextual outlier detection technique that leverages multiple contexts that are automatically identified. Importantly, ConOut is a one-click algorithm—it does not require any user-specified (hyper)parameters. Through experiments on various real-world data sets, we show that ConOut outperforms existing baselines in detection accuracy. Further, we motivate and apply ConOut to the advertisement domain to identify fraudulent publishers, where ConOut not only improves detection but also provides statistically significant revenue gains to advertisers: a minimum of 57% compared to a naive fraud detector; and ~20% in revenue gains as well as ~34% in mean average precision compared to its nearest competitor.

[Add paper to your calendar]  [Download paper] Reproducible Research

Session: Applications
Location: Hogan Mezz 2
Time: Tuesday, 16:00

ColdRoute: Effective Routing of Cold Questions in Stack Exchange Sites
Jiankai Sun (Ohio State University), Abhinav Vishnu (Pacific Northwest National Laboratory), Aniket Chakrabarti (Microsoft), Charles Siegel (Pacific Northwest National Laboratory), Srinivasan Parthasarathy (Ohio State University)
Time: Tuesday, 16.00

Routing questions in Community Question Answer services (CQAs) such as Stack Exchange sites is a well studied problem. Yet, cold- start — a phenomena observed when a new question is posted is not well addressed by existing approaches. Additionally, cold questions posted by new askers present significant challenges to state-of-the-art approaches. We propose ColdRoute to address these challenges, which is able to handle the task of routing cold questions posted by new or existing askers to matching experts. Specifically, we use Factorization Machines on one-hot encoding of critical features such as question tags and compare our approach to well studied techniques such as CQARank and semantic matching (LDA, BoW and Doc2Vec). Using data from eight stack exchange sites, we are able to improve upon the routing metrics (precision$@1$, accuracy, MRR) over the state-of-the-art models such as semantic matching by $159.5\%$,$31.84\%$, and $40.36\%$ for cold questions posted by existing askers, and $123.1\%$, $27.03\%$, and $34.81\%$ for cold questions posted by new askers respectively.

Pedestrian Trajectory Prediction with Structured Memory Hierarchies
Tharindu Fernando (Queensland University of Technology); SIMON DENMAN (Queensland University of Technology, Australia); Sridha Sridharan (QUT); Clinton Fookes (Queensland University of Technology)
Time: Tuesday, 16.20

This paper presents a novel framework for human trajectory prediction based on multimodal data (video and radar). Motivated by recent neuroscience discoveries, we propose incorporating a structured memory component in the human trajectory prediction pipeline to capture historical information to improve performance. We introduce structured LSTM cells for modelling the memory content hierarchically, preserving the spatiotemporal structure of the information and enabling us to capture both short term and long term context. We demonstrate how this architecture can be extended to integrate salient information from multiple modalities to automatically store and retrieve important information for decision making without any supervision. We evaluate the effectiveness of the proposed models on a novel multimodal dataset that we introduce, consisting of 40,000 pedestrian trajectories, acquired jointly from a radar system and a CCTV camera system installed in a public place. The performance is also evaluated on the publicly available New York Grand Central pedestrian database. In both settings, the proposed models demonstrate their capability to better anticipate future pedestrian motion compared to existing state of the art.

Detecting Autism by Analyzing a Simulated Social Interaction
Hanna Drimalla (Humboldt-Universität zu Berlin); Niels Landwehr (Leibniz Institute for Agricultural Engineering and Bioeconomy); Irina Baskow (Humboldt-Universität Berlin); Behnoush Behnia ( Klinik für Psychiatrie und Psychotherapie, Charité-Universitätsmedizin Berlin, Campus Benjamin Franklin); Stefan Roepke (Klinik für Psychiatrie und Psychotherapie, Charité-Universitätsmedizin Berlin, Campus Benjamin Franklin); Isabel Dziobek (Humboldt-Universität zu Berlin); Tobias Scheffer (University of Potsdam)
Time: Tuesday, 16.40

Diagnosing autism spectrum conditions takes several hours by well-trained practitioners; therefore, standardized questionnaires are widely used for first-level screening. Questionnaires as a diagnostic tool, however, rely on self reflection—which is typically impaired in individuals with autism spectrum condition. We develop an alternative screening mechanism in which subjects engage in a simulated social interaction. During this interaction, the subjects’ voice, eye gaze, and facial expression are tracked, and features are extracted that serve as input to a predictive model. We find that a random-forest classifier on these features can detect autism spectrum condition accurately and functionally independently of diagnostic questionnaires. We also find that a regression model estimates the severity of the condition more accurately than the reference screening method.

Face-Cap: Image Captioning using Facial Expression Analysis
Omid Mohamad Nezami (Computing Department, Macquarie University, NSW 2109); Mark Dras (Computing Department, Macquarie University, NSW 2109); Peter Anderson (Georgia Tech); Len Hamey (Computing Department, Macquarie University, NSW 2109)
Time: Tuesday, 17.00

Image captioning is the process of generating a natural language description of an image. Most current image captioning models, however, do not take into account the emotional aspect of an image, which is very relevant to activities and interpersonal relationships represented therein. Towards developing a model that can produce human-like captions incorporating these, we use facial expression features extracted from images including human faces, with the aim of improving the descriptive ability of the model. In this work, we present two variants of our Face-Cap model, which embed facial expression features in different ways, to generate image captions. Using all standard evaluation metrics, our Face-Cap models outperform a state-of-the-art baseline model for generating image captions when applied to an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. An analysis of the captions finds that, perhaps surprisingly, the improvement in caption quality appears to come not from the addition of adjectives linked to emotional aspects of the images, but from more variety in the actions described in the captions.

[Add paper to your calendar]  [Download paper] Reproducible Research

A Discriminative Model for Identifying Readers and Assessing Text Comprehension from Eye Movements
Silvia Makowski (University of Potsdam); Lena Jäger (University of Potsdam); Ahmed Abdelwahab (Leibniz Institute for Agricultural Engineering and Bioeconomy); Niels Landwehr (Leibniz Institute for Agricultural Engineering and Bioeconomy); Tobias Scheffer (University of Potsdam)
Time: Tuesday, 17.20

We study the problem of inferring readers’ identities and estimating their level of text comprehension from observations of their eye movements during reading. We develop a generative model of individual gaze patterns (scanpaths) that makes use of lexical features of the fixated words. Using this generative model, we derive a Fisher-score representation of eye-movement sequences. We study whether a Fisher-SVM with this Fisher kernel and several reference methods are able to identify readers and estimate their level of text comprehension based on eye-tracking data. While none of the methods are able to estimate text comprehension accurately, we find that the SVM with Fisher kernel excels at identifying readers.

Session: Pattern and Sequence Mining
Location: Davin
Time: Tuesday, 16:00

Mining Periodic Patterns with a MDL Criterion
Esther Galbrun (Aalto University); Peggy Cellier (IRISA); Nikolaj Tatti (Aalto University); Alexandre Termier (Université Rennes 1); Bruno Cremilleux (Université de Caen Normandie)
Time: Tuesday, 16.00

The quantity of event logs available is increasing rapidly, be they produced by industrial processes, computing systems, or life tracking, for instance. It is thus important to design effective ways to uncover the information they contain. Because event logs often record repetitive phenomena, mining periodic patterns is especially relevant when considering such data. Indeed, capturing such regularities is instrumental in providing condensed representations of the event sequences. We present an approach for mining periodic patterns from event logs while relying on a Minimum Description Length (MDL) criterion to evaluate candidate patterns. Our goal is to extract a set of patterns that suitably characterises the periodic structure present in the data. We evaluate the interest of our approach on several real-world event log datasets.

Revisiting Conditional Functional Dependency Discovery: Splitting the “C” from the “FD”.
Joeri Rammelaere (University of Antwerp); Floris Geerts (University of Antwerp)
Time: Tuesday, 16.20

Many techniques for cleaning dirty data are based on enforcing some set of integrity constraints. Conditional functional dependencies (CFDs) are a combination of traditional Functional dependencies (FDs) and association rules, and are widely used as a constraint formalism for data cleaning. However, the discovery of such CFDs has received limited attention. In this paper, we regard CFDs as an extension of association rules, and present three general methodologies for (approximate) CFD discovery, each using a different way of combining pattern mining for discovering the conditions (the “C” in CFD) with FD discovery. We discuss how existing algorithms fit into these three methodologies, and introduce new techniques to improve the discovery process. We show that the right choice of methodology improves performance over the traditional CFD discovery method CTane.

[Add paper to your calendar]  [Download paper] Reproducible Research

Discovering Spatio-Temporal Latent Influence in Geographical Attention Dynamics
Minoru Higuchi (Ryukoku University); Kanji Matsutani (NTT West Corporation); Masahito Kumano (Ryukoku University); Masahiro Kimura (Ryukoku University)
Time: Tuesday, 16.40

We address the problem of modeling the occurrence process of events for visiting attractive places, called points-of-interest (POIs), in a sightseeing city in the setting of a continuous time-axis and a continuous spatial domain, which is referred to as modeling geographical attention dynamics. By combining a Hawkes process with a time-varying Gaussian mixture model in a novel way and incorporating the influence structure depending on time slots as well, we propose a probabilistic model for discovering the spatio-temporal influence structure among major sightseeing areas from the viewpoint of geographical attention dynamics, and aim to accurately predict POI visit events in the near future. We develop an efficient method of inferring the parameters in the proposed model from the observed sequence of POI visit events, and present an analysis method for the geographical attention dynamics. Using real data of POI visit events in a Japanese sightseeing city, we demonstrate that the proposed model outperforms conventional models in terms of predictive accuracy, and uncover the spatio-temporal influence structure among major sightseeing areas in the city from the perspective of geographical attention dynamics.

An Efficient Algorithm for Computing Entropic Measures of Feature Subsets
Frédéric Pennerath (CentraleSupélec – LORIA)
Time: Tuesday, 17.00

Entropic measures such as conditional entropy or mutual information have been used numerous times in pattern mining, for instance to characterize valuable itemsets or approximate functional dependencies. Strangely enough the fundamental problem of designing efficient algorithms to compute entropy of subsets of features (or mutual information of feature subsets relatively to some target feature) has received little attention compared to the analog problem of computing frequency of itemsets. The present article proposes to fill this gap: it introduces a fast and scalable method that computes entropy and mutual information for a large number of feature subsets by adopting the divide and conquer strategy used by FP-growth, one of the most efficient frequent itemset mining algorithm. In order to illustrate its practical interest, the algorithm is then used to solve the recently introduced problem of mining reliable approximate functional dependencies. It finally provides empirical evidences that in the context of non-redundant pattern extraction, the proposed method outperforms existing algorithms for both speed and scalability.

[Add paper to your calendar]  [Download paper] Reproducible Research

Fast Schemes for Online Record Linkage
Dimitrios Karapiperis (Hellenic Open University), Aris Gkoulalas-Divanis (IBM Watson Health), Vassilios S. Verykios (Hellenic Open University)
Time: Tuesday, 17.20

The process of integrating large volumes of data coming from disparate data sources, in order to detect records that refer to the same entities, has always been an important problem in both academia and industry. This problem becomes significantly more challenging when the integration involves a huge amount of records and needs to be conducted in a real-time fashion to address the requirements of critical applications. In this paper, we propose two novel schemes for online record linkage, which achieve very fast response times and high levels of recall and precision. Our proposed schemes embed the records into a Bloom filter space and employ the Hamming Locality-Sensitive Hashing technique for blocking. Each Bloom filter is hashed to a number of hash tables in order to amplify the probability of formulating similar Bloom filter pairs. The main theoretical premise behind our first scheme relies on the number of times a Bloom filter pair is formulated in the hash tables of the blocking mechanism. We prove that this number strongly depends on the distance of that Bloom filter pair. This correlation allows us to estimate in real-time the Hamming distances of Bloom filter pairs without performing the comparisons. The second scheme is progressive and achieves high recall, upfront during the linkage process, by continuously adjusting the sequence in which the hash tables are scanned, and also guarantees, with high probability, the identification of each similar Bloom filter pair. Our experimental evaluation, using four real-world data sets, shows that the proposed schemes outperform four state-of-the-art methods by achieving higher recall and precision, while being very efficient.

Session: ADS Health
Location: Nally
Time: Tuesday, 16:00

AMIE: Automatic Monitoring of Indoor Exercises
Tom Decroos (KU Leuven)
Time: Tuesday, 16.00

Patients with sports-related injuries need to learn to perform rehabilitative exercises with correct movement patterns. Unfortunately, the feedback a physiotherapist can provide is limited by the visitation frequency of the patient. We study the feasibility of a system that automatically provides feedback on correct movement patterns to patients using a Microsoft Kinect camera and Machine Learning techniques. We discuss several challenges related to the Kinect’s proprietary software, the Kinect data’s heterogeneity, and the Kinect data’s temporal component. We introduce AMIE, a machine learning pipeline that detects the exercise being performed, the exercise’s correctness, and if applicable, the mistake that was made. To evaluate AMIE, ten participants were instructed to perform three types of typical rehabilitation exercises (squats, forward lunges and side lunges) demonstrating both correct movement patterns and frequent types of mistakes, while being recorded with a Kinect. AMIE detects the type of exercise almost perfectly with 99% accuracy and the type of mistake with 73% accuracy.

[Add paper to your calendar]  [Download paper] Reproducible Research

Bayesian Best-Arm Identification for Selecting Influenza Mitigation Strategies
Pieter Libin (Vrije Universiteit Brussel); Timothy Verstraeten (Vrije Universiteit Brussel); Diederik Roijers (Vrije Universiteit Brussel); Jelena Grujic (Vrije Universiteit Brussel); Kristof Theys (Katholieke Universiteit Leuven); Philippe Lemey (Katholieke Universiteit Leuven); Ann Nowé (Vrije Universiteit Brussel)
Time: Tuesday, 16.20

Pandemic influenza has the epidemic potential to kill millions of people. While various preventive measures exist (i.a., vaccination and school closures), deciding on strategies that lead to their most effective and efficient use, remains challenging. To this end, individual-based epidemiological models are essential to assist decision makers in determining the best strategy to curb epidemic spread. However, individual-based models are computationally intensive and therefore it is pivotal to identify the optimal strategy using a minimal amount of model evaluations. Additionally, as epidemiological modeling experiments need to be planned, a computational budget needs to be specified a priori. Consequently, we present a new sampling method to optimize the evaluation of preventive strategies using fixed budget best-arm identification algorithms. We use epidemiological modeling theory to derive knowledge about the reward distribution which we exploit using Bayesian best-arm identification algorithms (i.e., Top-two Thompson sampling and BayesGap). We evaluate these algorithms in a realistic experimental setting and demonstrate that it is possible to identify the optimal strategy using only a limited number of model evaluations, i.e., 2-to-3 times faster compared to the uniform sampling method, the predominant technique used for epidemiological decision making in the literature. Finally, we contribute and evaluate a statistic for Top-two Thompson sampling to inform the decision makers about the confidence of an arm recommendation.

Rough Set Theory as a Data Mining Technique: A case Study in Epidemiology and Cancer Incidence Prediction
Zaineb Chelly Dagdia (Aberyswyth University); Christine Zarges (Aberyswyth University); Benjamin Schannes (ENSAE); Martin Micalef (Actuaris); Lino Galiana (ENS); Benoit Rolland (Altran Technologies); Olivier de Fresnoye (Epidemium); Mehdi Benchoufi (Hôpital Hôtel Dieu)
Time: Tuesday, 16.40

A big challenge in epidemiology is to perform data pre-processing, specifically feature selection, on a large scale data and high dimensional feature set. In this paper, this challenge is tackled by using a recently established distributed and scalable version of Rough Set Theory (RST). It considers epidemiological data that has been collected from three international institutions for the purpose of cancer incidence prediction. The concrete data set used aggregates about 5495 risk factors (features), spanning 32 years and 38 countries. Detailed experiments demonstrate that RST is relevant to real world big data applications as it can offer insights about the selected risk factors, speed up the learning process, assure the performance of the cancer incidence prediction model without a significant information loss, and simplify the learned model for epidemiologists.

Hypotensive Episode Prediction in ICU via Observation Window Splitting
Elad Tsur (Ben-Gurion University of the Negev); Mark Last (Ben-Gurion University of the Negev); Victor F Garcia (Cincinnati Children’s Hospital Medical Center); Raphael Udassin (Hadassah University Hospital, Ein Karem, Jerusalem)
Time: Tuesday, 17.00

Hypotension, defined as a low blood pressure, is a significant risk factor in intensive care units (ICU), which requires a prompt therapeutic intervention. The goal of our research is to predict a hypotensive episode by time-series analysis of continuously monitored physiologic vital signs. Particularly, we aim to give the physicians a one-hour warning using a prognostic model based on the last Observation Window (OW) at the prediction time. So far, other clinical episode prediction studies used a single OW of 5-120 minutes to extract predictive features, with no significant improvement when using longer OWs. We developed the In-Window Segmentation (InWiSe) method for time series prediction, which splits a single OW into several sub-windows of equal size. The resulting feature set combines the features extracted from each observation sub-window and we used the Extreme Gradient Boosting (XGBoost) binary classifier to produce an impending episode prediction model from the combined feature set. We evaluate the proposed approach on three retrospective ICU datasets (extracted from MIMIC II, Soroka and Hadassah databases) using cross-validation on each dataset separately as well as by cross-dataset validation. The results show that InWiSe is superior in terms of the AUC metric compared to existing methods.

Equipment Health Indicator Learning using Deep Reinforcement Learning
Chi Zhang (Hitachi America Ltd.); Chetan Gupta (Hitachi Big Data Laboratory, Hitachi Americas Ltd.); Ahmed Farahat (Hitachi Big Data Laboratory, Hitachi America, Ltd); Kosta Ristovski (Hitachi Big Data Laboratory, Hitachi Americas Ltd.); Dipanjan D Ghosh (Hitachi Big Data Laboratory, Hitachi Americas Ltd.)
Time: Tuesday, 17.20

Predictive Maintenance (PdM) is gaining popularity in industrial operations as it leverages the power of Machine Learning and Internet of Things (IoT) to predict the future health status of equipment. Health Indicator Learning (HIL) plays an important role in PdM as it learns a health curve representing the health conditions of equipment over time, so that health degradation is visually monitored and optimal planning can be performed accordingly to minimize the equipment downtime. However, HIL is a hard problem due to the fact that there is usually no way to access the actual health of the equipment during most of its operation. Traditionally, HIL is addressed by hand-crafting domain-specific performance indicators or through physical modeling, which is expensive and inapplicable for some industries. In this paper, we propose a purely data-driven approach for solving the HIL problem based on Deep Reinforcement Learning (DRL). Our key insight is that the HIL problem can be mapped to a credit assignment problem. Then DRL learns from failures by naturally backpropagating the credit of failures into intermediate states. In particular, given the observed time series data of sensor readings, operational and event (failure) data, we learn a sequence of health indicators that represent the underlying health conditions of physical equipment. We demonstrate that the proposed methods significantly outperform the state-of-the-art methods for HIL and provide explainable insights about the equipment health. In addition, we propose the use of the learned health indicators to predict when the equipment is going to reach its end-of-life, and demonstrate how an explainable health curve is way more useful for a decision maker than a single-number prediction by a black-box model. The proposed approach has a great potential in a broader range of systems (e.g., economical and biological) as a general framework for the automatic learning of the underlying performance of complex systems.

Can We Assess Mental Health through Social Media and Smart Devices? Addressing Bias in Methodology and Evaluation
Adam Tsakalidis (University of Warwick); Maria Liakata (University of Warwick); Theodoros Damoulas (University of Warwick); Alexandra Cristea (University of Warwick)
Time: Tuesday, 17.40

Predicting mental health from smartphone and social media data on a longitudinal basis has recently attracted great interest, with very promising results being reported across many studies. Such approaches have the potential to revolutionise mental health assessment, if their development and evaluation follows a real world deployment setting. In this work we take a closer look at state-of-the-art approaches, using different mental health datasets and indicators, different feature sources and multiple simulations, in order to assess their ability to generalise. We demonstrate that under a pragmatic evaluation framework, none of the approaches deliver or even approach the reported performances. In fact, we show that current state-of-the-art approaches can barely outperform the most naïve baselines in the real-world setting, posing serious questions not only about their deployment ability, but also about the contribution of the derived features for the mental health assessment task and how to make better use of such data in the future.

Session: Demo & Poster 1
Location: Hogan
Time: Tuesday, 18:30

IDEA: An Interactive Dialogue Translation Demo System Using Furhat Robot
Jinhua Du (ADAPT Centre, School of Computing, Dublin City University)
Time: Tuesday, 18.30

We showcase IDEA, an Interactive DialoguE trAnslation system using Furhat robots, whose novel contributions are: (i) it is a web service-based application combining translation service, speech recognition service and speech synthesis service; (ii) it is a task-oriented hybrid machine translation system combining statistical and neural machine learning methods for domain-specific named entity (NE) recognition and translation; and (iii) it provides user-friendly interactive interface using Furhat robot with speech input, output, head movement and facial emotions. IDEA is a case-study demo which can efficiently and accurately assist customers and agents in different languages to reach an agreement in a dialogue for the hotel booking.

RAPID: Real-time Analytics Platform for Interactive Data MiningIDEA: An Interactive Dialogue Translation Demo System Using Furhat Robot
Kwan Hui Lim (The University of Melbourne); Sachini Jayasekara (The University of Melbourne); Shanika Karunasekera (The University of Melbourne, Australia); Aaron Harwood (University of Melbourne); Lucia Falzon (Defence Science and Technology Group); John Dunn (Defence Science and Technology Group); Glenn Burgess (Defence Science and Technology Group)
Time: Tuesday, 18.30

Twitter is a popular social networking site that generates a large volume and variety of tweets, thus a key challenge is to filter and track relevant tweets and identify the main topics discussed in real-time. For this purpose, we developed the Real-time Analytics Platform for Interactive Data mining (RAPID) system, which provides an effective data collection mechanism through query expansion, numerous analysis and visualization capabilities for understanding user interactions, tweeting behaviours, discussion topics, and other social patterns.

Interactive Time Series Clustering with COBRAS-TS
Toon Van Craenendonck (KU LEUVEN); Wannes Meert (KU Leuven); Sebastijan Dumancic (KU LEUVEN); Hendrik Blockeel (K.U. Leuven)
Time: Tuesday, 18.30

Time series are ubiquitous, resulting in substantial interest in time series data mining. Clustering is one of the most widely used techniques in this setting. Recent work has shown that time series clustering can benefit greatly from small amounts of supervision in the form of pairwise constraints. Such constraints can be obtained by asking the user to answer queries of the following type: should these two instances be in the same cluster? Answering ‘yes’ results in a must-link constraint, ‘no’ results in a cannot-link. In this paper we present an interactive clustering system that exploits such constraints. It is implemented on top of the recently introduced COBRAS-TS method. The system repeats the following steps until a satisfactory clustering is obtained: it presents several pairwise queries to the user through a visual interface, uses the resulting pairwise constraints to improve the clustering, and shows this new clustering to the user. Our system is readily available and comes with an easy-to-use interface, making it an effective tool for anyone interested in analyzing time series data.

pysubgroup: Easy-to-use Subgroup Discovery in Python
Florian Lemmerich (University of Wurzburg); Martin Becker (University of W³rzburg)
Time: Tuesday, 18.30

This paper introduces the pysubgroup package for subgroup discovery in Python. Subgroup discovery is a well-established data mining task that aims at identifying describable subsets in the data that show an interesting distribution with respect to a certain target concept. The presented package provides, an easy-to-use, compact and extensible implementation of state-of-the-art mining algorithms, interestingness measures, and visualizations. Since it builds directly on the established pandas data analysis library — a de-facto standard for data science in Python — it seamlessly integrates into preprocessing and exploratory data analysis steps.

An Advert Creation System for Next-Gen Publicity
Soumyabrata Dev (The ADAPT SFI Research Centre)
Time: Tuesday, 18.30

With the rapid proliferation of multimedia data in the internet, there has been a fast rise in the creation of videos for the viewers. This enables the viewers to skip the advertisement breaks in the videos, using ad blockers and ‘skip ad’ buttons – bringing online marketing and publicity to a stall. In this paper, we demonstrate a system that can effectively integrate a new advertisement into a video sequence. We use state-of-the-art techniques from deep learning and computational photogrammetry, for effective detection of existing adverts, and seamless integration of new adverts into video sequences. This is helpful for targeted advertisement, paving the path for next-gen publicity.

VHI: Valve Health Identification for the Maintenance of Subsea Industrial Equipment
Muhammad Atif Qureshi (University College Dublin); Luis Miralles (UCD); Jing Su (University College Dublin); Jason Payne (Wood.); Ronan Omalley (Wood.)
Time: Tuesday, 18.30

Subsea valves are a key piece of equipment in the extraction process of oil and natural gas. Valves control the flow of fluids by opening and closing passageways. A malfunctioning valve can lead to significant operational losses.In this paper, we describe VHI, a system designed to assist maintenance engineers with condition-based monitoring services for valves. VHI addresses the challenge of maintenance in two ways: a supervised approach that predicts impending valve failure, and an unsupervised approach that identifies and highlights anomalies i.e., an unusual valve behaviour. While the supervised approach is suitable for valves with long operational history, the unsupervised approach is suitable for valves with no operational history.

Wednesday Papers by Session

Session: Plenary 1
Location: Hogan
Time: Wednesday, 11:00

Best student DM paper runner up: Clustering in the Presence of Concept Drift
Richard H Moulton (University of Ottawa); Herna L Viktor (University of Ottawa); Nathalie Japkowicz (American University); Joao Gama (INESC TEC – LIAAD)
Time: Wednesday, 11.00

Clustering naturally addresses many of the challenges of data streams and many data stream clustering algorithms (DSCAs) have been proposed. The literature does not, however, provide quantitative descriptions of how these algorithms behave in different circumstances. In this paper we study how the clusterings produced by different DSCAs change, relative to the ground truth, as quantitatively different types of concept drift are encountered. This paper makes two contributions to the literature. First, we propose a method for generating real-valued data streams with precise quantitative concept drift. Second, we conduct an experimental study to provide quantitative analyses of DSCA performance with synthetic real-valued data streams and show how to apply this knowledge to real world data streams. We find that large magnitude and short duration concept drifts are most challenging and that DSCAs with partitioning-based offline clustering methods are generally more robust than those with density-based offline clustering methods. Finally, we find that increasing the number of classes present in a stream is a more challenging environment than decreasing the number of classes.

Best student DM paper runner up: GridWatch: Sensor Placement and Anomaly Detection in the Electrical Grid
Bryan Hooi (Carnegie Mellon University); Dhivya Eswaran (Carnegie Mellon University); Hyun Ah Song (Carnegie Mellon University); Amritanshu Pandey (Carnegie Mellon University); Marko Jereminov (Carnegie Mellon University); Larry Pileggi (Carnegie Mellon University); Christos Faloutsos (CMU, Pittsburgh)
Time: Wednesday, 11.20

Given sensor readings over time from a power grid consisting of nodes (e.g. generators) and edges (e.g. power lines), how can we most accurately detect when an electrical component has failed? More challengingly, given a limited budget of sensors to place, how can we determine where to place them to have the highest chance of detecting such a failure? Maintaining the reliability of the electrical grid is a major challenge. An important part of achieving this is to place sensors in the grid, and use them to detect anomalies, in order to quickly respond to a problem. Our contributions are: 1) Online anomaly detection: we propose a novel, online anomaly detection algorithm that outperforms existing approaches. 2) Sensor placement: we construct an optimization objective for sensor placement, with the goal of maximizing the probability of detecting an anomaly. We show that this objective has the property of submodularity, which we exploit in our sensor placement algorithm. 3) Effectiveness: Our sensor placement algorithm is provably near-optimal, and both our algorithms outperform existing approaches in accuracy by 59% or more (F-measure) in experiments. 4) Scalability: our algorithms scale linearly, and our detection algorithm is online, requiring bounded space and constant time per update.

[Add paper to your calendar]  [Download paper] Reproducible Research

Best student ML paper runner up: Incorporating Privileged Information to Unsupervised Anomaly Detection
Shubhranshu Shekhar (Carnegie Mellon University); Leman Akoglu (CMU)
Time: Wednesday, 11.40

We introduce a new unsupervised anomaly detection ensemble called SPI which can harness privileged information—data available only for training examples but not for (future) test examples. Our ideas build on the Learning Using Privileged Information (LUPI) paradigm pioneered by Vapnik et al. [19,17], which we extend to unsupervised learn- ing and in particular to anomaly detection. SPI (for Spotting anomalies with Privileged Information) constructs a number of frames/fragments of knowledge (i.e., density estimates) in the privileged space and transfers them to the anomaly scoring space through “imitation” functions that use only the partial information available for test examples. Our generalization of the LUPI paradigm to unsupervised anomaly detection shepherds the field in several key directions, including (i) domain- knowledge-augmented detection using expert annotations as PI, (ii) fast detection using computationally-demanding data as PI, and (iii) early detection using “historical future” data as PI. Through extensive experiments on simulated and real datasets, we show that augmenting privileged information to anomaly detection significantly improves detection performance. We also demonstrate the promise of SPI under all three settings (i–iii); with PI capturing expert knowledge, computationally- expensive features, and future data on three real world detection tasks.

[Add paper to your calendar]  [Download paper] Reproducible Research

Best student ML paper runner up: Sqn2Vec: Learning Sequence Representation via Sequential Patterns with a Gap Constraint
Dang Nguyen (Deakin University); Wei Luo (Deakin University); Tu Nguyen (“Deakin University, Australia”); Svetha Venkatesh (Deakin University); Dinh Phung (“Deakin University, Australia”)
Time: Wednesday, 12.00

When learning sequence representations, traditional pattern-based methods often suffer from the data sparsity and high-dimensionality problems while recent neural embedding methods often fail on sequential datasets with a small vocabulary. To address these disadvantages, we propose an unsupervised method (named Sqn2Vec) which first leverages sequential patterns (SPs) to increase the vocabulary size and then learns low-dimensional continuous vectors for sequences via a neural embedding model. Moreover, our method enforces a gap constraint among symbols in sequences to obtain meaningful and discriminative SPs. Consequently, Sqn2Vec produces significantly better sequence representations than a comprehensive list of state-of-the-art baselines, particularly on sequential datasets with a relatively small vocabulary. We demonstrate the superior performance of Sqn2Vec in several machine learning tasks including sequence classification, clustering, and visualization.

Deep Learning Architecture Search by Neuro-Cell-based Evolution with Function-Preserving Mutations
Martin Wistuba (IBM Research)
Time: Wednesday, 12.20

The design of convolutional neural network architectures for a new image data set is a laborious and computational expensive task which requires expert knowledge. We propose a novel neuro-evolutionary technique to solve this problem without human interference. Our method assumes that a convolutional neural network architecture is a sequence of neuro-cells and keeps mutating them using function-preserving operations. This novel combination of approaches has several advantages. We define the network architecture by a sequence of repeating neuro-cells which reduces the search space complexity. Furthermore, these cells are possibly transferable and can be used in order to arbitrarily extend the complexity of the network. Mutations based on function-preserving operations guarantee better parameter initialization than random initialization such that less training time is required per network architecture. Our proposed method finds within 12 GPU hours neural network architectures that can achieve a classification error of about 4% and 24% with only 5.5 and 6.5 million parameters on CIFAR-10 and CIFAR-100, respectively. In comparison to competitor approaches, our method provides similar competitive results but requires orders of magnitudes less search time and in many cases less network parameters.

Session: Plenary 2
Location: Hogan
Time: Wednesday, 15:40

CentroidNet: A Deep Neural Network for Joint Object Localization and Counting
Klaas Dijkstra (NHL Stenden University of Applied Sciences); Jaap van de Loosdrecht (NHL Stenden University of Applied Sciences); Lambert Schomaker (University of Groningen); Marco Wiering (University of Groningen)
Time: Wednesday, 15.40

In precision agriculture, counting and precise localization of crops is important for optimizing crop yield. In this paper CentroidNet is introduced which is a Fully Convolutional Neural Network (FCNN) architecture specifically designed for object localization and counting. A field of vectors pointing to the nearest object centroid is trained and combined with a learned segmentation map to produce accurate object centroids by majority voting. This is tested on a crop dataset made using a UAV (drone) and on a cell-nuclei dataset which was provided by a Kaggle challenge. We define the mean Average F1 score (mAF1) for measuring the trade-off between precision and recall. CentroidNet is compared to the state-of-the-art networks YOLOv2 and RetinaNet, which share similar properties. The results show that CentroidNet obtains the best F1 score. We also explicitly show that CentroidNet can seamlessly switch between patches of images and full-resolution images without the need for retraining.

Deep Modular Multimodal Fusion on Multiple Sensors for Volcano Activity Recognition
Hiep V. Le (Tokyo Institute of Technology); Tsuyoshi Murata (Tokyo Institute of Technology)
Time: Wednesday, 16.00

Nowadays, with the development of sensor techniques and the growth in a number of volcanic monitoring systems, more and more data about volcanic sensor signals are gathered. This results in a need for mining these data to study the mechanism of the volcanic eruption. This paper focuses on Volcano Activity Recognition (VAR) where the inputs are multiple sensor data obtained from the volcanic monitoring system in the form of time series. And the output of this research is the volcano status which is either explosive or not explosive. It is hard even for experts to extract handcrafted features from these time series. To solve this problem, we propose a deep neural network architecture called VolNet which adapts Convolutional Neural Network for each time series to extract non-handcrafted feature representation which is considered powerful to discriminate between classes. By taking advantages of VolNet as building block, we propose a simple but effective fusion model called Deep Modular Multimodal Fusion (DMMF) which adapts data grouping as the guidance to design the architecture of fusion model. Different from conventional multimodal fusion where the features are concatenated all at once at the fusion step, DMMF fuses relevant modalities in different modules separately in a hierarchical fashion. We conducted extensive experiments to demonstrate the effectiveness of VolNet and DMMF on the volcanic sensor datasets obtained from Sakurajima volcano, which are the biggest volcanic sensor datasets in Japan. The experiments showed that DMMF outperformed the current state-of-the-art fusion model with the increase of F-score up to 1.9% on average.

Session: Plenary 3
Location: Hogan
Time: Wednesday, 16:00

Analyzing concept drift and shift from sample data
Geoffrey I. Webb (Monash University), Loong Kuan Lee (Monash University), Francois Petitjean (Monash University), Bart Goethals (Monash University and University of Antwerp)
Time: Wednesday, 16.00

Concept drift and shift are major issues that greatly affect the accuracy and reliability of many real-world applications of machine learning. We propose a new data mining task, concept drift mapping – the description and analysis of instances of concept drift or shift. We argue that concept drift mapping is an essential prerequisite for tackling concept drift and shift. We propose tools for this purpose, arguing for the importance of quantitative descriptions of drift and shift in marginal distributions. We present quantitative concept drift mapping techniques, along with methods for visualizing their results. We illustrate their effectiveness for real-world applications across energy- pricing, vegetation monitoring and airline scheduling.

Inverse Reinforcement Learning from Incomplete Observation Data
Antti Kangasrääsiö (Aalto University), Samuel Kaski (Aalto University)
Time: Wednesday, 16.20

Inverse reinforcement learning (IRL) aims to explain observed strategic behavior by fitting reinforcement learning models to behavioral data. However, traditional IRL methods are only applicable when the observations are in the form of state-action paths. This assumption may not hold in many real-world modeling settings, where only partial or summarized observations are available. In general, we may assume that there is a summarizing function $\sigma$, which acts as a filter between us and the true state-action paths that constitute the demonstration. Some initial approaches to extending IRL to such situations have been presented, but with very specific assumptions about the structure of $\sigma$, such as that only certain state observations are missing. This paper instead focuses on the most general case of the problem, where no assumptions are made about the summarizing function, except that it can be evaluated. We demonstrate that inference is still possible. The paper presents exact and approximate inference algorithms that allow full posterior inference, which is particularly important for assessing parameter uncertainty in this challenging inference situation. Empirical scalability is demonstrated to reasonably sized problems, and practical applicability is demonstrated by estimating the posterior for a cognitive science RL model based on an observed user’s task completion time only.

How Your Supporters and Opponents Define Your Interestingness
Bruno Cremilleux (Université de Caen Normandie); Arnaud Giacometti (University Francois Rabelais of Tours, France); Arnaud Soulet (University of Tours, France)
Time: Wednesday, 16.40

How can one determine whether a data mining method extracts interesting patterns? The paper deals with this core question in the context of unsupervised problems with binary data. We formalize the quality of a data mining method by identifying patterns – the supporters and opponents – which are related to a pattern extracted by a method.We define a typology offering a global picture of the methods based on two complementary criteria to evaluate and interpret their interests. The quality of a data mining method is quantified via an evaluation complexity analysis based on the number of supporters and opponents of a pattern extracted by the method. We provide an experimental study on the evaluation of the quality of the methods.

Learning Cheap and Novel Flight Itineraries
Dmytro Karamshuk (Skyscanner); David Matthews (Skyscanner)
Time: Wednesday, 17.00

We consider the problem of efficiently constructing cheap and novel round trip flight itineraries by combining legs from different airlines. We analyse the factors that contribute towards the price of such itineraries and find that many result from the combination of just 30% of airlines and that the closer the departure of such itineraries is to the user’s search date the more likely they are to be cheap. We use these insights to formulate the problem as a trade-off between the recall of cheap itinerary constructions and the costs associated with building them. We propose a supervised learning solution with location embeddings which achieves an AUC=80.48, a substantial improvement over simpler baselines. We discuss various practical considerations for dealing with the staleness and the stability of the model and present the design of the machine learning pipeline. Finally, we present an analysis of the model’s performance in production and its impact on Skyscanner’s users.

Neural Article Pair Modeling for Wikipedia Sub-article Matching
Muhao Chen (UCLA); ChangPing Meng (Prudue University); Gang Huang (Google Inc.); Carlo Zaniolo (UCLA, USA)
Time: Wednesday, 17.20

Nowadays, editors tend to separate different subtopics of a long Wikipedia article into multiple sub-articles. This separation seeks to improve human readability. However, it also has a deleterious effect on many Wikipedia-based tasks that rely on the article-as-concept assumption, which requires each entity (or concept) to be described solely by one article. This underlying assumption significantly simplifies knowledge representation and extraction, and it is vital to many existing technologies such as automated knowledge base construction, cross-lingual knowledge alignment, semantic search and data lineage of Wikipedia entities. In this paper we provide an approach to match the scattered sub-articles back to their corresponding main-articles, with the intent of facilitating automated Wikipedia curation and processing. The proposed model adopts a hierarchical learning structure that combines multiple variants of neural document pair encoders with a comprehensive set of explicit features. A large crowdsourced dataset is created to support the evaluation and feature extraction for the task. Based on the large dataset, the proposed model achieves promising results of cross-validation and significantly outperforms previous approaches. Large-scale serving on the entire English Wikipedia also proves the practicability and scalability of the proposed model by effectively extracting a vast collection of newly paired main and sub-articles.

Thursday Papers by Session

Session: Best student DM paper
Location: Hogan
Time: Thursday, 10:00

Anytime Subgroup Discovery in Numerical Domains with Guarantees
Aimene Belfodil (INSA Lyon, CNRS, LIRIS); Adnene Belfodil (INSA Lyon, CNRS, LIRIS); Mehdi Kaytoue (Infologic)
Time: Thursday, 11.00

Subgroup discovery is the task of discovering patterns that accurately discriminate a class label from the others. Existing approaches can uncover such patterns either through an exhaustive or an approximate exploration of the pattern search space. However, an exhaustive exploration is generally unfeasible whereas approximate approaches do not provide guarantees bounding the error of the best pattern quality nor the exploration progression (“How far are we of an exhaustive search”). We design here an algorithm for mining numerical data with three key properties w.r.t. the state of the art: (i) It yields progressively interval patterns whose quality improves over time; (ii) It can be interrupted anytime and always gives a guarantee bounding the error on the top pattern quality and (iii) It always bounds a distance to the exhaustive exploration. After reporting experimentations showing the effectiveness of our method, we discuss its generalization to other kinds of patterns.

[Add paper to your calendar] [Download paper] Reproducible Research

Session: Graphs 3
Location: Hogan
Time: Thursday, 11:00

Graph sampling with applications to estimating the number of pattern embeddings and the parameters of a statistical relational model
Irma Ravkic (KU Leuven), Martin Žnidaršič (Jožef Stefan Institute), Jan Ramon (KU Leuven), Jesse Davis (KU Leuven)
Time: Thursday, 11.00

Counting the number of times a pattern occurs in a database is a fundamental data mining problem. It is a subroutine in a diverse set of tasks ranging from pattern mining to supervised learning and probabilistic model learning. While a pattern and a database can take many forms, this paper focuses on the case where both the pattern and the database are graphs (networks). Unfortunately, in general, the problem of counting graph occurrences is #P-complete. In contrast to earlier work, which focused on exact counting for simple (i.e., very short) patterns, we present a sampling approach for estimating the statistics of larger graph pattern occurrences. We perform an empirical evaluation on synthetic and real-world data that validates the proposed algorithm, illustrates its practical behavior and provides insight into the trade-off between its accuracy of estimation and computational efficiency.

Social-Affiliation Networks: Patterns and the SOAR Model
Dhivya Eswaran (Carnegie Mellon University); Reihaneh Rabbany (McGill University); Artur Dubrawski (CMU); Christos Faloutsos (CMU, Pittsburgh)
Time: Thursday, 11.20

Given a social-affiliation network – a friendship graph where users have many, binary attributes e.g., check-ins, page likes or group memberships – what rules do its structural properties such as edge or triangle counts follow, in relation to its attributes? More challengingly, how can we synthetically generate realistic networks which provably satisfy those rules or patterns? Our work attempts to answer the above closely-related questions in the context of the increasingly prevalent social-affiliation graphs. Our contributions are two-fold: (a) Patterns: we discover three new rules (power laws) in the properties of attribute-induced subgraphs, substructures which connect the friendship structure to affiliations; (b) Model: we propose SOAR– short for SOcial-Affiliation graphs via Recursion – a stochastic model based on recursion and self-similarity, to provably generate graphs obeying the observed patterns. Experiments show that: (i) the discovered rules are useful in detecting deviations as anomalies and (ii) SOAR is fast and scales linearly with network size, producing graphs with millions of edges and attributes in only a few seconds.

[Add paper to your calendar]  [Download paper] Reproducible Research

Semi-Supervised Blockmodelling with Pairwise Guidance
Mohadeseh Ganji (The University pf Melbourne); Jeffrey Chan (RMIT University, Australia); Peter. J Stuckey (The University of Melbourne); James Bailey (THE UNIVERSITY OF MELBOURNE); Christopher Leckie (University of Melbourne); Kotagiri Ramamohanarao (The University of Melbourne); Laurence Park (Western Sydney University)
Time: Thursday, 11.40

Blockmodelling is an important technique for detecting underlying patterns in graphs. Existing blockmodelling algorithms are unsupervised and cannot take advantage of the existing information that might be available about objects that are known to be similar. This background information can help finding complex patterns, such as hierarchical or ring blockmodel structures, which are difficult for traditional blockmodelling algorithms to detect. In this paper, we propose a new semi-supervised framework for blockmodelling, which allows background information to be incorporated in the form of pairwise membership information. Our proposed framework is based on the use of Lagrange multipliers and can be incorporated into existing iterative blockmodelling algorithms, enabling them to find complex blockmodel patterns in graphs. We demonstrate the utility of our framework for discovering complex patterns, via experiments over a range of synthetic and real data sets.

[Add paper to your calendar]  [Download paper] Reproducible Research

Discovering Urban Travel Demands through Dynamic Zone Correlation in Location-Based Social Networks
WANGSU HU (Rutgers); ZIJUN YAO (Rutgers); Sen Yang (Rutgers University); Shuhong Chen (Rutgers University); Peter J Jin (Rutgers University)
Time: Thursday, 12.00

Location-Based Social Networks (LBSN), which enable mobile users to announce their locations by checking-in to Points-of-Interests (POI), has accumulated a huge amount of user-POI interaction data. Compared to traditional sensor data, check-in data provides the much-needed information about trip purpose, which is critical to motivate human mobility but was not available for travel demand studies. In this paper, we aim to exploit the rich check-in data to model dynamic travel demands in urban areas, which can support a wide variety of mobile business solutions. Specifically, we first profile the functionality of city zones using the categorical density of POIs. Second, we use a Hawkes Process-based State-Space formulation to model the dynamic trip arrival patterns based on check-in arrival patterns. Third, we developed a joint model that integrates Pearson Product-Moment Correlation (PPMC) analysis into zone gravity modeling to perform dynamic Origin-Destination (OD) prediction. Last, we validated our methods using real-world LBSN and transportation data of New York City. The experimental results demonstrate the effectiveness of the proposed method for modeling dynamic urban travel demands. Our method achieves a significant improvement on OD prediction compared to baselines.

Similarity Modeling on Heterogeneous Networks via Automatic Path Discovery
Carl Yang (University of Illinois, Urbana Champaign); Mengxiong Liu (University of Illinois, Urbana Champaign); Frank S. He (University of Illinois, Urbana Champaign); Xikun Zhang (University of Illinois, Urbana Champaign); Jian Peng (UIUC); Jiawei Han (UIUC)
Time: Thursday, 12.20

Heterogeneous networks are widely used to model real-world semi-structured data. The key challenge of learning over such networks is the modeling of node similarity under both network structures and contents. To deal with network structures, most existing works assume a given or enumerable set of meta-paths and then leverage them for the computation of meat-path-based proximities or network embeddings. However, expert knowledge for given meta-paths is not always available, and as the length of considered meta-paths increases, the number of possible paths grows exponentially, which makes the path searching process very costly. On the other hand, while there are often rich contents around network nodes, they have hardly been leveraged to further improve similarity modeling. In this work, to properly model node similarity in content-rich heterogeneous networks, we propose to automatically discover useful paths for pairs of nodes under both structural and content information. To this end, we combine continuous reinforcement learning and deep content embedding into a novel semi-supervised joint learning framework. Specifically, the supervised reinforcement learning component explores useful paths between a small set of example similar pairs of nodes, while the unsupervised deep embedding component captures node contents and enables inductive learning on the whole network. The two components are jointly trained in a closed loop to mutually enhance each other. Extensive experiments on three real-world heterogeneous networks demonstrate the supreme advantages of our algorithm.

[Add paper to your calendar]  [Download paper] Reproducible Research

Session: Foundations of Machine Learning and Data Mining
Location: Hogan Mezz 1
Time: Thursday, 11:00

Linear regression for uplift modeling
Krzysztof Rudaś (Warsaw University of Technology), Szymon Jaroszewicz (Polish Academy of Sciences)
Time: Thursday, 11.00

The purpose of statistical modeling frequently is the selection oftargets for some action such as a medical treatment or a marketing campaign. Unfortunately classical machine learning algorithms are not well suited to this task since they predict the results afterthe action and not its causal impact. The answer to this problem isuplift modeling which in addition to the usual training set describing objects on which the action was taken uses an additional control group of objects not subjected to the action and predicts the true effect ofthe action on a given individual as the difference between responsesin both groups. This paper analyzes theoretically in the context oflinear regression two common uplift modeling approaches, one based on the use of two separate models and the other based on target variable transformation. Adapting the second estimator to the problem of regression is another contribution of the paper. We identify thesituations when each model performs best and, contrary to several claims in literature, show that the double model approach hasfavorable theoretical properties and often performs well in practice. Finally based on our analysis we propose a third model which combines the benefits of both approaches and seems to be the model of choicefor uplift linear regression.

Global Multi-output Decision Trees for Interaction Prediction
Konstantinos Pliakos (KU Leuven), Pierre Geurts (University of Liège), Celine Vens (KU Leuven)
Time: Thursday, 11.20

Interaction data are characterized by two sets of objects, each described by their own set of features. They are often modeled as networks and the values of interest are the possible interactions between two instances, represented usually as a matrix. Here, a novel global decision tree learning method is proposed, where multi-output decision trees are constructed over the global interaction setting, addressing the problem of interaction prediction as a multi-label classification task. More specifically, the tree is constructed by splitting the interaction matrix both row-wise and column-wise, incorporating this way both interaction dataset features in the learning procedure. Experiments are conducted across several heterogeneous interaction datasets from the biomedical domain. The experimental results indicate the superiority of the proposed method against other decision tree approaches in terms of predictive accuracy, model size and computational efficiency. The performance is boosted by fully exploiting the multi-output structure of the model. We conclude that the proposed method should be considered in interaction prediction tasks, especially where interpretable models are desired.

MetaBags: Bagged Meta-Decision Trees for Regression
Jihed Khiari (NEC Laboratories Europe); Luis Matias (NEC Laboratories Europe); Ammar Shaker (NEC Laboratories Europe); Saso Dzeroski (Jozef Stefan Institute, Ljubljana, Slovenia); Bernard Zenko (Jozef Stefan Institute)
Time: Thursday, 11.40

Methods for learning heterogeneous regression ensembles have not yet been proposed on a large scale. Hitherto, in classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well established homogeneous ensemble schemas such as bagging or boosting. In this paper, we introduce MetaBags, a novel stacking framework for regression. MetaBags learns a set of meta-decision trees designed to select one base model (i.e. expert) for each query, and focuses on inductive bias reduction. Finally, these predictions are aggregated into a single prediction through a bagging procedure at meta-level. MetaBags is designed to learn a model with a fair bias-variance trade-off, and its improvement over base model performance is correlated with the prediction diversity of different experts on specific input space subregions. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach on open, synthetic and real-world application datasets. The obtained results show that our method outperforms existing state-of-the-art approaches.

Mining Tree Patterns with Partially Injective Homomorphisms
Till Schulz (University of Bonn); Tamas Horvath (University of Bonn and Fraunhofer IAIS); Pascal Welke (University of Bonn); Stefan Wrobel (Fraunhofer IAIS)
Time: Thursday, 12.00

One of the main differences between inductive logic programming (ILP) and graph mining lies in the pattern matching operator applied: While it is mainly defined by relational homomorphism (i.e., subsumption) in ILP, subgraph isomorphism is the most common pattern matching operator in graph mining. Using the fact that subgraph isomorphisms are injective homomorphisms, we bridge the gap between ILP and graph mining by considering a natural transition from homomorphisms to subgraph isomorphisms that is defined by partially injective homomorphisms, i.e., which require injectivity only for subsets of the vertex pairs in the pattern. Utilizing positive complexity results on deciding homomorphisms from bounded tree-width graphs, we present an algorithm mining frequent trees from arbitrary graphs w.r.t. partially injective homomorphisms. Our experimental results show that the predictive performance of the patterns obtained is comparable to that of ordinary frequent subgraphs. Thus, by preserving much from the advantageous properties of homomorphisms and subgraph isomorphisms, our approach provides a trade-off between efficiency and predictive power.

Causal Inference on Multivariate and Mixed-Type Data
Alexander Marx (Max-Planck-Institut for Informatics); Jilles Vreeken (Max Planck Institute for Informatics and Saarland University)
Time: Thursday, 12.20

How can we discover whether X causes Y, or vice versa, that Y causes X, when we are only given a sample over their joint distribution? How can we do this such that X and Y can be univariate, multivariate, or of different cardinalities? And, how can we do so regardless of whether X and Y are of the same, or of different data type, be it discrete, numeric, or mixed? These are exactly the questions we answer in this paper. We take an information theoretic approach, based on the Minimum Description Length principle, from which it follows that first describing the data over cause and then that of effect given cause is shorter than the reverse direction. Simply put, if Y can be explained more succinctly by a set of classification or regression trees conditioned on X, than in the opposite direction, we conclude that X causes Y. Empirical evaluation on a wide range of data shows that our method, Crack, infers the correct causal direction reliably and with high accuracy on a wide range of settings, outperforming the state of the art by a wide margin.

[Add paper to your calendar]  [Download paper] Reproducible Research

Session: Adversarial Learning
Location: Hogan Mezz 2
Time: Thursday, 11:00

Image Anomaly Detection with Generative Adversarial Networks
Lucas Deecke (University of Edinburgh); Robert Vandermeulen (TU Kaiserslautern); Lukas Ruff (HPI); Stephan Mandt (Disney Research); Marius Kloft (TU Kaiserslautern)
Time: Thursday, 11.00

Many anomaly detection methods exist that perform well on low-dimensional problems however there is a notable lack of effective methods for high-dimensional spaces, such as images. Inspired by recent successes in deep learning we propose a novel approach to anomaly detection using generative adversarial networks. Given a sample under consideration, our method is based on searching for a good representation of that sample in the latent space of the generator; if such a representation is not found, the sample is deemed anomalous. We achieve state-of-the-art performance on standard image benchmark datasets and visual inspection of the most anomalous samples reveals that our method does indeed return anomalies.

Image-to-Markup Generation via Paired Adversarial Learning
Jin-Wen Wu (Institute of Automation of Chinese Academy of Sciences); Fei yin (Institute of Automation of Chinese Academy of Sciences); Yanming Zhang (Institute of Automation of Chinese Academy of Sciences); Xu-Yao Zhang (Institute of Automation of Chinese Academy of Sciences); Cheng-lin liu (Institute of Automation of Chinese Academy of Sciences)
Time: Thursday, 11.20

Motivated by the fact that humans can grasp semantic-invariant features shared by the same category while attention-based models focus mainly on discriminative features of each object, we propose a scalable paired adversarial learning (PAL) method for image-to-markup generation. PAL can incorporate the prior knowledge of standard templates to guide the attention-based model for discovering semantic-invariant features when the model pays attention to regions of interest. Furthermore, we also extend the convolutional attention mechanism to speed up the image-to-markup parsing process while achieving competitive performance compared with recurrent attention models. We evaluate the proposed method in the scenario of handwritten-image-to-LaTeX generation, i.e., converting handwritten mathematical expressions to LaTeX. Experimental results show that our method can significantly improve the generalization performance over standard attention-based encoder-decoder models.

Toward an Understanding of Adversarial Examples in Clinical Trials
Konstantinos Papangelou (The University of Manchester); Konstantinos Sechidis (The University of Manchester); James Weatherall (AstraZeneca); Gavin Brown (The University Of Manchester)
Time: Thursday, 11.40

Deep learning systems can be fooled by small, worst-case perturbations of their inputs, known as adversarial examples. This has been almost exclusively studied in supervised learning, on vision tasks. However, adversarial examples in counterfactual modelling, which sits outside the traditional supervised scenario, is an overlooked challenge. We introduce the concept of adversarial patients, in the context of counterfactual models for clinical trials-this turns out to introduce several new dimensions to the literature. We describe how there exist multiple types of adversarial example-and demonstrate different consequences, e.g. ethical, when they arise. The study of adversarial examples in this area is rich in challenges for accountability and trustworthiness in ML-we highlight future directions that may be of interest to the community.

ShapeShifter: Robust Physical Adversarial Attack on Faster R-CNN Object Detector
Shang-Tse Chen (Georgia Institute of Technology); Cory Cornelius (Intel Corporation); Jason Martin (Intel Corporation); Duen Horng Chau (Georgia Institute of Technology)
Time: Thursday, 12.00

Given the ability to directly manipulate image pixels in the digital input space, an adversary can easily generate imperceptible perturbations to fool a Deep Neural Network (DNN) image classifier, as demonstrated in prior work. In this work, we propose ShapeShifter, an attack that tackles the more challenging problem of crafting physical adversarial perturbations to fool image-based object detectors like Faster R-CNN. Attacking an object detector is more difficult than attacking an image classifier, as it needs to mislead the classification results in multiple bounding boxes with different scales. Extending the digital attack to the physical world adds another layer of difficulty, because it requires the perturbation to be robust enough to survive real-world distortions due to different viewing distances and angles, lighting conditions, and camera limitations. We show that the Expectation over Transformation technique, which was originally proposed to enhance the robustness of adversarial perturbations in image classification, can be successfully adapted to the object detection setting. ShapeShifter can generate adversarially perturbed stop signs that are consistently mis-detected by Faster R-CNN as other objects, posing a potential threat to autonomous vehicles and other safety-critical computer vision systems.

[Add paper to your calendar]  [Download paper] Reproducible Research

Group Anomaly Detection using Deep Generative Models
Edward Toth (University of Sydney); RAGHAVENDRA CHALAPATHY (University of Sydney); Sanjay Chawla (QCRI)
Time: Thursday, 12.20

Unlike conventional anomaly detection research that focuses on point anomalies, our goal is to detect anomalous collections of individual data points. In particular, we perform group anomaly detection (GAD) with an emphasis on irregular group distributions (e.g. irregular mixtures of image pixels). GAD is an important task in detecting unusual and anomalous phenomena in real-world applications such as high energy particle physics, healthcare, social media and medical imaging. In this paper, we take a generative approach by proposing deep generative models: Adversarial autoencoder (AAE) and Variational autoencoder (VAE) for group anomaly detection. Both AAE and VAE detect group anomalies using point-wise input data where group memberships are known a priori. We conduct extensive experiments to evaluate our models on real world datasets. The empirical results demonstrate that our approach is effective and robust in detecting group anomalies.

Session: Evaluation
Location: Davin
Time: Thursday, 11:00

Evaluation procedures for forecasting with spatio-temporal data
Mariana Oliveira (FCUP / INESC TEC); Luis Torgo (Dalhousie University); Vitor Santos Costa (UP)
Time: Thursday, 11.00

The amount of available spatio-temporal data has been increasing as large-scale data collection (e.g., from geosensor networks) becomes more prevalent. This has led to an increase in spatio-temporal forecasting applications using geo-referenced time series data motivated by important domains such as environmental monitoring (e.g., air pollution index, forest fire risk prediction). Being able to properly assess the performance of new forecasting approaches is fundamental to achieve progress. However, the dependence between observations that the spatio-temporal context implies, besides being challenging in the modelling step, also raises issues for performance estimation as indicated by previous work. In this paper, we empirically compare several variants of cross-validation (CV) and out-of-sample (OOS) performance estimation procedures that respect data ordering, using both artificially generated and real-world spatio-temporal data sets. Our results show both CV and OOS reporting useful estimates. Further, they suggest that blocking may be useful in addressing CV’s bias to underestimate error. OOS can be very sensitive to test size, as expected, but estimates can be improved by careful management of the temporal dimension in training.

[Add paper to your calendar]  [Download paper] Reproducible Research

Efficient estimation of AUC in a sliding window
Nikolaj Tatti (Aalto University)
Time: Thursday, 11.20

In many applications, monitoring area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window of length k. More specifically, we propose an algorithm that, given epsilon, estimates AUC within epsilon / 2, and can maintain this estimate in O((log k) / \epsilon) time, per update, as the window slides. This provides a speed-up over the exact computation of AUC, which requires O(k) time, per update. The speed-up becomes more significant as the size of the window increases. Our estimate is based on grouping the data points together, and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough, so that the error stays small, (ii) the number of groups is small, so that enumerating them is not expensive, and (iii) the definition is flexible enough so that we can maintain the groups efficiently. Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee epsilon / 2, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.

[Add paper to your calendar]  [Download paper] Reproducible Research

A Blended Metric for Multi-label Optimisation and Evaluation
Laurence Park (Western Sydney University); Jesse Read (Ecole Polytechnique)
Time: Thursday, 11.40

In multi-label classification, a large number of evaluation metrics exist, for example Hamming loss, exact match, and Jaccard similarity — but there are many more. In fact, there remains an apparent uncertainty in the multi-label literature about which metrics should be considered and when and how to optimise them. This has given rise to a proliferation of metrics, with some papers carrying out empirical evaluations under 10 or more different metrics in order to analyse method performance. We argue that further understanding of underlying mechanisms is necessary. In this paper we tackle the challenge of having a clearer view of evaluation strategies. We present a blended loss function. This function allows us to evaluate under the properties of several major loss functions with a single parameterisation. Furthermore we demonstrate the successful use of this metric as a surrogate loss for other metrics. We offer experimental investigation and theoretical backing to demonstrate that optimising this surrogate loss offers best results for several different metrics than optimising the metrics directly. It simplifies and provides insight to the task of evaluating multi-label prediction methodologies.

[Add paper to your calendar]  [Download paper] Reproducible Research

Visualizing the Feature Importance for Black Box Models
Giuseppe Casalicchio (University LMU); Bernd Bischl (LMU); Christoph Molnar (LMU)
Time: Thursday, 12.00

In recent years, a large amount of model-agnostic methods to improve the transparency, trustability and interpretability of machine learning models have been developed. Based on a recent method for model-agnostic global feature importance, we introduce a local feature importance measure for individual observations and propose two visual tools: partial importance (PI) and individual conditional importance (ICI) plots which visualize how changes in a feature affect the model performance on average, as well as for individual observations. Our proposed methods are related to partial dependence (PD) and individual conditional expectation (ICE) plots, but visualize the expected (conditional) feature importance instead of the expected (conditional) prediction. Furthermore, we show that averaging ICI curves across observations yields a PI curve, and integrating the PI curve with respect to the distribution of the considered feature results in the global feature importance. Another contribution of our paper is the Shapley feature importance, which fairly distributes the overall performance of a model among the features according to the marginal contributions and which can be used to compare the feature importance across different models.

[Add paper to your calendar]  [Download paper] Reproducible Research

Controlling and visualizing the precision-recall tradeoff for external performance indices
Blaise HANCZAR (Université d Evry); Mohamed Nadif (LIPADE – Université Paris Descartes)
Time: Thursday, 12.20

In many machine learning problems, the performance of the results is measured by indices that often combine precision and recall. In this paper, we study the behavior of such indices in function of the tradeoff precision-recall. We present a new tool of performance visualization and analysis referred to the tradeoff space, which plots the performance index in function of the precision-recall tradeoff. We analyse the properties of this new space and show its advantages over the precision-recall space.

Session: ADS E-commerce
Location: Nally
Time: Thursday, 11:00

Discovering Bayesian Market Views for Intelligent Asset Allocation
Frank Xing (Nanyang Technological University); Erik Cambria (SENTIC); Lorenzo Malandri (Polimi); Carlo Vercellis (Polimi)
Time: Thursday, 11.00

Along with the advance of opinion mining techniques, public mood has been found to be a key element for stock market prediction. However, how market participants’ behavior is affected by public mood has been rarely discussed. Consequently, there has been little progress in leveraging public mood for the asset allocation problem, which is preferred in a trusted and interpretable way. In order to address the issue of incorporating public mood analyzed from social media, we propose to formalize public mood into market views, because market views can be integrated into the modern portfolio theory. In our framework, the optimal market views will maximize returns in each period with a Bayesian asset allocation model. We train two neural models to generate the market views, and benchmark the model performance on other popular asset allocation strategies. Our experimental results suggest that the formalization of market views significantly increases the profitability (5% to 10% annually) of the simulated portfolio at a given risk level.

Speeding up the Metabolism in E-commerce by Reinforcement Mechanism Design
Hualin He (Alibaba Group); Chun-Xiang Pan (Alibaba company); Qing Da (Alibaba Group); An-Xiang Zeng (Alibaba)
Time: Thursday, 11.20

In a large E-commerce platform, all the participants compete for impressions under the allocation mechanism of the platform. Existing methods mainly focus on the short-term return based on the current observations instead of the long-term return. In this paper, we formally establish the lifecycle model for products, by defining the introduction, growth, maturity and decline stages and their transitions throughout the whole life period. Based on such model, we further propose a reinforcement learning based mechanism design framework for impression allocation, which incorporates the first principal component based permutation and the novel experiences generation method, to maximize short-term as well as long-term return of the platform. With the power of trial-and-error, it is possible to recognize in advance the potentially hot products in the introduction stage as well as the potentially slow-selling products in the decline stage, so the metabolism can be speeded up by an optimal impression allocation strategy. We evaluate our algorithm on a simulated environment built based on one of the largest E-commerce platforms, and a significant improvement has been achieved in comparison with the baseline solutions.

Intent-aware Audience Targeting for Ride-hailing Service
yuan xia (baidu); jingbo zhou (baidu)
Time: Thursday, 11.40

As the market for ride-hailing service is increasing dramatically, an efficient audience targeting system (which aims to identify a group of recipients for a particular message) for ride-hailing services is demanding for marketing campaigns. In this paper, we describe the details of our deployed system for intent-aware audience targeting on Baidu Maps for ride-hailing services. The objective of the system is to predict user intent for requesting a ride and then send corresponding coupons to the user. For this purpose, we develop a hybrid model to combine the LSTM model and GBDT model together to handle sequential map query data and heterogeneous non-sequential data, which leads to a significant improvement in the performance of the intent prediction. We verify the effectiveness of our method over a large real-world dataset and conduct a large-scale online marketing campaign over Baidu Maps app. We present an in-depth analysis of the model’s performance and trade-offs. Both offline experiment and online marketing campaign evaluation show that our method has a consistently good performance in predicting user intent for a ride request and can significantly increase the click-through rate (CTR) of vehicle coupon targeting compared with baseline methods.

A Practical Deep Online Ranking System in E-commerce Recommendation
Yan Yan (JD.com); Zitao Liu (TAL AI Lab); Meng Zhao (JD.com); Wentao Guo (JD.com); Weipeng Yan (JD.com); Yongjun Bao (JD.com)
Time: Thursday, 12.00

User online shopping experience in modern e-commerce websites critically relies on real-time personalized recommendations. However, building a productionized recommender system still remains challenging due to a massive collection of items, a huge number of online users, and requirements for recommendations to be responsive to user actions. In this work, we present our relevant, responsive, and scalable deep online ranking system (DORS) that we developed and deployed in our company. DORS is implemented in a three-level architecture which includes (1) candidate retrieval that retrieves a board set of candidates with various business rules enforced; (2) deep neural network ranking model that takes advantage of available user and item specific features and their interactions; (3) multi-arm bandits based online re-ranking that dynamically takes user real-time feedback and re-ranks the final recommended items in scale. Given a user as a query, DORS is able to precisely capture users’ real-time purchasing intents and help users reach to product purchases. Both offline and online experimental results show that DORS provides more personalized online ranking results and makes more revenue.

A Recurrent Neural Network Survival Model: Predicting Web User Return Time
Georg L. Grob (Imperial College London); Ângelo Cardoso (ASOS.com); C. H. Bryan Liu (ASOS.com); Duncan A. Little (ASOS.com); Benjamin Paul Chamberlain (Imperial College London; ASOS.com)
Time: Thursday, 12.20

The size of a website’s active user base directly affects its value. Thus, it is important to monitor and influence a user’s likelihood to return to a site. Essential to this is predicting when a user will return. Current state of the art approaches to solve this problem come in two flavors: (1) Recurrent Neural Network (RNN) based solutions and (2) survival analysis methods. We observe that both techniques are severely limited when applied to this problem. Survival models can only incorporate aggregate representations of users instead of automatically learning a representation directly from a raw time series of user actions. RNNs can automatically learn features, but can not be directly trained with examples of non-returning users who have no target value for their return time. We develop a novel RNN survival model that removes the limitations of the state of the art methods. We demonstrate that this model can successfully be applied to return time prediction on a large e-commerce dataset with a superior ability to discriminate between returning and non-returning users than either method applied in isolation.

Implicit Linking of Food Entities in Social Media
Wen-Haw Chong (Singapore Management University); Ee-peng Lim (Singapore Management University)
Time: Thursday, 12.40

Dining is an important part in people’s lives and this explains why food-related microblogs and reviews are popular in social media. Identifying food entities in food-related posts is important to food lover profiling and food (or restaurant) recommendations. In this work, we conduct Implicit Entity Linking (IEL) to link food-related posts to food entities in a knowledge base. In IEL, we link posts even if they do not contain explicit entity mentions. We first show empirically that food venues are entity-focused and associated with a limited number of food entities each. Hence same-venue posts are likely to share common food entities. Drawing from these findings, we propose an IEL model which incorporates venue-based query expansion of test posts and venue-based prior distributions over entities. In addition, our model assigns larger weights to words that are more indicative of entities. Our experiments on Instagram captions and food reviews shows our proposed model to outperform competitive baselines.

Session: Recommender Systems
Location: Hogan
Time: Thursday, 14:00

Learning Multi-granularity Dynamic Network Representations for Social Recommendation
Peng Liu (Norwegian University of Science and Technology); Lemei Zhang (Norwegian University of Science and Technology); Jon Atle Gulla (Norwegian University of Science and Technology)
Time: Thursday, 14.00

With the rapid proliferation of online social networks, personalized social recommendation has become an important means to help people discover useful information over time. However, the cold-start issue and the special properties of social networks, such as rich temporal dynamics, heterogeneous and complex structures with millions of nodes, render the most commonly used recommendation approaches (e.g. Collaborative Filtering) inefficient. In this paper, we propose a novel multi-granularity dynamic network embedding (m-DNE) model for the social recommendation which is capable of recommending relevant users and interested items. In order to support online recommendation, we construct a heterogeneous user-item (HUI) network and incrementally maintain it as the social network evolves. m-DNE jointly captures the temporal semantic effects, social relationships and user behavior sequential patterns in a unified way by embedding the HUI network into a shared low dimensional space. Meanwhile, multi-granularity proximities which include the second-order proximity and the community-aware high-order proximity of nodes, are introduced to learn more informative and robust network representations. Then, with an efficient search method, we use the encoded representation of temporal contexts to generate recommendations. Experimental results on several real large-scale datasets show its advantages over other state-of-the-art methods.

GeoDCF: Deep Collaborative Filtering with Multifaceted Contextual Information in Location-based Social Networks
Dimitrios Rafailidis (University of Mons); Fabio Crestani (Universita della Svizzera italiana)
Time: Thursday, 14.20

In this study we investigate the recommendation problem with multifaceted contextual information to overcome the scarcity of users’ check-in data in Location-based Social Networks. To generate accurate personalized Point-of-Interest (POI) recommendations in the presence of data scarcity, we account for both users’ and POIs’ contextual information such as the social influence of friends, as well as the geographical and sequential transition influence of POIs on user’s check-in behavior. We first propose a multi-view learning strategy to capture the multifaceted contextual information of users and POIs along with users’ check-in data. Then, we feed the learned user and POI latent vectors to a deep neural framework, to capture their non-linear correlations. Finally, we formulate the objective function of our geo-based deep collaborative filtering model (GeoDCF) as a Bayesian personalized ranking problem to focus on the top-k recommendation task and we learn the parameters of our model via backpropagation. Our experiments on real-world datasets confirm that GeoDCF achieves high recommendation accuracy, significantly outperforming other state-of-the-art methods. Furthermore, we confirm the influence of both users’ and POIs’ contextual information on our GeoDCF model.

POLAR: Attention-based CNN for One-shot Personalized Article Recommendation
Zhengxiao Du (Tsinghua University); Jie Tang (Tsinghua University); Yuhui Ding (Tsinghua University)
Time: Thursday, 14.40

In this paper, we propose POLAR, an attention-based CNN combined with one-shot learning for personalized article recommendation. Given a query, POLAR uses an attention-based CNN to estimate the relevance score between the query and related articles. The attention mechanism can help significantly improve the relevance estimation. For example, on AMiner, this can help achieve a +5.0% improvement in terms of NDCG@3. One more challenge in personalized article recommendation is how to collect statistically sufficient training data for a recommendation model. POLAR combines a one-shot learning function into the recommendation model, which further gains significant improvements. For example, on AMiner, with only 1.6 feedbacks on average, POLAR achieves 2.7% improvement by NDCG@3. We evaluate the proposed POLAR on three different datasets: AMiner, Patent, and RARD. Experimental results demonstrate the effectiveness of the proposed model. Recently, we have successfully deployed POLAR into AMiner as the recommendation engine for personalized article recommendation, which further confirms the effectiveness of the proposed model.

Inferring Continuous Latent Preference on Transition Intervals for Next Point-of-Interest Recommendation
Jing He (Beijing Institute of Technology); Xin Li (Beijing Institute of Technology); Lejian Liao (School of Computer Science, Beijing Institute of Technology); Mingzhong Wang (The University of the Sunshine Coast)
Time: Thursday, 15.00

Temporal information plays an important role in Point-of-Interest (POI) recommendations. Most existing studies model the temporal influence by utilizing the observed check-in time stamps explicitly. With the conjecture that transition intervals between successive check-ins may carry more information for diversified behavior patterns, we propose a probabilistic factor analysis model to incorporate three components, namely, personal preference, distance preference, and transition interval preference. They are modeled by an observed third-rank transition tensor, a distance constraint, and a continuous latent variable, respectively. In our framework, the POI recommendation and the transition interval for user’s very next move can be inferred simultaneously by maximizing the posterior probability of the overall transitions. Expectation Maximization (EM) algorithm is used to tune the model parameters. We demonstrate that the proposed methodology achieves substantial gains in terms of prediction on next move and its expected time over state-of-the-art baselines.

Personalized Thread Recommendation for MOOC Discussion Forums
Andrew Lan (Princeton University); Jonathan Spencer (Princeton University); Ziqi Chen (HKUST); Christopher Brinton (Zoomi Inc.); Mung Chiang (Purdue University)
Time: Thursday, 15.20

Social learning, i.e., students learning from each other through social interactions, has the potential to significantly scale up instruction in online education. In many cases, such as in massive open online courses (MOOCs), social learning is facilitated through discussion forums hosted by course providers. In this paper, we propose a probabilistic model for the process of learners posting on such forums, using point processes. Different from existing works, our method integrates topic modeling of the post text, timescale modeling of the decay in post excitation over time, and learner topic interest modeling into a single model, and infers this information from user data. Our method also varies the excitation levels induced by posts according to the thread structure, to reflect typical notification settings in discussion forums. We experimentally validate the proposed model on three real-world MOOC datasets, with the largest one containing up to 6,000 learners making 40,000 posts in 5,000 threads. Results show that our model excels at thread recommendation, achieving significant improvement over a number of baselines, thus showing promise of being able to direct learners to threads that they are interested in more efficiently. Moreover, we demonstrate analytics that our model parameters can provide, such as the timescales of different topic categories in a course.

Session: Deep Learning 3
Location: Hogan Mezz 1
Time: Thursday, 14:00

Using Supervised Pretraining to Improve Generalization of Neural Networks on Binary Classification Problems
Yuxuan(Alex) Peng (University of Auckland); Yun Sing Koh (The University of Auckland, New Zealand); Patricia Riddle (University of Auckland, New Zealand); Bernhard Pfahringer (University of Waikato)
Time: Thursday, 14.00

Neural networks are known to be very sensitive to the initial weights. There has been a lot of research on initialization that aims to stabilize the training process. However, very little research has studied the relationship between initialization and generalization. We demonstrate that poorly initialized model will lead to lower test accuracy. We propose a supervised pretraining technique that helps improve generalization on binary classification problems. The experimental results on four UCI datasets show that the proposed pretraining method leads to higher test accuracy compared to the he_normal initialization when the training set is small. In further experiments on synthetic data, the improvement on test accuracy using the proposed pretraining reaches more than 30% when the data has high dimensionality and noisy features.

[Add paper to your calendar]  [Download paper] Reproducible Research

MaxGain: Regularisation of Neural Networks by Constraining Activation Magnitudes
Henry Gouk (University of Waikato); Bernhard Pfahringer (University of Waikato); Eibe Frank (University of Waikato); Michael Cree (University of Waikato)
Time: Thursday, 14.20

Effective regularisation of neural networks is essential to combat overfitting due to the large number of parameters involved. We present an empirical analogue to the Lipschitz constant of a feed-forward neural network, which we refer to as the maximum gain. We hypothesise that constraining the gain of a network will have a regularising effect, similar to how constraining the Lipschitz constant of a network has been shown to improve generalisation. A simple algorithm is provided that involves rescaling the weight matrix of each layer after each parameter update. We conduct a series of studies on common benchmark datasets, and also a novel dataset that we introduce to enable easier significance testing for experiments using convolutional networks. Performance on these datasets compares favourably with other common regularisation techniques.

Ontology alignment based on word embedding and random forest classification
Ikechukwu Nkisi-Orji (Robert Gordon University)
Time: Thursday, 14.40

Ontology alignment is crucial for integrating heterogeneous data sources and forms an important component of the semantic web. Accordingly, several ontology alignment techniques have been proposed and used for discovering correspondences between the concepts (or entities) of different ontologies. Most alignment techniques depend on string-based similarities which are unable to handle the vocabulary mismatch problem. Also, determining which similarity measures to use and how to effectively combine them in alignment systems are challenges that have persisted in this area. In this work, we introduce a random forest classifier approach for ontology alignment which relies on word embedding for determining a variety of semantic similarities features between concepts. Specifically, we combine string-based and semantic similarity measures to form feature vectors that are used by the classifier model to determine when concepts match. By harnessing background knowledge and relying on minimal information from the ontologies, our approach can handle knowledge-light ontological resources. It also eliminates the need for learning the aggregation weights of a composition of similarity measures. Experiments using Ontology Alignment Evaluation Initiative (OAEI) dataset and real-world ontologies highlight the utility of our approach and show that it can outperform state-of-the-art alignment systems.

Joint autoencoders: a flexible meta-learning framework
Baruch Epstein (Technion Israeli Institute of Technology); Ron Meir (Technion Israeli Institute of Technology ); Tomer Michaeli (Technion)
Time: Thursday, 15.00

We develop a framework for learning multiple tasks simultaneously, based on sharing features that are common to all tasks, achieved through the use of a modular deep feedforward neural network consisting of shared branches, dealing with the common features of all tasks, and private branches, learning the specific unique aspects of each task. Once an appropriate weight sharing architecture has been established, learning takes place through standard algorithms for feedforward networks, e.g., stochastic gradient descent and its variations. The method deals with meta-learning (such as domain adaptation, transfer and multi-task learning) in a unified fashion, and can deal with data arising from different modalities. Numerical experiments demonstrate the effectiveness of learning in domain adaptation and transfer learning setups, and provide evidence for the flexible and task-oriented representations arising in the network. In particular, we handle transfer learning between multiple tasks in a straightforward manner, as opposed to many competing state-of-the-art methods, that are unable to handle more than two tasks. We also illustrate the network’s ability to distill task-specific and shared features.

Parametric t-Distributed Stochastic Exemplar-centered Embedding
Martin Renqiang Min (NEC Labs America Princeton); Hongyu Guo (National Research Council Canada); Dinghan Shen (Duke University)
Time: Thursday, 15.20

Parametric embedding methods such as parametric t-distributed Stochastic Neighbor Embedding (pt-SNE) enables out-of-sample data visualization without further computationally expensive optimization or approximation. However, pt-SNE favors small mini-batches to train a deep neural network but large mini-batches to approximate its cost function involving all pairwise data point comparisons, and thus has difficulty in finding a balance. To resolve the conflicts, we present parametric t-distributed stochastic exemplar-centered embedding. Our strategy learns embedding parameters by comparing training data only with precomputed exemplars to indirectly preserve local neighborhoods, resulting in a cost function with significantly reduced computational and memory complexity. Moreover, we propose a shallow embedding network with high-order feature interactions for data visualization, which is much easier to tune but produces comparable performance in contrast to a deep feedforward neural network employed by pt-SNE. We empirically demonstrate, using several benchmark datasets, that our proposed method significantly outperforms pt-SNE in terms of robustness, visual effects, and quantitative evaluations.

Session: Matrix and tensor Analysis
Location: Hogan Mezz 2
Time: Thursday, 14:00

Block CUR: Decomposing Matrices using Groups of Columns
Urvashi K Oswal (UW-Madison); Swayambhoo Jain (Technicolor Research); Kevin Xu (University of Toledo); Brian Eriksson (Adobe)
Time: Thursday, 14.00

A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined \emph{blocks} of columns (or rows) from the matrix. We present a simple algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. This algorithm has application in problems as diverse as biometric data analysis to distributed computing. This regime is commonly found when data is distributed across multiple nodes in a compute cluster, where such blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than retrieving individual columns stored across different nodes. We demonstrate the effectiveness of the proposed algorithms for computing the block CUR decomposition of large matrices in a distributed setting using Apache Spark. In the biometric data analysis case, a matrix of biometric data is collected from users reacting to external stimuli, {\em e.g.,}~watching video content. In this setting, rows correspond to different users and columns correspond to users’ biometric reaction at a particular time instant. There is significant cost in acquiring each user’s reaction to lengthy content so we sample a few important scenes to approximate the biometric response. An individual time sample in this use case cannot be queried in isolation due to the lack of context that caused that biometric reaction. Instead, collections of time segments ({\em i.e.,} blocks) must be presented to the user. The practical application of these algorithms is shown via experimental results using real-world user biometric data from a content testing environment.

Identifying and Alleviating Concept Drift in Streaming Tensor Decomposition
Ravdeep Pasricha (University of California, Riverside); Ekta Gujral (University of California, Riverside); Evangelos Papalexakis (UC Riverside)
Time: Thursday, 14.20

Tensor decompositions are used in various data mining applications from social network to medical applications and are extremely useful in discovering latent structures or concepts in the data. Many real world applications are dynamic in nature and so are their data. To deal with this dynamic nature of data, there exist a variety of online tensor decomposition algorithms. A central assumption in all those algorithms is that the number of latent concepts remains fixed throughout the entire stream. However, this need not be the case. Every incoming batch in the stream may have a different number of latent concepts, and the difference in latent concepts from one tensor batch to another can provide insights into how our findings in a particular application behave and deviate over time. In this paper, we define “concept” and “concept drift” in the context of streaming tensor decomposition, as the manifestation of the variability of latent concepts throughout the stream. Furthermore, we introduce SeekAndDestroy, an algorithm that detects concept drift in streaming tensor decomposition and is able to produce results robust to that drift. To the best of our knowledge, this is the first work that investigates concept drift in streaming tensor decomposition. We extensively evaluate SeekAndDestroy on synthetic datasets, which exhibit a wide variety of realistic drift. Our experiments demonstrate the effectiveness of SeekAndDestroy, both in the detection of concept drift and in the alleviation of its effects, producing results with similar quality to decomposing the entire tensor in one shot. Additionally, in real datasets, SeekAndDestroy outperforms other streaming baselines, while discovering novel useful components.

[Add paper to your calendar]  [Download paper] Reproducible Research

Lambert Matrix Factorization
Arto Klami (University of Helsinki); Jarkko Lagus (University of Helsinki); Joseph Sakaya (University of Helsinki)
Time: Thursday, 14.40

Many data generating processes result in skewed data, which should be modelled by distributions that can capture the skewness. In this work we adopt the flexible family of Lambert W distributions that combine arbitrary standard distribution with specific nonlinear transformation to incorporate skewness. We describe how Lambert W distributions can be used in probabilistic programs by providing stable gradient-based inference, and demonstrate their use in matrix factorization. In particular, we focus in modeling logarithmically transformed count data. We analyze the weighted squared loss used by state-of-the-art word embedding models to learn interpretable representations from word co-occurrences and show that a generative model capturing the essential properties of those models can be built using Lambert W distributions.

MASAGA: A Linearly-Convergent Stochastic First-Order Method for Optimization on Manifolds
Reza Babanezhad (University of British Columbia); Issam Hadj Laradji (University of British Columbia (UBC)); Alireza Shafaei (The University of British Columbia); Mark Schmidt (University of British Columbia)
Time: Thursday, 15.00

We consider the stochastic optimization of finite sums over a Riemannian manifold where the functions are smooth and convex. We present MASAGA, an extension of the stochastic average gradient variant SAGA on Riemannian manifolds. SAGA is a variance-reduction technique that typically outperforms methods that rely on expensive full-gradient calculations, such as the stochastic variance-reduced gradient method. We show that MASAGA achieves a linear convergence rate with uniform sampling, and we further show that MASAGA achieves a faster convergence rate with non-uniform sampling. Our experiments show that MASAGA is faster than the recent Riemannian stochastic gradient descent algorithm for the classic problem of finding the leading eigenvector corresponding to the maximum eigenvalue.

A Distributed Frank-Wolfe Framework for Learning Low-Rank Matrices with the Trace Norm
Wenjie Zheng (Sorbonne Universités), Aurélien Bellet (INRIA), Patrick Gallinar (Sorbonne Universités)
Time: Thursday, 15.20

We consider the problem of learning a high-dimensional but low-rank matrix from a large-scale dataset distributed over several machines, where low-rankness is enforced by a convex trace norm constraint. We propose DFW-Trace, a distributed Frank-Wolfe algorithm which leverages the low-rank structure of its updates to achieve efficiency in time, memory and communication usage. The step at the heart of DFW-Trace is solved approximately using a distributed version of the power method. We provide a theoretical analysis of the convergence of DFW-Trace, showing that we can ensure sublinear convergence in expectation to an optimal solution with few power iterations per epoch. We implement DFW-Trace in the Apache Spark distributed programming framework and validate the usefulness of our approach on synthetic and real data, including the ImageNet dataset with high-dimensional features extracted from a deep neural network.

Session: Time Series
Location: Davin
Time: Thursday, 14:00

Temporal Stability in Predictive Process Monitoring
Irene Teinemaa (University of Tartu), Marlon Dumas (University of Tartu), Anna Leontjeva (University of Tartu), Fabrizio Maria Maggi (University of Tartu)
Time: Thursday, 14.00

Predictive process monitoring is concerned with the analysis of events produced during the execution of a business process in order to predict as early as possible the final outcome of an ongoing case. Traditionally, predictive process monitoring methods are optimized with respect to accuracy. However, in environments where users make decisions and take actions in response to the predictions they receive, it is equally important to optimize the stability of the successive predictions made for each case. To this end, this paper defines a notion of temporal stability for binary classification tasks in predictive process monitoring and evaluates existing methods with respect to both temporal stability and accuracy. We find that methods based on XGBoost and LSTM neural networks exhibit the highest temporal stability. We then show that temporal stability can be enhanced by hyperparameter-optimizing random forests and XGBoost classifiers with respect to inter-run stability. Finally, we show that time series smoothing techniques can further enhance temporal stability at the expense of slightly lower accuracy.

ParCorr: Efficient Parallel Methods to Identify Similar Time Series Pairs across Sliding Windows
Djamel Edine Yagoubi (Inria & LIRMM), Reza Akbarinia (Inria & LIRMM), Boyan Kolev (Inria & LIRMM), Oleksandra Levchenko (Inria & LIRMM), Florent Masseglia (Inria & LIRMM), Patrick Valduriez (Inria & LIRMM), Dennis Shasha (New York University)
Time: Thursday, 14.20

Consider the problem of finding the highly correlated pairs of time series over a time window and then sliding that window to find the highly correlated pairs over successive co-temporous windows such that each successive window starts only a little time after the previous window. Doing this efficiently and in parallel could help in applications such as sensor fusion, financial trading, or communications network monitoring, to name a few. We have developed a parallel incremental random vector/sketching approach to this problem and compared it with the state-of-the-art nearest neighbor method iSAX. Whereas iSAX achieves 100% recall and precision for Euclidean distance, the sketching approach is, empirically, at least 10 times faster and achieves 95% recall and 100% precision on real and simulated data. For many applications this speedup is worth the minor reduction in recall. Our method scales up to 100 million time series and scales linearly in its expensive steps (but quadratic in the less expensive ones).

Constructive Aggregation and its application to forecasting with dynamic ensembles
Vitor Cerqueira (U. Porto); Fábio Pinto (U. Porto ); Luis Torgo (Dalhousie University); Carlos Soares (INESC TEC); Nuno Moniz (INESC TEC)
Time: Thursday, 14.40

While the predictive advantage of ensemble methods is nowadays widely accepted, the most appropriate way of estimating the weights of each individual model remains an open research question. Meanwhile, several studies report that combining different ensemble approaches leads to improvements in performance, due to a better trade-off between the diversity and the error of the individual models in the ensemble. We contribute to this research line by proposing an aggregation framework for a set of independently created forecasting models, i.e. heterogeneous ensembles. The general idea is to, instead of directly aggregating these models, first rearrange them into different subsets, creating a new set of combined models which is then aggregated into a final decision. We present this idea as constructive aggregation, and apply it to time series forecasting problems. Results from empirical experiments show that applying constructive aggregation to state of the art dynamic aggregation methods provides a consistent advantage. Constructive aggregation is publicly available in a software package.

[Add paper to your calendar]  [Download paper] Reproducible Research

Time warp invariant dictionary learning for time series clustering: application to music data stream analysis
Saeed Varasteh Yazdi (Laboratoire d’Informatique de Grenoble); Ahlame Douzal Chouakria (CNRS); Patrick Gallinari (LIP6, Sorbonne Universite); Manuel Moussalam (DEEZER)
Time: Thursday, 15.00

This work proposes a time warp invariant sparse coding and dictionary learning framework for time series clustering, where both input samples and atoms define time series of different lengths that involve variable delays. For that, first an l0 sparse coding problem is formalised and a time warp invariant orthogonal matching pursuit based on a new cosine maximisation time warp operator is proposed. A dictionary learning under time warp is then formalised and a gradient descent solution is developed. Lastly, a time series clustering based on the time warp sparse coding and dictionary learning is presented. The proposed approach is evaluated and compared to major alternative methods on several public datasets, with an application to Deezer music data stream clustering.

Exploring Variable-Length Time Series Motifs in One Hundred Million Length Scale
Yifeng Gao (George Mason University), Jessica Lin (George Mason University)
Time: Thursday, 15.20

The exploration of repeated patterns with different lengths, also called variable-length motifs, has received a great amount of attention in recent years. However, existing algorithms to detect variable-length motifs in large-scale time series are very time-consuming. In this paper, we introduce a time- and space-efficient approximate variable-length motif discovery algorithm, Distance-Propagation Sequitur (DP-Sequitur), for detecting variable-length motifs in large-scale time series data (e.g. over one hundred million in length). The discovered motifs can be ranked by different metrics such as frequency or similarity, and can benefit a wide variety of real-world applications. We demonstrate that our approach can discover motifs in time series with over one hundred million points in just minutes, which is significantly faster than the fastest existing algorithm to date. We demonstrate the superiority of our algorithm over the state-of-the-art using several real world time series datasets.

Session: ADS Sensing / Positioning
Location: Nally
Time: Thursday, 14:00

Combining Bayesian Inference and Clustering for Transport Mode Detection from Sparse and Noisy Geolocation Data
Danya Bachir (IRT SystemX); ghazaleh khodabandelou (paris saclay); vincent gauthier (telecom sudparis); Mounim A. El Yacoubi (Telecom SudParis); eric vachon (bouygues telecom)
Time: Thursday, 14.00

Large-scale real-time transport mode detection is an open challenge for smart transport research. We present the first method to detect transport modes taken by any traveling phone holder. We use anonymous Call Detail Records from the Greater Paris in collaboration with a mobile phone operator. We construct anonymized aggregated trajectories as sequences of mobile network locations, called sectors, where devices were detected. We use Bayesian inference to compute trajectories’ transport modes probabilities. In this perspective, we engineer features using both mobile and transport networks and apply clustering on sectors in order to find transport probabilities given each visited sector. Using unsupervised evaluation metrics, we find 9 clusters best describe the region’s transport usage. We construct 15% sectors labels to estimate clusters’ probabilities. We derive prior distribution parameters from both trajectories and household travel survey. For model validation, we calculate daily average user trips at department scale. We find Pearson correlations with survey above 0.96 for road and rail modes, showing the model is performant and robust to sparse and noisy trajectories.

PBE: Driver Behavior Assessment Beyond Trajectory Profiling
Bing He (University of Macau); Xiaolin Chen (Shenzhen University); Dian Zhang (Shenzhen University); Siyuan Liu (Pennsylvania State University); Dawei Han (China Pacific Insurance Company); Lionel M. NI (University of Macau)
Time: Thursday, 14.20

Nowadays, the increasing car accidents ask for the better driver behavior analysis and risk assessment for travel safety, auto insurance pricing and smart city applications. Traditional approaches largely use GPS data to assess drivers. However, it is difficult to fine-grained assess the time-varying driving behaviors. In this paper, we employ the increasingly popular On-Board Diagnostic (OBD) equipment, which measures semantic-rich vehicle information, to extract detailed trajectory and behavior data for analysis. We propose PBE system, which consists of Trajectory Profiling Model (PM), Driver Behavior Model (BM) and Risk Evaluation Model (EM). PM profiles trajectories for reminding drivers of danger in real-time. The labeled trajectories can be utilized to boost the training of BM and EM for driver risk assessment when data is incomplete. BM evaluates the driving risk using fine-grained driving behaviors on a trajectory level. Its output incorporated with the time-varying pattern, is combined with the driver-level demographic information for the final driver risk assessment in EM. Meanwhile, the whole PBE system also considers the real-world cost-sensitive application scenarios. Extensive experiments on the real-world dataset demonstrate that the performance of PBE in risk assessment outperforms the traditional systems by at least 21%.

Accurate WiFi-based Indoor Positioning with Continuous Location Sampling
Jesper E van Engelen (Leiden University); Jori van Lier (KPMG Advisory N.V.); Frank Takes (Leiden University); Heike Trautmann (University of Münster)
Time: Thursday, 14.40

The ubiquity of WiFi access points and the sharp increase in WiFi-enabled devices carried by humans have paved the way for WiFi-based indoor positioning and location analysis. Locating people in indoor environments has numerous applications in robotics, crowd control, indoor facility optimization, and automated environment mapping. However, existing WiFi-based positioning systems suffer from two major problems: (1) their accuracy and precision is limited due to inherent noise induced by indoor obstacles, and (2) they only occasionally provide location estimates, namely when a WiFi-equipped device emits a signal. To mitigate these two issues, we propose a novel Gaussian process (GP) model for WiFi signal strength measurements. It allows for simultaneous smoothing (increasing accuracy and precision of estimators) and interpolation (enabling continuous sampling of location estimates). Furthermore, simple and efficient smoothing methods for location estimates are introduced to improve localization performance in real-time settings. Experiments are conducted on two data sets from a large real-world commercial indoor retail environment. Results demonstrate that our approach provides significant improvements in terms of precision and accuracy with respect to unfiltered data. Ultimately, the GP model realizes continuous location sampling with consistently high quality location estimates.

Human Activity Recognition with Convolutional Neural Networks
Antonio Bevilacqua (University College Dublin); Kyle MacDonald (UCD); Aamina Rangarej (UCD); Venessa Widjaya (UCD); Brian Caulfield (Insight Centre for Data Analytics); Mohand Tahar Kechadi (University College Dublin)
Time: Thursday, 15.00

The problem of automatic identification of physical activities performed by human subjects is referred to as Human Activity Recognition (HAR). There exist several techniques to measure motion characteristics during these physical activities, such as Inertial Measurement Units (IMUs). IMUs have a cornerstone position in this context, and are characterized by usage flexibility, low cost, and reduced privacy impact. With the use of inertial sensors, it is possible to sample some measures such as acceleration and angular velocity of a body, and use them to learn models that are capable of correctly classifying activities to their corresponding classes. In this paper, we propose to use Convolutional Neural Networks (CNNs) to classify human activities. Our models use raw data obtained from a set of inertial sensors. We explore several combinations of activities and sensors, showing how motion signals can be adapted to be fed into CNNs by using different network architectures. We also compare the performance of different groups of sensors, investigating the classification potential of single, double and triple sensor systems. The experimental results obtained on a dataset of 16 lower-limb activities, collected from a group of participants with the use of five different sensors, are very promising.

Urban sensing for anomalous event detection: Distinguishing between legitimate traffic changes and abnormal traffic variability
Masoomeh Zameni (University of Melbourne)
Time: Thursday, 15.20

Sensors deployed in different parts of a city continuously record traffic data, such as vehicle flows and pedestrian counts. We define an unexpected change in the traffic counts as an anomalous local event. Reliable discovery of such events is very important in real-world applications such as real-time crash detection or traffic congestion detection. One of the main challenges to detecting anomalous local events is to distinguish them from legitimate global traffic changes, which happen due to seasonal effects, weather and holidays. Existing anomaly detection techniques often raise many false alarms for these legitimate traffic changes, making such techniques less reliable. To address this issue, we introduce an unsupervised anomaly detection system that represents relationships between different locations in a city. Our method uses training data to estimate the traffic count at each sensor location given the traffic counts at the other locations. The estimation error is then used to calculate the anomaly score at any given time and location in the network. We test our method on two real traffic datasets collected in the city of Melbourne, Australia, for detecting anomalous local events. Empirical results show the greater robustness of our method to legitimate global changes in traffic count than four benchmark anomaly detection methods examined in this paper.

Session: Learning 3
Location: Hogan
Time: Thursday, 16:00

ML-Plan: Automated Machine Learning via Hierarchical Planning
Felix Mohr (Paderborn University), Marcel Wever (Paderborn University), Eyke Hüllermeier (Paderborn University)
Time: Thursday, 16.00

Automated machine learning (AutoML) seeks to automatically select, compose, and parametrize machine learning algorithms, so as to achieve optimal performance on a given task (data set). Although current approaches to AutoML have already produced impressive results, the field is still far from mature, and new techniques are still being developed. In this paper, we present ML-Plan, a new approach to AutoML based on hierarchical planning. To highlight the potential of this approach, we compare ML-Plan to the state-of-the-art frameworks Auto-WEKA, auto-sklearn, and TPOT. In an extensive series of experiments, we show that ML-Plan is highly competitive and often outperforms existing approaches.

A New Method of Moments for Latent Variable Models
Matteo Ruffini (Universitat Politècnica de Catalunya and BGSMath), Marta Casanellas (Universitat Politècnica de Catalunya and BGSMath), Ricard Gavaldà (Universitat Politècnica de Catalunya and BGSMath)
Time: Thursday, 16.20

We present an algorithm for the unsupervised learning of latent variable models based on the method of moments. We give efficient estimates of the moments for two models that are well known e.g. in text mining, the single-topic model and latent Dirichlet allocation, and we provide a tensor decomposition algorithm for the moments that proves to be robust both in theory and in practice. Experiments on synthetic data show that the proposed estimators outperform the existing ones in terms of reconstruction accuracy, and that the proposed tensor decomposition technique achieves the learning accuracy of the state-of-the-art method with significantly smaller running times. We also provide examples of applications to real-world text corpora for both single-topic model and LDA, obtaining meaningful results.

VC-Dimension Based Generalization Bounds for Relational Learning
Ondrej Kuzelka (KU Leuven); Yuyi Wang (ETH Zurich); Steven Schockaert (Cardiff University)
Time: Thursday, 16.40

In many applications of relational learning, the available data can be seen as a sample from a larger relational structure (e.g. we may be given a small fragment from some social network). In this paper we are particularly concerned with scenarios in which we can assume that (i) the domain elements appearing in the given sample have been uniformly sampled without replacement from the (unknown) full domain and (ii) the sample is complete for these domain elements (i.e. it is the full substructure induced by these elements). Within this setting, we study bounds on the error of sufficient statistics of relational models that are estimated on the available data. As our main result, we prove a bound based on a variant of the Vapnik-Chervonenkis dimension which is suitable for relational data.

Domain Adaptation using Metric Learning on Manifolds
Sridhar Mahadevan (University of Massachusetts, Amherst); Bamdev Mishra (Microsoft); Shalini Ghosh (Samsung Research)
Time: Thursday, 17.00

In domain adaptation, machine learning methods trained on a source domain, which has label information, must be modified to fit a target test domain whose feature distribution is assumed to vary from the source (e.g., ranking Amazon book reviews given labeled movie reviews). Various approaches to domain adaptation have been studied in the literature, ranging from geometric approaches to statistical approaches. One popular linear algebraic approach is based on aligning second order statistics, principally the covariances, between source and target domains. In this paper, we provide a new framework for domain adaptation, based on formulating transfer from source to target as a problem of metric learning on manifolds. Specifically, our approach is based on exploiting the Riemannian manifold geometry of symmetric positive definite (SPD) covariance matrices. We show that the domain adaptation problem of aligning source and target covariances can be reduced to solving the well-known Riccati equation, which has an exact solution on the manifold of SPD matrices involving the geometric or sharp mean. We additionally constrain the Ricatti based solution to reflect the underlying geometry of the source and target domains using diffusions on the underlying source and target manifolds. We also introduce an additional component of statistical alignment by minimizing the Maximum Mean Discrepancy (MMD) metric. A key strength of our proposed approach is that it enables integrating multiple sources of variation between source and target in a unified way, by reducing the combined objective function to a nested set of Ricatti equations. In addition to showing the theoretical optimality of our solution, we present detailed experiments using standard transfer learning testbeds from computer vision comparing our proposed algorithms to past work in domain adaptation, showing improved results over a large variety of previous methods.

Scalable Nonlinear AUC Maximization Methods
Majdi Khalid (Colorado State University)
Time: Thursday, 17.20

The area under the ROC curve (AUC) is a widely used measure for evaluating classification performance on heavily imbalanced data. The kernelized AUC maximization machines have established a superior generalization ability compared to linear AUC machines because of their capability in modeling the complex nonlinear structures underlying most real world-data. However, the high training complexity renders the kernelized AUC machines infeasible for large-scale data. In this paper, we present two nonlinear AUC maximization algorithms that optimize linear classifiers over a finite-dimensional feature space constructed via the k-means Nystr\'{o}m approximation. Our first algorithm maximizes the AUC metric by optimizing a pairwise squared hinge loss function using the truncated Newton method. However, the second-order batch AUC maximization method becomes expensive to optimize for extremely massive datasets. This motivates us to develop a first-order stochastic AUC maximization algorithm that incorporates a scheduled regularization update and scheduled averaging to accelerate the convergence of the classifier. Experiments on several benchmark datasets demonstrate that the proposed AUC classifiers are more efficient than kernelized AUC machines while they are able to surpass or at least match the AUC performance of the kernelized AUC machines. We also show experimentally that the proposed stochastic AUC classifier is able to reach the optimal solution, while the other state-of-the-art online AUC maximization methods are prone to suboptimal convergence.

Session: Probabilistic Models and Statistical Methods
Location: Hogan Mezz 1
Time: Thursday, 16:00

Accurate parameter estimation for Bayesian Network Classifiers using Hierarchical Dirichlet Processes
Francois Petitjean (Monash University), Wray Buntine (Monash University), Geoffrey I. Webb (Monash University), Nayyar Zaidi (Monash University)
Time: Thursday, 16.00

This paper introduces a novel parameter estimation method for the probability tables of Bayesian network classifiers (BNCs), using hierarchical Dirichlet processes (HDPs). The main result of this paper is to show that improved parameter estimation allows BNCs to outperform leading learning methods such as Random Forest for both 0-1 loss and RMSE, albeit just on categorical datasets.As data assets become larger, entering the hyped world of ‘big’, efficient accurate classification requires three main elements: (1) classifiers with low-bias that can capture the fine-detail of large datasets (2) out-of-core learners that can learn from data without having to hold it all in main memory and (3) models that can classify new data very efficiently.The latest Bayesian network classifiers (BNCs) satisfy these requirements. Their bias can be controlled easily by increasing the number of parents of the nodes in the graph. Their structure can be learned out of core with a limited number of passes over the data. However, as the bias is made lower to accurately model classification tasks, so is the accuracy of their parameters’ estimates, as each parameter is estimated from ever decreasing quantities of data. In this paper, we introduce the use of Hierarchical Dirichlet Processes for accurate BNC parameter estimation.We conduct an extensive set of experiments on 68 standard datasets and demonstrate that our resulting classifiers perform very competitively with Random Forest in terms of prediction, while keeping the out-of-core capability and superior classification time.

Approximated Structural Learning for Large Bayesian Networks
Mauro Scanagatta (IDSIA), Giorgio Corani (IDSIA), Cassio P. de Campos (Queen’s University Belfast and Utrecht University), Marco Zaffalon (IDSIA)
Time: Thursday, 16.20

We present approximate structure learning algorithms for Bayesian networks. We discuss the two main phases of the task: the preparation of the cache of the scores and structure optimization, both with bounded and unbounded treewidth. We improve on state-of-the-art methods that rely on an ordering-based search by sampling more effectively the space of the orders. This allows for a remarkable improvement in learning Bayesian networks from thousands of variables. We also present a thorough study of the accuracy and the running time of inference, comparing bounded-treewidth and unbounded-treewidth models.

Variational Bayes for Mixture Models with Censored Data
Masahiro Kohjima (NTT Corporation); Tatsushi Matsubayashi (NTT Corporation); Hiroyuki Toda (NTT Corporation)
Time: Thursday, 16.40

In this paper, we propose a variational Bayesian algorithm for mixture models that can deal with censored data, which is the data under the situation that the exact value is known only when the value is within a certain range and otherwise only partial information is available. The proposed algorithm can be applied to any mixture model whose component distribution belongs to exponential family; it is a natural generalization of the variational Bayes that deals with “standard” data whose values are known. We confirm the effectiveness of the proposed algorithm by experiments on synthetic and real world data.

A Left-to-right Algorithm for Likelihood Estimation in Gamma-Poisson Factor Analysis
Joan Capdevila (Universitat Politècnica Catalunya); Jesús Cerquides (IIIA-CSIC); Jordi Torres (UPC Barcelona Tech & Barcelona Supercomputing Center); Francois Petitjean (Monash University); Wray Buntine (Monash University)
Time: Thursday, 17.00

Computing the probability of unseen documents is a natural evaluation task in topic modeling. Previous work has addressed this problem for the well-known Latent Dirichlet Allocation (LDA) model. However, the same problem for a more general class of topic models, referred here to as Gamma-Poisson Factor Analysis (GaP-FA), remains unexplored, which hampers a fair comparison between models. Recent findings on the exact marginal likelihood of GaP-FA enable the derivation of a closed-form expression. In this paper, we show that its exact computation grows exponentially with the number of topics and non-zero words in a document, thus being only solvable for relatively small models and short documents. Experimentation in various corpus also indicates that existing methods in the literature are unlikely to accurately estimate this probability. With that in mind, we propose L2R, a left-to-right sequential sampler that decomposes the document probability into a product of conditionals and estimates them separately. We then proceed by confirming that our estimator converges and is unbiased for both small and large collections.

[Add paper to your calendar]  [Download paper] Reproducible Research

Exploration Enhanced Expected Improvement for Bayesian Optimization
Julian M Berk (Deakin University); Vu Nguyen (Deakin University); Sunil Gupta (Deakin University, Australia); Santu Rana (Deakin University, Australia); Svetha Venkatesh (Deakin University)
Time: Thursday, 17.20

Bayesian optimization (BO) is a sample-efficient method for global optimization of expensive, noisy, black-box functions using probabilistic methods. The performance of a BO method depends on its selection strategy through an acquisition function. This must balance improving our understanding of the function in unknown regions (exploration) with locally improving on known promising samples (exploitation). Expected improvement (EI) is one of the most widely used acquisition functions for BO. Unfortunately, it has a tendency to over-exploit, meaning that it can be slow in finding new peaks. We propose a modification to EI that will allow for increased early exploration while providing similar exploitation once the system has been suitably explored. We also prove that our method has a sub-linear convergence rate and test it on a range of functions to compare its performance against the standard EI and other competing methods.

[Add paper to your calendar]  [Download paper] Reproducible Research

Session: Online Learning
Location: Hogan Mezz 2
Time: Thursday, 16:00

An Online Prediction Algorithm for Reinforcement Learning with Linear Function Approximation using Cross Entropy Method
Ajin George Joseph (University of Alberta and Indian Institute of Science), Shalabh Bhatnagar (Indian Institute of Science)
Time: Thursday, 16.00

In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, \emph{i.e.}, estimating the value function of a model-free Markov reward process using the linear function approximation architecture and with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ the multi-timescale stochastic approximation variant of the very popular cross entropy (CE) optimization method which is a model based search method to find the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems. This demonstrates the competitiveness of our algorithms against least squares and other state-of-the-art algorithms in terms of computational efficiency, accuracy and stability.

Online Feature Selection by Adaptive Sub-gradient Methods
Tingting ZHAI (Nanjing university); Hao Wang (Nanjing university); Frederic Koriche (Univ. d’Artois); Yang Gao (Nanjing University)
Time: Thursday, 16.20

The overall goal of online feature selection is to iteratively select, from high-dimensional streaming data, a small, ”budgeted” number of features for constructing accurate predictors. In this paper, we address the online feature selection problem using novel truncation techniques for two online sub-gradient methods: Adaptive Regularized Dual Averaging (ARDA) and Adaptive Mirror Descent (AMD). The corresponding truncation-based algorithms are called B-ARDA and B-AMD, respectively. The key aspect of our truncation techniques is to take into account the magnitude of feature values in the current predictor, together with their frequency in the history of predictions. A detailed regret analysis for both algorithms is provided. Experiments on six high-dimensional datasets indicate that both B-ARDA and B-AMD outperform two advanced online feature selection algorithms, OFS and SOFS, especially when the number of selected features is small. Compared to sparse online learning algorithms that use $\ell_1$ regularization, B-ARDA is superior to $\ell_1$-ARDA, and B-AMD is superior to Ada-Fobos.

SpectralLeader: Online Spectral Learning for Single Topic Models
Tong Yu (Carnegie Mellon University); Branislav Kveton (Google Research); Zheng Wen (Adobe Research); Hung Bui (Google DeepMind); Ole J. Mengshoel (Carnegie Mellon University)
Time: Thursday, 16.40

We study the problem of learning a latent variable model online from a stream of data. Latent variable models are popular because they can explain observed data through unobserved concepts. These models have traditionally been studied in the offline setting. In the online setting, online expectation maximization (EM) is arguably the most popular approach for learning latent variable models. Although online EM is computationally efficient, it typically converges to a local optimum. In this work, we develop a new online learning algorithm for latent variable models, which we call SpectralLeader. SpectralLeader converges to the global optimum, and we derive a sublinear upper bound on its n-step regret in a single topic model. In both synthetic and real-world experiments, we show that SpectralLeader performs similarly to or better than online EM with tuned hyper-parameters.

Toward Interpretable Deep Reinforcement Learning with Linear Model U-Trees
Guiliang Liu (Simon Fraser University); Oliver Schulte (Simon Fraser University); Wang Zhu (Simon Fraser University); Qingcan Li (Simon Fraser University)
Time: Thursday, 17.00

Deep Reinforcement Learning (DRL) has achieved impressive success in many applications. A key component of many DRL models is a neural network representing a Q function, to estimate the expected cumulative reward following a state-action pair. The Q function neural network contains a lot of implicit knowledge about the RL problems, but often remains unexamined and uninterpreted. To our knowledge, this work develops the first mimic learning framework for Q functions in DRL. We introduce Linear Model U-trees (LMUTs) to approximate neural network predictions. An LMUT is learned using a novel on-line algorithm that is well-suited for an active play setting, where the mimic learner observes an ongoing interaction between the neural net and the environment. Empirical evaluation shows that an LMUT mimics a Q function substantially better than five baseline methods. The transparent tree structure of an LMUT facilitates understanding the network’s learned knowledge by analyzing feature influence, extracting rules, and highlighting the super-pixels in image inputs.

Online Learning of Weighted Relational Rules for Complex Event Recognition
Nikos Katzouris (NCSR “Demokritos”); Evangelos Michelioudakis (NCSR “Demokritos”); Alexander Artikis (NCSR’D); Paliouras Georgios (NCSR “Demokritos”)
Time: Thursday, 17.20

Systems for symbolic complex event recognition detect occurrences of events in time using a set of event definitions in the form of logical rules. The Event Calculus is a temporal logic that has been used as a basis in event recognition applications, providing among others, connections to techniques for learning such rules from data. We advance the state-of-the-art by combining an existing online algorithm for learning crisp relational structure with an online method for weight learning in Markov Logic Networks (MLN). The result is an algorithm that learns complex event patterns in the form of Event Calculus theories in the MLN semantics. We evaluate our approach on a challenging real-world application for activity recognition and show that it outperforms both its crisp predecessor and the only existing online structure and weight learner for MLN in terms of predictive performance, at the price of a small increase in training time.

[Add paper to your calendar]  [Download paper] Reproducible Research

Session: Nectar
Location: Davin
Time: Thursday, 16:00

Matrix Completion under Interval Uncertainty: Highlights
Jakub Marecek (IBM Research); Peter Richtarik (KAUST); Martin Takac (Lehigh)
Time: Thursday, 16.00

Over the past decade, low-rank matrix completion has been shown to be successful in a variety of applications, ranging from collaborative filtering in recommendation systems and click prediction in computational advertising, to in-painting in image processing and computer vision. Recently [European Journal of Operational Research 256(1), 35–42, 2017], we have presented a variant of the matrix completion problem inspired by robust optimisation, which has statistical performance superior to other proposed variants. Here, we show that a solver for this variant of matrix completion, which is based on alternating parallel coordinate descent, can provide better precision for a web-scale data set on a single machine faster than a recent distributed implementation on a cluster of comparable machines. We also reference several novel applications thereof in robust statistics, event detection, and computer vision.

A two-step approach for the prediction of mood levels based on diary data
Vincent Bremer (Leuphana University Lüneburg); Dennis Becker (Leuphana University Lüneburg); Tobias Genz (Leuphana University Lüneburg); Burkhardt Funk (Leuphana University Lüneburg); Dirk Lehr (Leuphana University Lüneburg)
Time: Thursday, 16.20

The analysis of diary data can increase insights into patients suffering from mental disorders and can help to personalize online interventions. We propose a two-step approach for such an analysis. We first categorize free text diary data into activity categories by applying a bag-of-words approach and explore recurrent neuronal networks to support this task. In a second step, we develop partial ordered logit models with varying levels of heterogeneity among clients to predict their mood. We estimate the parameters of these models by employing MCMC techniques and compare the models regarding their predictive performance. This two-step approach leads to an increased interpretability about the relationships between various activity categories and the individual mood level.

Best Practices to Train Deep Models on Imbalanced Datasets-A Case Study on Animal Detection in Aerial Imagery
Benjamin Kellenberger (Wageningen University and Research); Diego Marcos (Wageningen University); Devis Tuia (Wageningen University and Research)
Time: Thursday, 16.40

We introduce recommendations to train a Convolutional Neural Network for grid-based detection on a dataset that has a substantial class imbalance. These include curriculum learning, hard negative mining, a special border class, and more. We evaluate the recommendations on the problem of animal detection in aerial images, where we obtain an increase in precision from 9% to 40% at high recalls, compared to state-of-the-art.

Deep Query Ranking for Question Answering over Knowledge Bases
Hamid Zafar (University of Bonn); Giulio Napolitano (Fraunhofer IAIS); Jens Lehmann (Smart Data Analytics Group, University of Bonn)
Time: Thursday, 17.00

We study question answering systems over knowledge graphs which map an input natural language question into candidate formal queries. Often, a ranking mechanism is used to discern the queries with higher similarity to the given question.Considering the intrinsic complexity of the natural language, finding the most accurate formal counter-part is a challenging task.In our recent paper~\cite{zafar2018formal}, we leveraged Tree-LSTM to exploit the syntactical structure of input question as well as the candidate formal queries to compute the similarities. An empirical study shows that taking the structural information of the input question and candidate query into account enhances the performance, when compared to the baseline system.

Machine Learning Approaches to Hybrid Music Recommender Systems
Andreu Vall Portabella (Johannes Kepler University); Gerhard Widmer (Johannes Kepler University)
Time: Thursday, 17.20

Music recommender systems have become a key technology supporting the access to increasingly larger music catalogs in on-line music streaming services, on-line music shops, and private collections. The interaction of users with large music catalogs is a complex phenomenon researched from different disciplines. We survey our works investigating the machine learning and data mining aspects of hybrid music recommender systems (i.e., systems that integrate different recommendation techniques). We proposed hybrid music recommender systems based solely on data and robust to the so-called ‘cold-start problem’ for new music items, favoring the discovery of relevant but non-popular music. We thoroughly studied the specific task of music playlist continuation, by analyzing fundamental playlist characteristics, song feature representations, and the relationship between playlists and the songs therein.

Session: ADS Engineering and Design
Location: Nally
Time: Thursday, 16:00

On Optimizing Operational Efficiency in Storage Systems Via Deep Reinforcement Learning
Sunil Srinivasa (Samsung SDS Research America); Girish Kathalagiri (Samsung SDS Research America); Julu Subramanyam Varanasi (Samsung Semiconductors Inc); Luis Carlos Quintela (Samsung SDS Research America); Mohamad Charafeddine (Samsung SDS Research America); Chihoon Lee (Samsung SDS Research America)
Time: Thursday, 16.00

This paper deals with the application of deep reinforcement learning to optimize the operational efficiency of a solid state storage rack. Specifically, we train an on-policy and model-free policy gradient algorithm called the Advantage Actor-Critic (A2C). We deploy a dueling deep network architecture to extract features from the sensor readings off the rack and devise a novel utility function that is used to control the A2C algorithm. Experiments show performance gains greater than 30% over the default policy for deterministic as well as random data workloads.

Automating Layout Synthesis with Constructive Preference Elicitation
Luca Erculiani (University of Trento); Paolo Dragone (University of Trento); Stefano Teso (KU Leuven); Andrea Passerini (University of Trento)
Time: Thursday, 16.20

Layout synthesis refers to the problem of arranging objects subject to design preferences and structural constraints. Applications include furniture arrangement, space partitioning (e.g. subdividing a house into rooms), urban planning, and other design tasks. Computer-aided support systems are essential tools for architects and designers to produce custom, functional layouts. Existing systems, however, do not learn the designer’s preferences, and therefore fail to generalize across sessions or instances. We propose addressing layout synthesis by casting it as a constructive preference elicitation task. Our solution employs a coactive interaction protocol, whereby the system and the designer interact by mutually improving each other’s proposals. The system iteratively recommends layouts to the user, and learns the user’s preferences by observing her improvements to the recommendation. We apply our system to two design tasks, furniture arrangement and space partitioning, and report promising quantitative and qualitative results on both.

[Add paper to your calendar]  [Download paper] Reproducible Research

Helping your Docker images to spread based on explainable models
Riccardo Guidotti (University of Pisa); Jacopo Soldani (University of Pisa); Davide Neri (University of Pisa); Antonio Brogi (University of Pisa); Dino Pedreschi (University of Pisa, Italy)
Time: Thursday, 16.40

Docker is on the rise in today’s enterprise in information technology. Docker permits shipping applications inside portable containers, which run from so-called Docker images. Docker images are distributed in public registries, which also provide mechanisms to monitor their popularity. The popularity of an image is an information of trust, and it directly impacts on its usage. We present a solution based on interpretable decision tree and regression trees for estimating the popularity of a given Docker image, and for understanding how to improve an image to increase its popularity. The results presented in this work can provide valuable insights to Docker image developers, and can be exploited for designing better and more competitive software products.

Towards Resource-Efficient Classifiers for Always-On Monitoring
Jonas Vlasselaer (KU Leuven); Wannes Meert (KU Leuven); Marian Verhelst (KU Leuven)
Time: Thursday, 17.00

Emerging applications such as natural user interfaces or smart homes create a rising interest in electronic devices that have always-on sensing and monitoring capabilities. As these devices typically have limited computational resources and require battery powered operation, the challenge lies in the development of processing and classification methods that can operate under extremely scarce resource conditions. To address this challenge, we propose a two-layered computational model which enables an enhanced trade-off between computational cost and classification accuracy: The bottom layer consists of a selection of state-of-the-art classifiers, each having a different computational cost to generate the required features and to evaluate the classifier itself. For the top layer, we propose to use a Dynamic Bayesian network which allows to not only reason about the output of the various bottom-layer classifiers, but also to take into account additional information from the past to determine the present state. Furthermore, we introduce the use of the Same Decision Probability to reason about the added value of the bottom-layer classifiers and selectively activate their computations to dynamically exploit the computational cost versus classification accuracy trade-off space. We validate our methods on the real-world SINS database, where domestic activities are recorded with an acoustic sensor network, as well as the Human Activity Recognition (HAR) benchmark dataset.

Configuration of Industrial Automation Solutions Using Multi-relational Recommender Systems
Marcel Hildebrandt (Siemens ); Swathi Shyam Sunder (Siemens)
Time: Thursday, 17.20

Building complex automation solutions, common to process industries or building automation, requires the selection of components early on in the engineering process. Typically, recommender systems guide the user in the selection of appropriate components and, in doing so, take into account varying levels of context information. Many popular shopping basket recommender systems are based on collaborative filtering. While generating competent personalized recommendations, these methods rely solely on observed user behavior and are usually context free. Moreover, their limited expressiveness makes them less valuable when used for setting up complex engineering solutions. Product configurators based on deterministic, handcrafted rules may better tackle these use cases. However, besides being rather static and inflexible, such systems are laborious to develop and require domain expertise. In this work, we study various approaches to generate recommendations when building complex engineering solutions. Our aim is to exploit statistical patterns in the data that contain a lot of predictive power and are considerably more flexible than strict, deterministic rules. To achieve this, we propose a generic recommendation method for complex, industrial solutions that incorporates both past user behavior and semantic information in a joint knowledge base. This results in a graph-structured, multi-relational data description – commonly referred to as a knowledge graph. In this setting, predicting user preference towards an item corresponds to predicting an edge in this graph. Despite its simplicity concerning data preparation and maintenance, our recommender system proves to be powerful, as shown in extensive experiments with real-world data where our model outperforms several state-of-the-art methods. Furthermore, after model training, recommending new items can be done efficiently, ensuring that our method can operate in real time when assisting users in configuring solutions.

ST-DenNetFus: A New Deep Learning Approach for Network Demand Prediction
Haytham Assem (IBM); Bora Caglayan (IBM ); Teodora Sandra Buda (IBM); Declan O’Sullivan (Trinity College Dublin)
Time: Thursday, 17.40

Network Demand Prediction is of great importance to network planning and dynamically allocating network resources based on the predicted demand, this can be very challenging as it is affected by many complex factors, including spatial dependencies, temporal dependencies, and external factors (such as regions’ functionality and crowd patterns as it will be shown in this paper). We propose a deep learning based approach called, ST-DenNetFus, to predict network demand (i.e. uplink and downlink throughput) in every region of a city. ST-DenNetFus is an end to end architecture for capturing unique properties from spatio-temporal data. ST-DenNetFus employs various branches of dense neural networks for capturing temporal closeness, period, and trend properties. For each of these properties, dense convolutional neural units are used for capturing the spatial properties of the network demand across various regions in a city. Furthermore, ST-DenNetFus introduces extra branches for fusing external data sources that have not been considered before in the network demand prediction problem of various dimensionalities. In our case, these external factors are the crowd mobility patterns, temporal functional regions, and the day of the week. We present an extensive experimental evaluation for the proposed approach using two types of network throughput (uplink and downlink) in New York City (NYC), where ST-DenNetFus outperforms four well-known baselines.

Session: Demo & Poster 2
Location: Hogan
Time: Thursday, 18:30

Tiler: Software for Human-Guided Data Exploration
Andreas Henelius (Aalto University); Emilia Oikarinen (Aalto University); Kai Puolamäki (Aalto University)
Time: Thursday, 18.30

Understanding relations in datasets is important for the successful application of data mining and machine learning methods. This paper describes Tiler, a software tool for interactive visual explorative data analysis realising the interactive Human-Guided Data Exploration framework. Tiler allows a user to formulate different hypotheses concerning the relations in a dataset. Data samples corresponding to these hypotheses are then compared visually, allowing the user to gain insight into relations in the dataset. The exploration process is iterative and the user gradually builds up his or her understanding of the data.

ADAGIO: Interactive Experimentation with Adversarial Attack and Defense for Audio
Nilaksh Das (Georgia Institute of Technology); Madhuri Shanbhogue (Georgia Institute of Technology); Shang-Tse Chen (Georgia Institute of Technology); Li Chen (Intel Corporation); Michael E. Kounavis (Intel Corporation); Duen Horng Chau (Georgia Institute of Technology)
Time: Thursday, 18.30

Adversarial machine learning research has recently demonstrated the feasibility to confuse automatic speech recognition (ASR) models by introducing acoustically imperceptible perturbations to audio samples. To help researchers and practitioners gain better understanding of the impact of such attacks, and to provide them with tools to help them more easily evaluate and craft strong defenses for their models, we present ADAGIO, the first tool designed to allow interactive experimentation with adversarial attacks and defenses on an ASR model in real time, both visually and aurally. ADAGIO incorporates AMR and MP3 audio compression techniques as defenses, which users can interactively apply to attacked audio samples. We show that these techniques, which are based on psychoacoustic principles, effectively eliminate targeted attacks, reducing the attack success rate from 92.5% to 0%. We will demonstrate ADAGIO and invite the audience to try it on the Mozilla Common Voice dataset.

ClaRe: Classification and Regression Tool for Multivariate Time Series
Ricardo Cachucho (Leiden University); Stylianos Paraschiakos (Leiden University Medical Center); Kaihua Liu (Leiden University); Benjamin van der Burgh (Leiden University); Arno Knobbe (Leiden University)
Time: Thursday, 18.30

As sensing and monitoring technology becomes more and more common, multiple scientific domains have to deal with big multivariate time series data. Whether one is in the field of finance, life science and health, engineering, sports or child psychology, being able to analyze and model multivariate time series has become of high importance. As a result, there is an increased interest in multivariate time series data methodologies, to which the data mining and machine learning communities respond with a vast literature on new time series methods. However, there is a major challenge that is commonly overlooked; most of the broad audience of end users lack the knowledge on how to implement and use such methods. To bridge the gap between users and multivariate time series methods, we introduce the ClaRe dashboard. This open source web-based tool, provides to a broad audience a new intuitive data mining methodology for regression and classification tasks over time series.

Industrial Memories: Exploring the Findings of Government Inquiries with Neural Word Embedding and Machine Learning
Susan Leavy (University College Dublin)
Time: Thursday, 18.30

We present a text mining system to support the exploration of large volumes of text detailing the findings of government inquiries. Despite their historical significance and potential societal impact, key findings of inquiries are often hidden within lengthy documents and remain inaccessible to the general public. We transform the findings of the Irish government’s inquiry into industrial schools and through the use of word embedding, text classification and visualization, present an interactive web-based platform that enables the exploration of the text in new ways to uncover new historical insights.

Monitoring Emergency First Responders’ Activities via Gradient Boosting and Inertial Sensor Data
Sebastian Scheurer (Insight Centre for Data Analytics, University College Cork)
Time: Thursday, 18.30

Emergency first response teams during operations expend much time to communicate their current location and status with their leader over noisy radio communication systems. We are developing a modular system to provide as much of that information as possible to team leaders. One component of the system is a human activity recognition (HAR) algorithm, which applies an ensemble of gradient boosted decision trees (GBT) to features extracted from inertial data captured by a wireless-enabled device, to infer what activity a first responder is engaged in. An easy-to-use smartphone application can be used to monitor up to four first responders’ activities, visualise the current activity, and inspect the GBT output in more detail.