# Publications

#### Image Feature Extraction and Demand Estimation on Airbnb: A Deep Learning Approach

Shunyuan Zhang, Dokyun Lee, Param Vir Singh, Kannan Srinivasan

*Workshop on Information Systems and Economics ( WISE) 2016*, Dublin, Ireland.

#### Professional versus Amateur Images: Investigating Differential Impact on Airbnb Property Demand

Shunyuan Zhang, Dokyun Lee, Param Vir Singh, Kannan Srinivasan

*Conference on Information Systems and Technology* *( CIST) 2016*, Nashville, TN.

**[WINNER OF BEST STUDENT PAPER AWARD AT CIST 2016]**

#### How Much Is An Image Worth? An Empirical Analysis of Property’s Image Aesthetic Quality on Demand at AirBNB

Shunyuan Zhang, Dokyun Lee, Param Vir Singh, Kannan Srinivasan

*International Conference in Information Systems ( ICIS) 2016*, Dublin, Ireland.

#### A Structured Analysis of Unstructured Big Data Leveraging Cloud Computing

Xiao Liu, Param Vir Singh, Kannan Srinivasan

Marketing Science, 35(3) 2016, 363-388.

#### Temporary title

Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou

PVLDB 9(3):180-191, 2017

selection [paper (VLDB)], [bib]

[paper (VLDB)], [bib]

Full 36 page version with all proofs (): [paper (arXiv:1507.00674)], (Version July 2015)

Project page: Causality

Abstract

Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as “the resilience of a query” and show its connections to the well-studied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for self-join-free conjunctive queries, and also make several contributions to previous known results for the problems of deletion propagation with source side-effects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of self-join-free conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source side-effects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards.

#### Rules of thumb for information acquisition from large and redundant data

Wolfgang Gatterbauer

ECIR 2011, pp. 479-490.

selection [paper], [slides], [slides], [bib],

Full 40 page version with all proofs (arXiv:1012.3502): [paper (arXiv:1012.3502)], [bib], (Version Dec 2010)

Project page: Unique recall

Assume you crawl 20% of the Web. Are you able to learn 80% of the available information? This paper develops an analytic model (and uses generally accepted assumptions of power laws distributions in data) to show that we can expect to learn less then 40% of the Web’s content, hence the 80-20 rules does not hold. The paper further describes a new family of power law distribution which remains invariant under sampling, i.e. randomly sampling from this distribution will lead again to the original distribution in the sample.

#### QueryViz: Helping users understand SQL queries and their patterns

Jonathan Danaparamita, Wolfgang Gatterbauer

EDBT 2011, pp. 558-561. (System demonstration)

selection [demonstration paper], [demonstration paper (ACM version)], [bib]

Project page: QueryViz

#### Databases will visualize queries too

Wolfgang Gatterbauer

PVLDB 4(12):1498-1501, 2011. (VLDB challenges and visions track)

[paper], [narrated slides (16MB)], [video (19min)], [bib]

Project page: QueryViz

This paper describes a *human-query interaction* pattern in which users re-use existing queries as templates to compose their own queries. This interaction mode is only made possible with new visualization tools which help users understand *SQL patterns* and the intent of existing SQL queries quickly. QueryViz is our new visualization approach.

#### A Structural Model of Employee Behavioral Dynamics in Enterprise Social Media

Yan Huang, Param Vir Singh, and Anindya Ghose.

Management Science, 61(12) 2015, 2825-2844.

#### Knowledge Sharing in Online Communities: Learning to Cross Geographic and Hierarchical Boundaries

Elina Hwang, Param Vir Singh and Linda Argote

Organization Science, 26(6) 2015, 1593-1611.

#### Forgotten Third Parties: Analyzing the Contingent Association between Unshared Third Parties, Knowledge Overlap and konwledge Transfer Relationships with Outsiders

Ray Reagans, Param Vir Singh, and Ramayya Krishnan

Organization Science, 26(5) 2015, 1400-1414.

#### Trade-offs in Online Advertising: Modeling and Measuring Advertising Effectiveness and Annoyance Dynamics

Vilma Georgia Todri, Param Vir Singh, Anindya Ghose

*Workshop on Information Systems and Economics ( WISE 2015)*, Texas, USA.

#### The complexity of resilience and responsibility for self-join-free conjunctive queries

Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou

PVLDB 9(3):180-191, 2015.

selection [paper (VLDB)], [bib]

[paper (VLDB)], [bib]

Full 36 page version with all proofs (arXiv:1507.00674): [paper (arXiv:1507.00674)], (Version July 2015)

Project page: Causality

Several research thrusts in the area of data management have focused on understanding how changes in the data affect the output of a view or standing query. Example applications are explaining query results, propagating updates through views, and anonymizing datasets. These applications usually rely on understanding how interventions in a database impact the output of a query. An important aspect of this analysis is the problem of deleting a minimum number of tuples from the input tables to make a given Boolean query false. We refer to this problem as “the resilience of a query” and show its connections to the well-studied problems of deletion propagation and causal responsibility. In this paper, we study the complexity of resilience for self-join-free conjunctive queries, and also make several contributions to previous known results for the problems of deletion propagation with source side-effects and causal responsibility: (1) We define the notion of resilience and provide a complete dichotomy for the class of self-join-free conjunctive queries with arbitrary functional dependencies; this dichotomy also extends and generalizes previous tractability results on deletion propagation with source side-effects. (2) We formalize the connection between resilience and causal responsibility, and show that resilience has a larger class of tractable queries than responsibility. (3) We identify a mistake in a previous dichotomy for the problem of causal responsibility and offer a revised characterization based on new, simpler, and more intuitive notions. (4) Finally, we extend the dichotomy for causal responsibility in two ways: (a) we treat cases where the input tables contain functional dependencies, and (b) we compute responsibility for a set of tuples specified via wildcards.

#### Linearized and single-pass belief propagation

Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, Christos Faloutsos

PVLDB 8(5):581-592, 2015.

selection [paper (VLDB)], [slides (2MB)], [narrated slides (32MB), annotations only work on Windows], [video (21min)], [Python code on Github], [SQL code on Github], [bib]

Full 18 page version with all proofs (arXiv:1406.7288): [paper (arXiv:1406.7288)], (Version Oct 2014)

Project page: SSL-H

How can we tell when accounts are fake or real in a social network? And how can we tell which accounts belong to liberal, conservative or centrist users? Often, we can answer such questions and label nodes in a network based on the labels of their neighbors and appropriate assumptions of homophily (“birds of a feather flock together”) or heterophily (“opposites attract”). One of the most widely used methods for this kind of inference is Belief Propagation (BP) which iteratively propagates the information from a few nodes with explicit labels throughout a network until convergence. A well-known problem with BP, however, is that there are no known exact guarantees of convergence in graphs with loops. This paper introduces Linearized Belief Propagation (LinBP), a linearization of BP that allows a closed-form solution via intuitive matrix equations and, thus, comes with exact convergence guarantees. It handles homophily, heterophily, and more general cases that arise in multi-class settings. Plus, it allows a compact implementation in SQL. The paper also introduces Single-pass Belief Propagation (SBP), a localized (or “myopic”) version of LinBP that propagates information across every edge at most once and for which the final class assignments depend only on the nearest labeled neighbors. In addition, SBP allows fast incremental updates in dynamic networks. Our runtime experiments show that LinBP and SBP are orders of magnitude faster than standard BP, while leading to almost identical node labels.

#### Approximate lifted inference with probabilistic databases

Wolfgang Gatterbauer, Dan Suciu

PVLDB 8(5):629-640, 2015.

selection [paper (VLDB)], [paper (arXiv:1412.1069)], [bib]

Invited to the Special Issue of VLDB Journal on VLDB 2015.

Project page: Propagation

This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: A query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers.

#### The linearization of pairwise Markov networks

Wolfgang Gatterbauer

arXiv:1502.04956.

new [working paper (arXiv:1502.04956)], (Version Feb 2015)

Our prior work “Linearized and Single-Pass Belief Propagation ” proposed to approximate the solution of loopy Belief Propagation in graphs (i.e. pairwise Markov networks) with one that requires to solve a simple linear equation system. That work was still restricted to the case in which all edges in the network carry the same symmetric, doubly stochastic potential. This paper generalizes that approach to any pairwise Markov network.

#### Semi-supervised learning with heterophily

Wolfgang Gatterbauer

arXiv:1412.3100.

new [working paper (arXiv:1412.3100)], (Version Dec 2014)

Project page: SSL-H

We propose a novel linear semi-supervised learning formulation that is derived from a solid probabilistic framework: belief propagation. We show that our formulation generalizes a number of label propagation algorithms described in the literature by allowing them to propagate generalized assumptions about influences between classes of neighboring nodes. We call this formulation Semi-Supervised Learning with Heterophily (SSL-H). We also show how the modularization matrix can be learned from observed data with a simple convex optimization framework that is inspired by locally linear embedding. We call this approach Linear Heterophily Estimation (LHE). Experiments on synthetic data show that both approaches combined can learn heterophily of a graph with 1M nodes and 10M edges in under 1min.

#### Oblivious bounds on the probability of Boolean functions

Wolfgang Gatterbauer, Dan Suciu

ACM TODS, vol. 39, no. 1, pp. 191-208, 2014.

selection [paper (ACM)], [paper (local preprint)], [paper (arXiv:1409.6052)], [SQL code], [bib]

Prior superseded working paper “Optimal Upper and Lower Bounds for Boolean Expressions by Dissociation”: arXiv:1105.2813

Project page: Propagation

This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach *dissociation* and give an exact characterization of *optimal oblivious bounds*, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the *weighted model counting* problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.

#### Counterexamples to commonly held assumptions on unit commitment and market power assessment

Wolfgang Gatterbauer, Marija Ilic

Chapter 10 of: Engineering IT-Enabled Sustainable Electricity Services, Marija Ilic, Le Xie, Qizing Liu (Eds.), Springer, 2013.

[chapter]

Within the context of the ongoing deregulation of the electricity industry, we disprove in the first part the commonly stated assumption that, in theory and under the condition of perfect information, decentralized and centralized unit commitment would lead to the same power quantities traded and, hence, to the same optimal social welfare. We show that, even in the absence of any uncertainties, independent optimization of the individual performance objectives by the decentralized market participants can actually lead to lower efficiency than centralized minimization of total operating cost. This result concerns short-term supply optimization for a given demand, and does not consider long-term investment issues. In the second part, we take the position of an individual market participant in a deregulated electricity market. We investigate the task of optimally self-scheduling generators to maximize profits by using stochastic dynamic programming. With the help of actual price forecast data from the ISO New England electricity wholesale market, we demonstrate the improvements that result from modeling the forecast price errors assuming a Cauchy error distribution instead of the commonly used Normal distribution.

#### Default-all is dangerous!

Wolfgang Gatterbauer, Alexandra Meliou, Dan Suciu

TaPP 2011.

selection [paper], [paper (arXiv:1105.4395)], [slides], [slides], [bib]

Project pages: BeliefDB, Causality

This paper shows that the “default-all propagation” scheme for database annotations can have some problematic semantic consequences. We propose an alternative “minimum-propagation” provenance that fixes the issue and that comes with several desirable properties.

#### Tracing data errors with view-conditioned causality

Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, Dan Suciu

SIGMOD 2011, pp. 505-516.

selection [paper], [paper (ACM version)], [bib]

Project page: Causality

This paper shows how causal reasoning can be used for *post-factum data cleaning*, where errors are detected after data has been transformed (e.g. by a query) and which need to be corrected in the original input data. We achieve this with a novel way for translating the problem into a SAT and a weighted MAX-SAT problem, and then using very effective existing tools.

#### Managing structured collections of community data

Wolfgang Gatterbauer, Dan Suciu

CIDR 2011, pp. 207-210. (Outrageous ideas and vision track)

[vision paper], [slides], [slides], [bib]

Project page: BeliefDB

#### Reverse data management

Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu

PVLDB 4(12):1490-1493, 2011. (VLDB challenges and visions track)

[paper], [bib]

Project page: Causality

#### Bringing provenance to its full potential using causal reasoning

Alexandra Meliou, Wolfgang Gatterbauer, Dan Suciu

TaPP 2011.

[paper], [bib]

Project page: Causality

#### Session-based browsing for better query reuse

Nodira Khoussainova, Yongchul Kwon, Wei-Ting Liao, Magdalena Balazinska, Wolfgang Gatterbauer, Dan Suciu

SSDBM 2011, pp. 583-585. (Poster)

[poster paper]

Full 10 page version (UW CSE TR 11-04-02): [technical report]

Project page: CQMS

#### Data conflict resolution using trust mappings

Wolfgang Gatterbauer, Dan Suciu

SIGMOD 2010, pp. 219-230.

selection [paper], [slides], [slides], [poster], [bib]

Full 20 page version with all proofs (arXiv:1012.3320), extended UW CSE TR 09-11-01: [paper], [bib], (Version Dec 2010)

Project page: BeliefDB

#### The complexity of causality and responsibility for query answers and non-answers

Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu

PVLDB 4(1):34-45, 2010.

selection [paper], [bib]

Full 15 page version with all proofs (arXiv:1009.2021): [paper], [bib], (Version Sept 2010)

Project page: Causality

#### Dissociation and propagation for efficient query evaluation over probabilistic databases

Wolfgang Gatterbauer, Abhay K. Jha, Dan Suciu

MUD 2010, pp. 83-97.

[paper], [bib]

Full 29 page version with all proofs (arXiv:1310.6257): [paper], (Version Aug 2014)

Project page: Propagation

#### Causality in databases

Alexandra Meliou, Wolfgang Gatterbauer, Joseph Halpern, Christoph Koch, Katherine F. Moore, Dan Suciu

IEEE Data Engineering Bulletin, Special issue on Data Provenance, 33(3):59-67, Sept. 2010.

[paper], [bib]

Project page: Causality

#### Why so? or Why no? Functional causality for explaining query answers

Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, Dan Suciu

MUD 2010, pp. 3-17.

[paper], [bib]

Full 18 page version with all proofs (arXiv:0912.5340), also UW CSE TR 09-12-01: [paper], [bib], (Version Dec 2009)

Project page: Causality

#### Visual structure-based web page clustering and retrieval

Paul Bohunsky, Wolfgang Gatterbauer

WWW 2010, pp. 1067-1068. (Poster)

[poster paper], [poster paper (ACM version)], [poster], [bib]

Former project page (not maintained anymore): VENTex

Online demo: VisBox

Example DIV table: TABLE vs DIV table

Last modified: March 3, 2017