Oganizers: Elena Grigorescu

Our seminar will meet sporadically this semester, mostly on Mondays in LWSN 1106. To accomodate some of our speakers we will sometimes meet in a different room and at a different time. Please check the calendar for the exact details. To get announcement before the talk, please also join our mailing list.

The seminar is funded by the Center for Science of Information and the CS Department.

Monday Aug 26

Elena Grigorescu

Purdue U

Title:Statistical algorithms and a lower bound for planted clique,HAAS G66, 10:30 am. Note non-standard room.

Abstract:We introduce a framework for proving lower bounds on computational problems over distributions, based on defining a restricted class of algorithms called statistical algorithms. For such algorithms, access to the input distribution is limited to obtaining an estimate of the expectation of any given function on a sample drawn random ly from the input distribution, rather than directly accessing samples. Our definition and techniques are inspired by and generalize the statistical query model in lea rning theory. For well-known problems over distributions, we give lower bounds on the complexity of any statistical algorithm. These include a nearly optimal lower bound for detecting planted bipartite clique distributions (or planted dense subgraph distributions) when the planted clique has size O(n^{1/2-delta}) for any constant delta\ge 0. Variants of the latter have been assumed to be hard to prove hardness for other problems and for cryptographic applications. Our lower bounds provide concrete evidence supporting these assumptions.

Joint work with Vitaly Feldman, Lev Reyzin, Santosh Vempala, Ying Xiao.Monday Sep 16

Brendan Juba

Harvard U

Title:Efficient reasoning in PAC semanticsLWSN 1106. 11:30am-12:30

Abstract:Machine learning is often employed as one step in a larger application, serving to perform information extraction or data mining for example. The rules obtained by such learning are then used as inputs to a further analysis. As a consequence of the inputs to this analysis having been learned, however, its conclusions can only be theoretically guaranteed to satisfy the same standard as the inputs---that of "PAC Semantics" (Valiant, 2000). In this talk, we explore the benefits of taking the problems of learning and logical reasoning together, capturing a relatively general family of such applications. Briefly, the benefit is that we can simultaneously (1) handle incomplete information, (2) utilize rules that are a reasonable but imperfect fit for the data, and (3) reach conclusions more efficiently than is achieved by known algorithms for reasoning from rules alone. Precisely, we consider a problem of testing the validity of a "query" formula (hypothesis) against incomplete data. We present algorithms for soundly verifying such queries that succeed whenever there exist learnable rules that suffice to complete a simple proof of the query, matching what can be achieved by such a two-stage learning and reasoning process.

Monday Sep 23

Karthik Chandrasekaran

Harvard U

Title: Faster private release of marginals on small databasesLWSN 1106, 11:30 -12:30

Abstract:Title: Faster private release of marginals on small databases The goal of privacy-preserving data analysis is to enable rich statistical analyses on the database while protecting the privacy of the individual records. One of the most important classes of statistics on a dataset is its marginals. The answer to a k-way marginal query is the fraction of database's records with a given value in each of a given set of up to k attributes. In this work, we focus on the notion of differential privacy to answer marginal queries. Early work in differential privacy gave algorithms that are private provided the number of records in the database is comparable to (or larger than) the number of queries. Although a remarkable line of work improved this requirement to achieve privacy on small databases, this was at the expense of an exponential blow up in the run time. In this work, we give an algorithm with improved run time (sub-exponential) for privately answering marginal queries on small databases. Our algorithm uses a new approximate representation of the database that is inspired by tools from learning theory. This representation is derived by approximating the disjunction on low Hamming weight inputs using low-degree polynomials with coefficients of bounded L1-norm. We show new upper and lower bounds for such polynomials. Our lower bound matches our upper bound and thus, exhibits the tightness of our approach when polynomials are used to derive the approximate representation of the database.

Joint work with J. Ullman, J. Thaler and A. WanFriday Oct 11

Qin Zhang

Indiana U

LWSN B134 from 11:30-12:30 pm. (Note unusual day, and place)

Title: Lower Bound Techniques for Multiparty Communication Complexity

Abstract:In this talk we will discuss multiparty communication complexity in the message-passing model, where we have k machines, each having a piece of data and they want to jointly compute some function defined on the union of the k data sets via communication. The communication is point-to-point, and the goal is to minimize the total communication between the k sites. This model captures all point-to-point distributed computational models for big data in terms of communication complexity, including the BSP model and the MapReduce model. Problems considered in this model include statistical & graph problems, numerical linear algebra and database queries. In this talk we will introduce new techniques developed in the last two years for proving communication lower bounds in the message-passing model. We believe that these techniques will have a wide applicability.

Monday Oct 21

Liu Yang

CMU/MIT

Title: Active Testing Boolean and Real-valued FunctionsLWSN 1106 11:30-12:20

Abstract:One motivation for property testing is that testing can provide a fast preprocessing step before learning. However, algorithms based on membership queries (i.e., the ability to query functions on arbitrary points) tend to query highly ambiguous or unnatural points that can be impossible for a human oracle to label. In this work, we analyze a more realistic model where the algorithm may query for labels only on points in a given (polynomial-sized) unlabeled sample drawn from some underlying distribution. Further, we develop a general notion of the testing dimension of a given property with respect to a given distribution and use that to establish a number of lower bounds for the class of dictator functions, linear separators, etc. We will also discuss a recent generalization of this work to testing properties of real-valued functions. In particular, we have shown that it is possible to test piecewise constant functions, with d pieces, using a number of label queries independent of d. We also discuss extensions of this result to the more-general problem of testing piecewise degree-n polynomials, with d pieces, where again we are interested in using a number of label queries independent of d.Monday Nov 4

Pooya Hatami

U of Chicago

Title: Algorithmic Regularity for Polynomials and ApplicationsLWSN 1106, 11:30-12:20

Abstract:In analogy with the regularity lemma of Szemeredi, regularity lemmas for polynomials given by Green and Tao and by Kaufman and Lovett give a way of modifying a given collection F of polynomials to a new collection F', so that the polynomials in the new collection are ``pseudorandom''. These lemmas have various applications, such as (special cases) of Reed-Muller testing and worst-case to average-case reductions for polynomials. However, the transformation from F to F' is not algorithmic for either regularity lemma. We define new notions of regularity for polynomials, which are analogous to the above, but allow for an efficient algorithm to compute the pseudorandom collection F'. We also prove that our notions suffice for the applications, for which the above regularity lemmas were proved.

Joint work with Arnab Bhattacharyya and Madhur Tulsiani.Wednesday Nov 13

David Woodruff

IBM Research Almaden

(Note Unusual day, time, place) CS COLOQUIUM TALK. 2:30pm LWSN B155

Title: Sketching as a Tool for Numerical Linear Algebra

Abstract:I will discuss how sketching techniques from the data stream literature can be used to speed up well-studied algorithms for problems occurring in numerical linear algebra, such as least squares regression and approximate singular value decomposition. I will also discuss how they can be used to achieve very efficient algorithms for variants of these problems, such as robust and structured regression. In many cases the algorithms obtained using sketching are the only known ways to obtain asymptotically optimal algorithms.

Bio: David Woodruff is a research scientist at IBM Almaden, where he joined after getting his PhD from MIT in 2007. He's interested in compressed sensing, numerical linear algebra, sketching and streaming.Monday Nov 18

Sebastian Cioaba

U of Delaware

CANCELLED BECAUSE OF THE TORNADO! Title: Spectral Graph TheoryLWSN 1106 11:30-12:20

Abstract:While they do not determine the graph in general, the eigenvalues of a graph capture many important structural properties. In this talk, I will describe some classical and more recent connections between eigenvalues and combinatorial properties of graphs.

Monday Dec 2

Madhu Sudan

MSR/MIT

(Note unusual time, place) S. CONTE DISTINGUISHED LECTURE. 3:30pm-4:30pm. LWSN 3102 A/B

Title: Communication Amid Uncertainty

Abstract:The classical theory of communication assumes perfect coordination between sender and receiver of information, to develop a beautiful mathematical theory that ensures reliable efficient communication. Natural communication, for example, between humans, is however characterized by a lack of perfect agreement among the communicating players. Arguably the uncertainty created by lack of perfect agreement plays a significant role in the development of human languages, which tend to have bendable rules, ambiguous dictionaries, and seemingly needless redundancies, phenomena that are not seen in ``designed communication''. In this talk we describe how some of the basic communication tasks take on new flavors when dealing with such uncertainty and describe some of mathematical problems that emerge.

Based on joint works with Elad Haramaty (Technion), Brendan Juba (Harvard), Adam Kalai (MSR), and Sanjeev Khanna (U.Penn.).

Bio:Madhu Sudan got his Bachelors degree from IIT Delhi in 1987 and his Ph.D. from UC Berkeley in 1992. From 1992-1997 he was a Research Staff Member at IBM's Thomas J. Watson Research Center. In 1997 he joined the faculty at MIT, where among other roles he served as an Associate Director of MIT's CSAIL from 2007-2009. In 2009, Madhu Sudan joined Microsoft Research at their New England Research Center as a Principal Researcher. He continues to be an Adjunct Professor at MIT. Madhu Sudan's research lies in the fields of computational complexity theory, algorithms and reliable communcation. He is best known for his works on probabilistic checking of proofs, and on the design of list-decoding algorithms for error-correcting codes. His current research interests include semantic communication and property testing. In 2002, Madhu Sudan was awarded the Nevanlinna Prize, for outstanding contributions to the mathematics of computer science, at the International Congress of Mathematicians in Beijing.

Link to schedule for Spring 2013. Fall 2012.