Effros Lab - Research Areas & Current Projects
The Effros Laboratory is focused on applying mathematical tools to problems of information transmission and storage in both technological and biological systems. In technological systems, this work involves demonstrating the limits of what is possible in information storage and communication systems and then working to devise practical new techniques to help us bridge the often enormous gap between what is possible and what today's technology can achieve. In biological systems, this work seeks to understand fundamental questions about how natural systems like our brains accomplish the goals of information storage and transmission and to use this understanding to expand the boundaries of science. At its heart, all of the work done in the Effros Lab relies on an understanding of the mathematics that explains how even seemingly unpredictable individual events achieve a kind of consistency and predictability when viewed in aggregate. The applications listed below are just some of the areas that may benefit from the fundamental research of Professor Effros' group.
Communication in a World of Uncertainty
Today's communication systems are faced with enormous variability. The number of users communicating over a given access point, such as a cell phone tower or WiFi hotspot, can increase enormously from one hour to the next and then go back down just as quickly. Wireless channel conditions can change with network congestion, weather, and external interference. Network use -- whether by people or by the devices now scattered throughout our homes and communities -- is often erratic, shifting as our demands on the network change. Each of these types of uncertainty presents its own special challenge. How can we achieve the best possible performance for current network conditions if we don't know what network conditions to expect?
Basics
With advances in communications media and technologies come corresponding needs for communication techniques that take full advantage of the capabilities particular to those technologies. A prime example is internet communications, whose growth has created an increased need for techniques whereby a single user can simultaneously communicate the same information to a wide array of other users with vastly varying bandwidth resources, computational capabilities, and performance requirements. Along with a variety of other factors, this need has helped inspire a surge in interest in multi-resolution or progressive transmission source coding.
Multi-resolution source codes are data compression algorithms in which simple, low-rate source descriptions are embedded in more complex, high-rate descriptions. Use of multi-resolution source codes allows users with severe bandwidth constraints or low performance requirements to achieve a low quality data representation by only incorporating a fraction of the original coded bit stream. Users with greater capabilities or needs can achieve more precise data representations by using larger fractions of the same bit stream. Further, users uncertain of their precision needs can progressively reconstruct the data to higher and higher accuracy -- stopping the communication process when the desired accuracy is achieved. Such coding techniques are extremely valuable in any application where multiple source descriptions at varying levels of precision are required.
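As a toy illustration of the embedded-description idea (a minimal sketch, not a production coder; all names are illustrative), the successive-approximation scheme below describes a value in [0,1) one bit at a time. Decoding any prefix of the bit stream yields a coarse reproduction, and reading more bits progressively refines it.

```python
def embedded_encode(x, num_bits):
    """Successive-approximation encoder: each bit halves the interval containing x."""
    bits, lo, hi = [], 0.0, 1.0
    for _ in range(num_bits):
        mid = (lo + hi) / 2
        if x >= mid:
            bits.append(1)
            lo = mid
        else:
            bits.append(0)
            hi = mid
    return bits

def embedded_decode(bits):
    """Decode any prefix of the bit stream; a longer prefix gives a finer reproduction."""
    lo, hi = 0.0, 1.0
    for b in bits:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if b else (lo, mid)
    return (lo + hi) / 2

x = 0.637
code = embedded_encode(x, 8)
# Progressive reconstruction: stop whenever the accuracy is good enough.
for r in (2, 4, 8):
    print(r, embedded_decode(code[:r]))
```

Here the low-rate description (the first few bits) is literally embedded in the high-rate description, which is the defining property of a multi-resolution code.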
Theory
The distortion-rate bound D(R) describes the lowest expected distortion achievable at expected rate R on a known source. Thus D(0.5) is the lowest distortion achievable at 0.5 bits per symbol and D(1.5) is the lowest distortion achievable at 1.5 bits per symbol. The distortion-rate bound does not, however, describe the optimal achievable performance for an L-resolution code with L > 1. For example, the distortion-rate function does not describe the lowest distortion achievable by adding 1 bit per symbol to a code achieving the point (0.5, D(0.5)). Similarly, it does not describe the lowest distortion achievable by reading only the first 0.5 bits per symbol from a code achieving the point (1.5, D(1.5)). Further, it does not describe all possible values of distortions D1 and D2 such that the points (0.5, D1) and (1.5, D2) are achievable by the first- and second-resolution descriptions of a single two-resolution code. Theoretical work in the area of multi-resolution coding has focused on characterizing the space of rate and distortion vectors that can be achieved by a single multi-resolution code. The main contribution of this work was to derive equations describing the optimal performance theoretically achievable by fixed-rate multi-resolution source codes and by variable-rate multi-resolution source codes, for both stationary ergodic sources and stationary nonergodic sources on complete separable metric spaces (a class that essentially covers all sources encountered in practical applications).
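A useful reference point within this space is the successive-refinement question: when can both resolutions of a single two-resolution code simultaneously sit on the distortion-rate bound? The classical answer, due to Equitz and Cover, is sketched below as standard background (the notation is generic, not specific to this work).

```latex
% A source is successively refinable at rates (R_1, R_1 + R_2) if the points
% (R_1, D(R_1)) and (R_1 + R_2, D(R_1 + R_2)) are achievable by the two
% resolutions of a single two-resolution code.  This holds if and only if
% there exist reproductions \hat{X}_1 and \hat{X}_2 achieving the
% rate-distortion bounds at the two rates and forming the Markov chain
\[
  X \;\to\; \hat{X}_2 \;\to\; \hat{X}_1 .
\]
```

Sources that fail this condition necessarily pay a penalty at one resolution or the other, which is exactly why the multi-resolution achievable region must be characterized separately from D(R).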
Vector Quantizers
Multi-resolution vector quantization is the optimal coding paradigm for multi-resolution source coding. In this case, the encoder blocks an incoming data stream into contiguous blocks of dimension n and chooses for each n-block a collection of n-dimensional reproductions. In particular, in an L-resolution code, the encoder chooses for each data block L reproductions of increasing resolution. The encoder describes the chosen collection of reproductions using a fixed- or variable-rate binary string such that the first portion of the binary string describes the resolution-1 reproduction, the second portion describes the resolution-2 reproduction given the resolution-1 reproduction already described, and so on. The decoder decodes the desired portion of the binary string, updating its source reproduction as the binary descriptions for higher and higher resolution reconstructions become available.
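A minimal two-resolution sketch of this structure appears below, assuming untrained, randomly generated codebooks purely for illustration (a real design would optimize the codebooks for the source): the first index is decodable on its own, and the second index refines the reproduction given the first.

```python
import numpy as np

# Two-resolution vector quantizer sketch (illustrative codebooks, not trained).
# Resolution 1: a coarse codebook C1.  Resolution 2: for each coarse codeword,
# a refinement codebook describing the block given the coarse choice.
rng = np.random.default_rng(0)
dim = 2                                       # block dimension n
C1 = rng.normal(size=(4, dim))                # 4 coarse codewords -> 2 bits per block
C2 = rng.normal(scale=0.3, size=(4, 4, dim))  # 4 refinements per coarse cell -> 2 more bits

def encode(block):
    i = int(np.argmin(((C1 - block) ** 2).sum(axis=1)))               # resolution-1 index
    j = int(np.argmin(((C1[i] + C2[i] - block) ** 2).sum(axis=1)))    # resolution-2 index
    return i, j   # the first index stands alone; the second is interpreted given the first

def decode(i, j=None):
    return C1[i] if j is None else C1[i] + C2[i][j]

x = np.array([0.4, -1.1])
i, j = encode(x)
print(decode(i))      # coarse reproduction from the first portion of the description
print(decode(i, j))   # refined reproduction using the full description
```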
Wavelet-based Codes
While multi-resolution vector quantizers are asymptotically optimal source codes -- that is, they achieve performance arbitrarily close to the theoretical limit as the vector dimension approaches infinity -- for practical reasons, multi-resolution vector quantizers are typically implemented at very low vector dimensions. Giving up optimal encoders and decoders for ones that are merely good (and clever!) yields much of the high-dimension advantage without the computation, memory, and delay costs associated with high-dimensional codes. Wavelets are an especially useful technique for achieving much of the high-dimensional advantage at very low complexity, and are therefore extremely popular in modern multi-resolution data compression algorithms. Ongoing work aims to combine the results of multi-resolution source coding theory with the techniques used in practical multi-resolution data compression algorithms like SPIHT, EZW, and their descendants.
Basics
Like traditional (single-resolution) source codes, multi-resolution source codes are data dependent. The optimal multi-resolution source code for a particular source guarantees good coding performance on that source, but may achieve poor performance on other sources. In the interest of designing source-independent multi-resolution source codes to achieve good performance across a broad class of possible sources, we have recently introduced the new field of universal multi-resolution source coding.
Theory
The goal in universal multi-resolution source coding is to design a single code that will -- in the long run -- do as well on each source in some class of sources as if it were designed for the source in operation, and to do so without prior knowledge of the source to be compressed. At first glance, universal multi-resolution source codes, like universal (single-resolution) source codes, seem too good to be true. Yet early work in this nascent field has not only proven the existence of these codes but also bounded their performance.
Practice
While universal multi-resolution source coding theory proves the existence of universal multi-resolution source codes, the proofs are not constructive. That is, the proofs demonstrate the existence of good codes without describing how to find those codes in practice. Recent work in practical universal multi-resolution source coding treats the problem of optimal universal multi-resolution source code design. This work takes an approach similar to that used in the weighted universal vector quantization algorithm, but generalizes that approach by building a collection of multi-resolution source codes rather than a collection of single-resolution source codes and by allowing the first-stage source description itself to be multi-resolution in nature.
Wireless Communications
Wireless communications systems suffer a number of difficulties caused by the great system uncertainties to which they are subjected. For example, the noise characteristics of a wireless communication channel can vary widely and unpredictably. As a user travels through space and time, the interference, distance to base station, and competing network traffic experienced by the wireless unit may all vary enormously. Devising channel coding strategies to deal with these changing channel characteristics presents a wide array of new challenges. For example, as the channel characteristics change, so too does the maximal rate at which information can be reliably communicated across the channel. Thus good channel codes for wireless communications systems must be adaptive, or at least robust, to channel variations.
Yet developing good channel codes for wireless communications systems may have consequences not only for channel code design but also for source coding design, leading to a variety of joint source-channel coding issues. For example, if the channel coding rate is allowed to vary over time, then the source coding rate must also vary to make maximum use of the available channel coding rate. Multi-resolution source codes are particularly well-suited to this application, since they allow for varying rate but achieve that varying rate with a single source code, rather than a family of source codes, as would otherwise be required.
An alternative example of the interaction between source and channel codes in wireless communications systems may be seen in wireless data communication systems. Wireless data systems are typically implemented as packet-based networks in order to allow more users to simultaneously use the same bandwidth. In packet-based networks, information is broken into small pieces or "packets", each of which is carefully labelled and independently sent over the wireless link. Due to network congestion or interference, some of the packets may be lost in transmission. Designing source codes from which the data can be reconstructed appropriately even when some subset of the packets never arrives at the decoder (or arrives too late) requires a special form of source code, known as a multiple description source code.
Basics
Current work in channel coding studies both the capacities and the strategies by which those capacities can be achieved for a variety of channels relevant to network and wireless communications applications. Examples of channels under consideration include the broadcast channel, over which a single user communicates information to a collection of recipients, and a variety of models for time-varying channels. Both of these families of channels play central roles in a wide variety of wireless and networked communications applications.
Theory
One example of ongoing work on the topic of channel coding theory concerns the study of broadcast channel capacities. While the capacity of broadcast channels has been studied in the literature for several decades, the capacity of general broadcast channels remains unknown. Ongoing work in channel coding theory proves the capacity of Gaussian broadcast channels with finite intersymbol interference and demonstrates an optimal strategy for achieving that capacity.
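For reference, the memoryless scalar Gaussian case (without intersymbol interference) is standard textbook material rather than a result of this project: for a two-user Gaussian broadcast channel with transmit power P and noise variances N1 <= N2, superposition coding achieves every point of the capacity region.

```latex
\[
  R_1 \le \tfrac{1}{2}\log_2\!\Bigl(1 + \frac{\alpha P}{N_1}\Bigr),
  \qquad
  R_2 \le \tfrac{1}{2}\log_2\!\Bigl(1 + \frac{(1-\alpha)P}{\alpha P + N_2}\Bigr),
  \qquad \alpha \in [0,1].
\]
```

The intersymbol-interference case studied in the lab generalizes this setting to channels with memory.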
Basics
An end-to-end communication system is composed of a system encoder, which maps the source symbols into channel inputs, and a system decoder, which maps the channel outputs into noisy reproductions of the original source symbols. The system encoder can be further broken down into a source encoder, which maps the source symbols into an intermediate alphabet, typically a set of binary strings, and a channel encoder, which maps the binary strings into coded bits or waveforms for transmission over the channel. Similarly, the system decoder can be broken down into a channel decoder and a source decoder corresponding to the respective channel and source encoders. Any system encoder-decoder pair can be represented in this manner, although the breakdown is not unique.
Shannon's classical separation result states that we can optimize the end-to-end system design by separately optimizing the source encoder-decoder pair and the channel encoder-decoder pair. However, this result holds only in the limit of infinite source code dimension and infinite channel code block length. Shannon theory does not provide a design algorithm for good channel codes with finite block length. In addition, Shannon theory does not address the design of good source codes when the probability of channel error is nonzero, which is inevitable for finite-length channel codes. Thus, for practical systems, a joint source and channel code design may reduce distortion, as well as complexity and delay.
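As background, the separation result can be stated compactly for a memoryless source transmitted at one source symbol per channel use (this is standard material, not a claim about the codes described here).

```latex
% Shannon's separation theorem, informally: a source with rate-distortion
% function R(D) can be reproduced with distortion D across a channel of
% capacity C -- using separately designed source and channel codes of
% sufficiently large block length -- whenever
\[
  R(D) < C ,
\]
% and no coding scheme, separate or joint, can do so when R(D) > C.
```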
Current work in joint source and channel code design combines techniques for optimizing source codes to match channel noise characteristics with techniques for optimizing channel codes to match source code characteristics into a single joint design aimed at minimizing the end-to-end distortion experienced in the communication system as a whole. The most important factor in that joint design seems to be the bit allocation between the source coding bits (used to describe the original data string) and the channel coding bits (used to protect the source coding bits from channel errors). The resulting codes may be employed on a wide range of channels (e.g., AWGN channels and Rayleigh fading channels with and without receiver side information) with significant performance benefits.
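The bit-allocation trade-off can be illustrated with a deliberately simplified sketch. The distortion and error models below are made-up assumptions chosen only to show the mechanics of splitting a fixed rate budget; they are not the models used in the lab's designs.

```python
import math

# Toy source/channel bit allocation: given a total budget of R bits per symbol,
# split it into r_s source-coding bits and r_c = R - r_s channel-coding bits,
# then pick the split minimizing a crude end-to-end distortion estimate.

def source_distortion(r_s):
    return 2.0 ** (-2 * r_s)            # idealized quantizer: distortion ~ 2^(-2 r_s)

def residual_error_prob(r_c):
    return math.exp(-0.5 * r_c)         # assumed decay of channel-decoding failures

def end_to_end_distortion(r_s, r_c, d_fail=1.0):
    p = residual_error_prob(r_c)        # with prob. p the protected bits are lost
    return (1 - p) * source_distortion(r_s) + p * d_fail

R_total, step = 8.0, 0.25
best = min(
    (end_to_end_distortion(r_s, R_total - r_s), r_s)
    for r_s in [i * step for i in range(int(R_total / step) + 1)]
)
print(f"best split: r_s = {best[1]:.2f}, r_c = {R_total - best[1]:.2f}, D = {best[0]:.4f}")
```

Spending every bit on the source description minimizes quantization error but leaves the description unprotected; spending too many bits on protection starves the source code. The joint design searches for the balance point.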
Basics
Multiple description source codes are source codes designed for systems where source reconstruction based on only a portion of the original data stream may be required. Multiple description codes are useful for applications like packet-based networks (including wireline systems like the internet as well as the majority of wireless data systems), diversity communication systems (e.g., communications systems using antenna diversity), and distributed data storage systems. For example, in a packet-based communication system, where data is broken into small segments, each of which is separately sent through the channel, we would like to be able to reconstruct our signal under a full range of packet-loss scenarios. Similarly, in diversity channel systems, where a signal is sent from source to destination over a collection of channels, we would like to be able to reconstruct some version of our data string even when one or more of the channels in our collection fails.
Likewise, in distributed data storage systems, where data storage is distributed across a number of hosts, we would like to be able to reconstruct the stored data even when some subset of the hosts is unavailable. In each of these applications, a multiple description code could be designed to allow reconstruction based on the received partial information. That is, we could design multiple description codes to allow source reconstruction under any packet-loss scenario in a packet-based network, under any channel failure condition in a diversity system, and under any host availability situation in a distributed data storage system. (Multi-resolution source codes, which allow for reconstruction under a subset of these conditions, may be thought of as a special case of multiple description codes.)
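A toy example of the multiple description idea (a classical textbook-style construction, not the codes developed in the lab): a smooth signal is split into its even- and odd-indexed samples, and the two halves are sent as separate packets. If both packets arrive, the signal is reconstructed exactly; if one is lost, the missing samples are interpolated from the packet that did arrive.

```python
def make_descriptions(samples):
    """Split a sequence into two descriptions: even-indexed and odd-indexed samples."""
    return samples[0::2], samples[1::2]

def reconstruct(even=None, odd=None):
    """Reconstruct from whichever descriptions arrive; interpolate if one is missing."""
    if even is not None and odd is not None:          # both packets received
        out = [None] * (len(even) + len(odd))
        out[0::2], out[1::2] = even, odd
        return out
    received = even if even is not None else odd       # only one packet received
    out = []
    for a, b in zip(received, received[1:] + received[-1:]):
        out.extend([a, (a + b) / 2])                    # fill each gap by interpolation
    return out

signal = [0, 1, 2, 3, 4, 5, 6, 7]
d_even, d_odd = make_descriptions(signal)
print(reconstruct(d_even, d_odd))   # exact reconstruction when both packets arrive
print(reconstruct(even=d_even))     # degraded but useful reconstruction from one packet
```

Each description is useful on its own, and the two together are better than either alone, which is precisely the multiple description property.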
Theory
In designing a multiple description code, we would like the distortion achieved in a source reconstruction based on any possible subset of packets to be as close as possible to the lowest possible distortion at the received rate. Unfortunately, these best possible performances are not simultaneously achievable, as can be seen immediately from the multi-resolution source coding example. The goal in the theoretical research into the multiple description problem is to completely describe the space of received rates and distortions achievable by a multiple description source code. Current results in this area include bounds on this space, but existing bounds are not yet tight.
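One well-known inner bound on this space for two descriptions is the El Gamal-Cover achievable region, sketched below as standard background rather than a result of this project. For a memoryless source X with side reproductions \hat{X}_1, \hat{X}_2 (used when only one description arrives), central reproduction \hat{X}_0 (used when both arrive), and distortion constraints E d(X, \hat{X}_k) <= D_k, the rate pair (R_1, R_2) is achievable whenever:

```latex
\[
  R_1 \ge I(X;\hat{X}_1), \qquad
  R_2 \ge I(X;\hat{X}_2), \qquad
  R_1 + R_2 \ge I(X;\hat{X}_0,\hat{X}_1,\hat{X}_2) + I(\hat{X}_1;\hat{X}_2).
\]
```

The extra term I(\hat{X}_1; \hat{X}_2) in the sum-rate constraint quantifies the cost of making the two descriptions individually useful rather than complementary.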
Practice
Ongoing work in multiple description source coding concerns the design of practical multiple description source codes both within the optimal coding paradigm provided by vector quantizers and within other coding settings such as wavelet-based codes. New coding structures for general multiple description codes developed in the lab include a novel ternary code tree for multiple description source code design. Another recent advance in this area is the introduction of a new performance criterion for the design of multiple description codes. This criterion allows for the inclusion of a probability or priority function over the full range of packet-loss scenarios within a multiple description coding system.
Speech Recognition
Many of the techniques used to create robust data compression systems may be generalized for creating robust systems for other applications. A study currently under way in the Data Compression Laboratory treats the problem of speech recognition. In particular, the goal of this project is to develop speech recognition systems that achieve good recognition performance despite the enormous inter- and intra-speaker variations that occur in natural speech.
Universal Data Compression
In the area of personal telecommunications, where transmission channel capacity is often severely limited by power and bandwidth constraints, it is particularly important to compress the data maximally before transmission. That is, it is important to compress the data to a bit rate that is as close as possible to the rate-distortion bound R(D), which is fundamentally the lowest possible rate to which the source of data can be compressed without distorting the data by more than some amount D (D may be zero in the case of lossless compression). The rate-distortion bound and the data compression algorithms that achieve it depend on the precise source statistics. For example, the Huffman code, which is a lossless compression scheme, is designed for a particular probability distribution of source symbols. The same can be said for lossy compression schemes or quantizers. Standard implementations of JPEG, for example, match their single bit allocation strategy to the expected class of source statistics. If a lossless code or a quantizer is designed for one source and operated on another, the rate-distortion bound for the source in operation is not achieved. Unfortunately, especially in personal telecommunications, the exact source statistics are almost never known in advance and may indeed be time-varying. Hence it is important to study data compression schemes that are robust or adaptive to the source statistics.
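For a memoryless source, the bound referred to above has a standard single-letter form, included here as background (the notation is generic):

```latex
\[
  R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\;E[d(X,\hat{X})]\,\le\,D} I(X;\hat{X}),
\]
% which reduces to the entropy H(X) at D = 0 under the Hamming distortion
% measure, recovering the lossless case.
```

The minimization is over test channels from the source alphabet to the reproduction alphabet, which is why the bound, and any code approaching it, depends on the source statistics.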
Universal data compression is data compression that asymptotically achieves the rate-distortion bound as more and more data is processed, regardless of the source (within a class of sources). By observing more and more data, universal compressors essentially learn the statistics of the source in operation. Ideally, this learning proceeds in a way that not only achieves optimal performance eventually but also converges to the optimum at the fastest possible rate.
Theory
Given a class of source distributions, a universal quantizer is a sequence of codes for which the expected performance on any single source in that class approaches, as the length of the incoming data stream increases to infinity, the rate-distortion bound associated with the source in operation. Work on the theory of universal source codes includes investigations into conditions on their existence and on the rates of convergence of a variety of sequential and nonsequential algorithms.
In the attempt to design codes which meet the above criteria for robustness across a class of sources, one approach involves the use of multi-codebook compression systems. A multi-codebook compression system employs a collection of source codes. In using a multi-codebook compression system, each length-N block of source data is encoded with a single member of the collection, but the code used is allowed to vary from block to block. While any single code can achieve maximal compression on only a single source, careful design of multi-codebook compression systems can result in good performance over a class of source distributions.
Multi-codebook quantization systems involve two stages: the first stage maps the incoming data block to a particular quantizer or codebook, and the second stage describes the data block given the chosen codebook. For optimal coding, the codes in the collection should be vector quantizers. For (sub-optimal) codes giving better performance for typical complexity constraints, transform or wavelet-based codes may be substituted for vector quantizers.
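A minimal sketch of the two-stage idea appears below, with made-up, untrained codebooks and a simple nearest-neighbor first stage (illustrative assumptions, not the lab's algorithm): the first stage names the codebook that describes the current block best, and the second stage indexes a codeword within that codebook.

```python
import numpy as np

# Two-stage multi-codebook quantization sketch (illustrative, untrained codebooks).
# Stage 1: choose which codebook best matches the incoming block (send its index).
# Stage 2: describe the block using the chosen codebook (send the codeword index).
rng = np.random.default_rng(1)
dim = 4
codebooks = [rng.normal(loc=m, size=(8, dim)) for m in (-1.0, 0.0, 1.0)]  # 3 codebooks, 8 codewords each

def encode(block):
    best = None
    for cb_index, cb in enumerate(codebooks):
        errs = ((cb - block) ** 2).sum(axis=1)
        cw_index = int(np.argmin(errs))
        if best is None or errs[cw_index] < best[0]:
            best = (errs[cw_index], cb_index, cw_index)
    _, cb_index, cw_index = best
    return cb_index, cw_index      # (first-stage description, second-stage description)

def decode(cb_index, cw_index):
    return codebooks[cb_index][cw_index]

x = rng.normal(loc=0.8, size=dim)  # a block whose statistics match one codebook better
cb_i, cw_i = encode(x)
print(cb_i, decode(cb_i, cw_i))
```

Because the codebook choice can change from block to block, the system as a whole tracks whichever source in the class is actually in operation, even though each individual codebook is matched to only one source.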
The Burrows Wheeler Transform (BWT) is a slightly expansive reversible sequence transformation currently receiving considerable attention from researchers interested in practical lossless data compression. Experimental results on algorithms using this transformation indicate lossless coding rates better than those achieved by Ziv-Lempel codes (LZ'77 and LZ'78) but typically not quite as good as those achieved by the prediction by partial matching (PPM) schemes. To date, these comparisons have primarily involved experimental studies of implementation speeds and achieved lossless description rates. Ongoing work in the Data Compression Lab involves the design of new universal lossless data compression algorithms based on the Burrows Wheeler Transform and the theoretical analysis of these and existing BWT-based schemes. The algorithms introduced are extremely efficient in their computation and memory requirements and achieve excellent performance.
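A minimal forward and inverse transform for short strings is sketched below (quadratic-time for clarity; practical implementations use suffix sorting). It also shows why the transform is "slightly expansive": the output carries the rotation index in addition to the permuted symbols.

```python
def bwt(s):
    """Burrows-Wheeler Transform of a short string: sort all rotations and
    output their last characters plus the index of the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(rot[-1] for rot in rotations)
    return last_column, rotations.index(s)   # the extra index is the slight expansion

def inverse_bwt(last_column, idx):
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + t for c, t in zip(last_column, table))
    return table[idx]

transformed, index = bwt("banana")
print(transformed, index)                 # symbols grouped by context, easier to compress
print(inverse_bwt(transformed, index))    # recovers "banana"
```

The transform itself compresses nothing; its value is that it groups symbols with similar contexts, so that simple follow-on coders (move-to-front plus an entropy coder, for example) perform well.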
Lossless Source Coding
Basics
Lossless or "noiseless" source coding is data compression for applications where the data must be represented efficiently but cannot be altered or distorted in the compression process. Data types that typically require lossless coding include text and computer executables, where any variation in the original data string may have catastrophic consequences. Lossless data compression is employed in a wide range of technologies, including data storage devices, modems, and fax machines.
Theory
The goal in studying lossless source coding theory is both to understand the best performance that could conceivably be achieved in lossless source coding and to understand how well current algorithms perform relative to those bounds. Current work in the Data Compression Laboratory focuses on the analysis of new and existing low-complexity universal lossless codes.
Practice
While theoretical analyses are useful for bounding the expected performance of a variety of lossless source codes, they do not tell the full story about the algorithms considered. Work currently under way considers the design and implementation of practical universal lossless source codes and the comparison of their performance on text data to that of competing schemes on the same data sets.