Effros Lab - Research Areas & Current Projects
The Effros Laboratory is focused on applying mathematical tools to problems of information transmission and storage in both technological and biological systems. In technological systems, this work involves demonstrating the limits of what is possible in information storage and communication systems and then working to devise practical new techniques to help us bridge the often enormous gap between what is possible and what today's technology can achieve. In biological systems, this work seeks to understand fundamental questions about how natural systems like our brains accomplish the goals of information storage and transmission and to use this understanding to expand the boundaries of science. At its heart, all of the work done in the Effros Lab relies on an understanding of the mathematics that explains how even seemingly unpredictable individual events achieve a kind of consistency and predictability when viewed in aggregate. The applications listed below are just some of the areas that may benefit from the fundamental research of Professor Effros' group.
Communication in a World of Uncertainty
Today's communication systems are faced with enormous variability. The number of users communicating over a given access point, such as a cell phone tower or WiFi hotspot, can increase enormously from one hour to the next and then go back down just as quickly. Wireless channel conditions can change with network congestion, weather, and external interference. Network use -- whether by people or by the devices now scattered throughout our homes and communities -- is often erratic, shifting as our demands on the network change. Each of these types of uncertainty presents its own special challenge. How can we achieve the best possible performance for current network conditions if we don't know what network conditions to expect?
Basics
With advances in communications media and technologies come corresponding needs for communication techniques that take full advantage of the capabilities particular to those technologies. A prime example is internet communications, whose growth has created an increased need for techniques whereby a single user can simultaneously communicate the same information to a wide array of other users with vastly varying bandwidth resources, computational capabilities, and performance requirements. Along with a variety of other factors, this need has helped inspire a surge in interest in multi-resolution or progressive transmission source coding.
Multi-resolution source codes are data compression algorithms in which simple, low-rate source descriptions are embedded in more complex, high-rate descriptions. Use of multi-resolution source codes allows users with severe bandwidth constraints or low performance requirements to achieve a low quality data representation by only incorporating a fraction of the original coded bit stream. Users with greater capabilities or needs can achieve more precise data representations by using larger fractions of the same bit stream. Further, users uncertain of their precision needs can progressively reconstruct the data to higher and higher accuracy -- stopping the communication process when the desired accuracy is achieved. Such coding techniques are extremely valuable in any application where multiple source descriptions at varying levels of precision are required.
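As a toy illustration of the embedded-description idea (a minimal sketch, not a production coder; all names are illustrative), the successive-approximation scheme below describes a value in [0,1) one bit at a time. Decoding any prefix of the bit stream yields a coarse reproduction, and reading more bits progressively refines it.

```python
def embedded_encode(x, num_bits):
    """Successive-approximation encoder: each bit halves the interval containing x."""
    bits, lo, hi = [], 0.0, 1.0
    for _ in range(num_bits):
        mid = (lo + hi) / 2
        if x >= mid:
            bits.append(1)
            lo = mid
        else:
            bits.append(0)
            hi = mid
    return bits

def embedded_decode(bits):
    """Decode any prefix of the bit stream; a longer prefix gives a finer reproduction."""
    lo, hi = 0.0, 1.0
    for b in bits:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if b else (lo, mid)
    return (lo + hi) / 2

x = 0.637
code = embedded_encode(x, 8)
# Progressive reconstruction: stop whenever the accuracy is good enough.
for r in (2, 4, 8):
    print(r, embedded_decode(code[:r]))
```

Here the low-rate description (the first few bits) is literally embedded in the high-rate description, which is the defining property of a multi-resolution code.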
Theory
The distortion-rate bound D(R) describes the lowest expected distortion achievable at expected rate R on a known source. Thus D(0.5) is the lowest distortion achievable at 0.5 bits per symbol and D(1.5) is the lowest distortion achievable at 1.5 bits per symbol. The distortion-rate bound does not, however, describe the optimal achievable performance for an L-resolution code with L > 1. For example, the distortion-rate function does not describe the lowest distortion achievable by adding 1 bit per symbol to a code achieving the point (0.5, D(0.5)). Similarly, it does not describe the lowest distortion achievable by reading only the first 0.5 bits per symbol from a code achieving the point (1.5, D(1.5)). Further, it does not describe all possible values of distortions D1 and D2 such that the points (0.5, D1) and (1.5, D2) are achievable by the first- and second-resolution descriptions of a single two-resolution code. Theoretical work in the area of multi-resolution coding has focused on characterizing the space of rate and distortion vectors that can be achieved by a single multi-resolution code. The main contribution of this work was to derive equations describing the optimal performance theoretically achievable by fixed-rate multi-resolution source codes and by variable-rate multi-resolution source codes, for both stationary ergodic sources and stationary nonergodic sources on complete separable metric spaces (a class that essentially covers all sources encountered in practical applications).
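A useful reference point within this space is the successive-refinement question: when can both resolutions of a single two-resolution code simultaneously sit on the distortion-rate bound? The classical answer, due to Equitz and Cover, is sketched below as standard background (the notation is generic, not specific to this work).

```latex
% A source is successively refinable at rates (R_1, R_1 + R_2) if the points
% (R_1, D(R_1)) and (R_1 + R_2, D(R_1 + R_2)) are achievable by the two
% resolutions of a single two-resolution code.  This holds if and only if
% there exist reproductions \hat{X}_1 and \hat{X}_2 achieving the
% rate-distortion bounds at the two rates and forming the Markov chain
\[
  X \;\to\; \hat{X}_2 \;\to\; \hat{X}_1 .
\]
```

Sources that fail this condition necessarily pay a penalty at one resolution or the other, which is exactly why the multi-resolution achievable region must be characterized separately from D(R).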
Vector Quantizers
Multi-resolution vector quantization is the optimal coding paradigm for multi-resolution source coding. In this case, the encoder blocks an incoming data stream into contiguous blocks of dimension n and chooses for each n-block a collection of n-dimensional reproductions. In particular, in an L-resolution code, the encoder chooses for each data block L reproductions of increasing resolution. The encoder describes the chosen collection of reproductions using a fixed- or variable-rate binary string such that the first portion of the binary string describes the resolution-1 reproduction, the second portion describes the resolution-2 reproduction given the resolution-1 reproduction already described, and so on. The decoder decodes the desired portion of the binary string, updating its source reproduction as the binary descriptions for higher and higher resolution reconstructions become available.
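A minimal two-resolution sketch of this structure appears below, assuming untrained, randomly generated codebooks purely for illustration (a real design would optimize the codebooks for the source): the first index is decodable on its own, and the second index refines the reproduction given the first.

```python
import numpy as np

# Two-resolution vector quantizer sketch (illustrative codebooks, not trained).
# Resolution 1: a coarse codebook C1.  Resolution 2: for each coarse codeword,
# a refinement codebook describing the block given the coarse choice.
rng = np.random.default_rng(0)
dim = 2                                       # block dimension n
C1 = rng.normal(size=(4, dim))                # 4 coarse codewords -> 2 bits per block
C2 = rng.normal(scale=0.3, size=(4, 4, dim))  # 4 refinements per coarse cell -> 2 more bits

def encode(block):
    i = int(np.argmin(((C1 - block) ** 2).sum(axis=1)))               # resolution-1 index
    j = int(np.argmin(((C1[i] + C2[i] - block) ** 2).sum(axis=1)))    # resolution-2 index
    return i, j   # the first index stands alone; the second is interpreted given the first

def decode(i, j=None):
    return C1[i] if j is None else C1[i] + C2[i][j]

x = np.array([0.4, -1.1])
i, j = encode(x)
print(decode(i))      # coarse reproduction from the first portion of the description
print(decode(i, j))   # refined reproduction using the full description
```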
Wavelet-based Codes
While multi-resolution vector quantizers are asymptotically optimal source codes -- that is, they achieve performance arbitrarily close to the theoretical limit as the vector dimension approaches infinity -- for practical reasons, multi-resolution vector quantizers are typically implemented at very low vector dimensions. Giving up optimal encoders and decoders for ones that are merely good (and clever!) yields much of the high-dimension advantage without the computation, memory, and delay costs associated with high-dimensional codes. Wavelets are an especially useful technique for achieving much of the high-dimensional advantage at very low complexity, and are therefore extremely popular in modern multi-resolution data compression algorithms. Ongoing work aims to combine the results of multi-resolution source coding theory with the techniques used in practical multi-resolution data compression algorithms like SPIHT, EZW, and their descendants.
Basics
Like traditional (single-resolution) source codes, multi-resolution source codes are data dependent. The optimal multi-resolution source code for a particular source guarantees good coding performance on that source, but may achieve poor performance on other sources. In the interest of designing source-independent multi-resolution source codes to achieve good performance across a broad class of possible sources, we have recently introduced the new field of universal multi-resolution source coding.
Theory
The goal in universal multi-resolution source coding is to design a single code that will -- in the long run -- do as well on each source in some class of sources as if it were designed for the source in operation, and to do so without prior knowledge of the source to be compressed. At first glance, universal multi-resolution source codes, like universal (single-resolution) source codes, seem too good to be true. Yet early work in this nascent field has not only proven the existence of these codes but also bounded their performance.
Practice
While universal multi-resolution source coding theory proves the existence of universal multi-resolution source codes, the proofs are not constructive. That is, the proofs demonstrate the existence of good codes without describing how to find those codes in practice. Recent work in practical universal multi-resolution source coding treats the problem of optimal universal multi-resolution source code design. This work takes an approach similar to that used in the weighted universal vector quantization algorithm, but generalizes that approach by building a collection of multi-resolution source codes rather than a collection of single-resolution source codes and by allowing the first-stage source description itself to be multi-resolution in nature.
Wireless Communications
Wireless communications systems suffer a number of difficulties caused by the great system uncertainties to which they are subjected. For example, the noise characteristics of a wireless communication channel can vary widely and unpredictably. As a user travels through space and time, the interference, distance to base station, and competing network traffic experienced by the wireless unit may all vary enormously. Devising channel coding strategies to deal with these changing channel characteristics presents a wide array of new challenges. For example, as the channel characteristics change, so too does the maximal rate at which information can be reliably communicated across the channel. Thus good channel codes for wireless communications systems must be adaptive, or at least robust, to channel variations.
Yet developing good channel codes for wireless communications systems may have consequences not only for channel code design but also for source coding design, leading to a variety of joint source-channel coding issues. For example, if the channel coding rate is allowed to vary over time, then the source coding rate must also vary to make maximum use of the available channel coding rate. Multi-resolution source codes are particularly well-suited to this application, since they allow for varying rate but achieve that varying rate with a single source code, rather than a family of source codes, as would otherwise be required.
An alternative example of the interaction between source and channel codes in wireless communications systems may be seen in wireless data communication systems. Wireless data systems are typically implemented as packet-based networks in order to allow more users to simultaneously use the same bandwidth. In packet-based networks, information is broken into small pieces or "packets", each of which is carefully labelled and independently sent over the wireless link. Due to network congestion or interference, some of the packets may be lost in transmission. Designing source codes from which the data can be reconstructed appropriately even when some subset of the packets never arrives at the decoder (or arrives too late) requires a special form of source code, known as a multiple description source code.
Basics
Current work in channel coding studies both the capacities and the strategies by which those capacities can be achieved for a variety of channels relevant to network and wireless communications applications. Examples of channels under consideration include the broadcast channel, over which a single user communicates information to a collection of recipients, and a variety of models for time-varying channels. Both of these families of channels play central roles in a wide variety of wireless and networked communications applications.
Theory
One example of ongoing work on the topic of channel coding theory concerns the study of broadcast channel capacities. While the capacity of broadcast channels has been studied in the literature for several decades, the capacity of general broadcast channels remains unknown. Ongoing work in channel coding theory proves the capacity of Gaussian broadcast channels with finite intersymbol interference and demonstrates an optimal strategy for achieving that capacity.
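For reference, the memoryless scalar Gaussian case (without intersymbol interference) is standard textbook material rather than a result of this project: for a two-user Gaussian broadcast channel with transmit power P and noise variances N1 <= N2, superposition coding achieves every point of the capacity region.

```latex
\[
  R_1 \le \tfrac{1}{2}\log_2\!\Bigl(1 + \frac{\alpha P}{N_1}\Bigr),
  \qquad
  R_2 \le \tfrac{1}{2}\log_2\!\Bigl(1 + \frac{(1-\alpha)P}{\alpha P + N_2}\Bigr),
  \qquad \alpha \in [0,1].
\]
```

The intersymbol-interference case studied in the lab generalizes this setting to channels with memory.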
Basics
An end-to-end communication system is composed of a system encoder, which maps the source symbols into channel inputs, and a system decoder, which maps the channel outputs into noisy reproductions of the original source symbols. The system encoder can be further broken down into a source encoder, which maps the source symbols into an intermediate alphabet, typically a set of binary strings, and a channel encoder, which maps the binary strings into coded bits or waveforms for transmission over the channel. Similarly, the system decoder can be broken down into a channel decoder and a source decoder corresponding to the respective channel and source encoders. Any system encoder-decoder pair can be represented in this manner, although the breakdown is not unique.
Shannon's classical separation result states that we can optimize the end-to-end system design by separately optimizing the source encoder-decoder pair and the channel encoder-decoder pair. However, this result holds only in the limit of infinite source code dimension and infinite channel code block length. Shannon theory does not provide a design algorithm for good channel codes with finite block length. In addition, Shannon theory does not address the design of good source codes when the probability of channel error is nonzero, which is inevitable for finite-length channel codes. Thus, for practical systems, a joint source and channel code design may reduce distortion, as well as complexity and delay.
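As background, the separation result can be stated compactly for a memoryless source transmitted at one source symbol per channel use (this is standard material, not a claim about the codes described here).

```latex
% Shannon's separation theorem, informally: a source with rate-distortion
% function R(D) can be reproduced with distortion D across a channel of
% capacity C -- using separately designed source and channel codes of
% sufficiently large block length -- whenever
\[
  R(D) < C ,
\]
% and no coding scheme, separate or joint, can do so when R(D) > C.
```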
Current work in joint source and channel code design combines techniques for optimizing source codes to match channel noise characteristics with techniques for optimizing channel codes to match source code characteristics into a single joint design aimed at minimizing the end-to-end distortion experienced in the communication system as a whole. The most important factor in that joint design seems to be the bit allocation between the source coding bits (used to describe the original data string) and the channel coding bits (used to protect the source coding bits from channel errors). The resulting codes may be employed on a wide range of channels (e.g., AWGN channels and Rayleigh fading channels with and without receiver side information) with significant performance benefits.
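The bit-allocation trade-off can be illustrated with a deliberately simplified sketch. The distortion and error models below are made-up assumptions chosen only to show the mechanics of splitting a fixed rate budget; they are not the models used in the lab's designs.

```python
import math

# Toy source/channel bit allocation: given a total budget of R bits per symbol,
# split it into r_s source-coding bits and r_c = R - r_s channel-coding bits,
# then pick the split minimizing a crude end-to-end distortion estimate.

def source_distortion(r_s):
    return 2.0 ** (-2 * r_s)            # idealized quantizer: distortion ~ 2^(-2 r_s)

def residual_error_prob(r_c):
    return math.exp(-0.5 * r_c)         # assumed decay of channel-decoding failures

def end_to_end_distortion(r_s, r_c, d_fail=1.0):
    p = residual_error_prob(r_c)        # with prob. p the protected bits are lost
    return (1 - p) * source_distortion(r_s) + p * d_fail

R_total, step = 8.0, 0.25
best = min(
    (end_to_end_distortion(r_s, R_total - r_s), r_s)
    for r_s in [i * step for i in range(int(R_total / step) + 1)]
)
print(f"best split: r_s = {best[1]:.2f}, r_c = {R_total - best[1]:.2f}, D = {best[0]:.4f}")
```

Spending every bit on the source description minimizes quantization error but leaves the description unprotected; spending too many bits on protection starves the source code. The joint design searches for the balance point.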
Basics
Multiple description source codes are source codes designed for systems where source reconstruction based on only a portion of the original data stream may be required. Multiple description codes are useful for applications like packet-based networks (including wireline systems like the internet as well as the majority of wireless data systems), diversity communication systems (e.g., communications systems using antenna diversity), and distributed data storage systems. For example, in a packet-based communication system, where data is broken into small segments, each of which is separately sent through the channel, we would like to be able to reconstruct our signal under a full range of packet-loss scenarios. Similarly, in diversity channel systems, where a signal is sent from source to destination over a collection of channels, we would like to be able to reconstruct some version of our data string even when one or more of the channels in our collection fails.
Likewise, in distributed data storage systems, where data storage is distributed across a number of hosts, we would like to be able to reconstruct the stored data even when some subset of the hosts is unavailable. In each of these applications, a multiple description code could be designed to allow reconstruction based on the received partial information. That is, we could design multiple description codes to allow source reconstruction under any packet-loss scenario in a packet-based network, under any channel failure condition in a diversity system, and under any host availability situation in a distributed data storage system. (Multi-resolution source codes, which allow for reconstruction under a subset of these conditions, may be thought of as a special case of multiple description codes.)
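A toy example of the multiple description idea (a classical textbook-style construction, not the codes developed in the lab): a smooth signal is split into its even- and odd-indexed samples, and the two halves are sent as separate packets. If both packets arrive, the signal is reconstructed exactly; if one is lost, the missing samples are interpolated from the packet that did arrive.

```python
def make_descriptions(samples):
    """Split a sequence into two descriptions: even-indexed and odd-indexed samples."""
    return samples[0::2], samples[1::2]

def reconstruct(even=None, odd=None):
    """Reconstruct from whichever descriptions arrive; interpolate if one is missing."""
    if even is not None and odd is not None:          # both packets received
        out = [None] * (len(even) + len(odd))
        out[0::2], out[1::2] = even, odd
        return out
    received = even if even is not None else odd       # only one packet received
    out = []
    for a, b in zip(received, received[1:] + received[-1:]):
        out.extend([a, (a + b) / 2])                    # fill each gap by interpolation
    return out

signal = [0, 1, 2, 3, 4, 5, 6, 7]
d_even, d_odd = make_descriptions(signal)
print(reconstruct(d_even, d_odd))   # exact reconstruction when both packets arrive
print(reconstruct(even=d_even))     # degraded but useful reconstruction from one packet
```

Each description is useful on its own, and the two together are better than either alone, which is precisely the multiple description property.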
Theory
In designing a multiple description code, we would like the distortion achieved in a source reconstruction based on any possible subset of packets to be as close as possible to the lowest possible distortion at the received rate. Unfortunately, these best possible performances are not simultaneously achievable, as can be seen immediately from the multi-resolution source coding example. The goal in the theoretical research into the multiple description problem is to completely describe the space of received rates and distortions achievable by a multiple description source code. Current results in this area include bounds on this space, but existing bounds are not yet tight.
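One well-known inner bound on this space for two descriptions is the El Gamal-Cover achievable region, sketched below as standard background rather than a result of this project. For a memoryless source X with side reproductions \hat{X}_1, \hat{X}_2 (used when only one description arrives), central reproduction \hat{X}_0 (used when both arrive), and distortion constraints E d(X, \hat{X}_k) <= D_k, the rate pair (R_1, R_2) is achievable whenever:

```latex
\[
  R_1 \ge I(X;\hat{X}_1), \qquad
  R_2 \ge I(X;\hat{X}_2), \qquad
  R_1 + R_2 \ge I(X;\hat{X}_0,\hat{X}_1,\hat{X}_2) + I(\hat{X}_1;\hat{X}_2).
\]
```

The extra term I(\hat{X}_1; \hat{X}_2) in the sum-rate constraint quantifies the cost of making the two descriptions individually useful rather than complementary.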
Practice
Ongoing work in multiple description source coding concerns the design of practical multiple description source codes both within the optimal coding paradigm provided by vector quantizers and within other coding settings such as wavelet-based codes. New coding structures for general multiple description codes developed in the lab include a novel ternary code tree for multiple description source code design. Another recent advance in this area is the introduction of a new performance criterion for the design of multiple description codes. This criterion allows for the inclusion of a probability or priority function over the full range of packet-loss scenarios within a multiple description coding system.
Speech Recognition
Many of the techniques used to create robust data compression systems may be generalized for creating robust systems for other applications. A study currently under way in the Data Compression Laboratory treats the problem of speech recognition. In particular, the goal of this project is to develop speech recognition systems that achieve good recognition performance despite the enormous inter- and intra-speaker variations that occur in natural speech.
Universal Data Compression
In the area of personal telecommunications, where transmission channel capacity is often severely limited by power and bandwidth constraints, it is particularly important to compress the data maximally before transmission. That is, it is important to compress the data to a bit rate that is as close as possible to the rate-distortion bound R(D), which is fundamentally the lowest possible rate to which the source of data can be compressed without distorting the data by more than some amount D (D may be zero in the case of lossless compression). The rate-distortion bound and the data compression algorithms that achieve it depend on the precise source statistics. For example, the Huffman code, which is a lossless compression scheme, is designed for a particular probability distribution of source symbols. The same can be said for lossy compression schemes or quantizers. Standard implementations of JPEG, for example, match their single bit allocation strategy to the expected class of source statistics. If a lossless code or a quantizer is designed for one source and operated on another, the rate-distortion bound for the source in operation is not achieved. Unfortunately, especially in personal telecommunications, the exact source statistics are almost never known in advance and may indeed be time-varying. Hence it is important to study data compression schemes that are robust or adaptive to the source statistics.
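For a memoryless source, the bound referred to above has a standard single-letter form, included here as background (the notation is generic):

```latex
\[
  R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\;E[d(X,\hat{X})]\,\le\,D} I(X;\hat{X}),
\]
% which reduces to the entropy H(X) at D = 0 under the Hamming distortion
% measure, recovering the lossless case.
```

The minimization is over test channels from the source alphabet to the reproduction alphabet, which is why the bound, and any code approaching it, depends on the source statistics.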
Universal data compression is data compression that asymptotically achieves the rate-distortion bound as more and more data is processed, regardless of the source (within a class of sources). By observing more and more data, universal compressors essentially learn the statistics of the source in operation. Ideally, this learning proceeds in a way that not only achieves optimal performance eventually but also converges to the optimum at the fastest possible rate.
Theory
Given a class of source distributions, a universal quantizer is a sequence of codes for which the expected performance on any single source in that class approaches, as the length of the incoming data stream increases to infinity, the rate-distortion bound associated with the source in operation. Work on the theory of universal source codes includes investigations into conditions on their existence and on the rates of convergence of a variety of sequential and nonsequential algorithms.
In the attempt to design codes which meet the above criteria for robustness across a class of sources, one approach involves the use of multi-codebook compression systems. A multi-codebook compression system employs a collection of source codes. In using a multi-codebook compression system, each length-N block of source data is encoded with a single member of the collection, but the code used is allowed to vary from block to block. While any single code can achieve maximal compression on only a single source, careful design of multi-codebook compression systems can result in good performance over a class of source distributions.
Multi-codebook quantization systems involve two stages: the first stage maps the incoming data block to a particular quantizer or codebook, and the second stage describes the data block given the chosen codebook. For optimal coding, the codes in the collection should be vector quantizers. For (sub-optimal) codes giving better performance for typical complexity constraints, transform or wavelet-based codes may be substituted for vector quantizers.
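A minimal sketch of the two-stage idea appears below, with made-up, untrained codebooks and a simple nearest-neighbor first stage (illustrative assumptions, not the lab's algorithm): the first stage names the codebook that describes the current block best, and the second stage indexes a codeword within that codebook.

```python
import numpy as np

# Two-stage multi-codebook quantization sketch (illustrative, untrained codebooks).
# Stage 1: choose which codebook best matches the incoming block (send its index).
# Stage 2: describe the block using the chosen codebook (send the codeword index).
rng = np.random.default_rng(1)
dim = 4
codebooks = [rng.normal(loc=m, size=(8, dim)) for m in (-1.0, 0.0, 1.0)]  # 3 codebooks, 8 codewords each

def encode(block):
    best = None
    for cb_index, cb in enumerate(codebooks):
        errs = ((cb - block) ** 2).sum(axis=1)
        cw_index = int(np.argmin(errs))
        if best is None or errs[cw_index] < best[0]:
            best = (errs[cw_index], cb_index, cw_index)
    _, cb_index, cw_index = best
    return cb_index, cw_index      # (first-stage description, second-stage description)

def decode(cb_index, cw_index):
    return codebooks[cb_index][cw_index]

x = rng.normal(loc=0.8, size=dim)  # a block whose statistics match one codebook better
cb_i, cw_i = encode(x)
print(cb_i, decode(cb_i, cw_i))
```

Because the codebook choice can change from block to block, the system as a whole tracks whichever source in the class is actually in operation, even though each individual codebook is matched to only one source.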
The Burrows Wheeler Transform (BWT) is a slightly expansive reversible sequence transformation currently receiving considerable attention from researchers interested in practical lossless data compression. Experimental results on algorithms using this transformation indicate lossless coding rates better than those achieved by Ziv-Lempel codes (LZ'77 and LZ'78) but typically not quite as good as those achieved by the prediction by partial matching (PPM) schemes. To date, these comparisons have primarily involved experimental studies of implementation speeds and achieved lossless description rates. Ongoing work in the Data Compression Lab involves the design of new universal lossless data compression algorithms based on the Burrows Wheeler Transform and the theoretical analysis of these and existing BWT-based schemes. The algorithms introduced are extremely efficient in their computation and memory requirements and achieve excellent performance.
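A minimal forward and inverse transform for short strings is sketched below (quadratic-time for clarity; practical implementations use suffix sorting). It also shows why the transform is "slightly expansive": the output carries the rotation index in addition to the permuted symbols.

```python
def bwt(s):
    """Burrows-Wheeler Transform of a short string: sort all rotations and
    output their last characters plus the index of the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(rot[-1] for rot in rotations)
    return last_column, rotations.index(s)   # the extra index is the slight expansion

def inverse_bwt(last_column, idx):
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + t for c, t in zip(last_column, table))
    return table[idx]

transformed, index = bwt("banana")
print(transformed, index)                 # symbols grouped by context, easier to compress
print(inverse_bwt(transformed, index))    # recovers "banana"
```

The transform itself compresses nothing; its value is that it groups symbols with similar contexts, so that simple follow-on coders (move-to-front plus an entropy coder, for example) perform well.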
Lossless Source Coding
Basics
Lossless or "noiseless" source coding is data compression for applications where the data must be represented efficiently but cannot be altered or distorted in the compression process. Data types that typically require lossless coding include text and computer executables, where any variation in the original data string may have catastrophic consequences. Lossless data compression is employed in a wide range of technologies, including data storage devices, modems, and fax machines.
Theory
The goal in studying lossless source coding theory is both to understand the best performance that could conceivably be achieved in lossless source coding and to understand how well current algorithms perform relative to those bounds. Current work in the Data Compression Laboratory focuses on the analysis of new and existing low-complexity universal lossless codes.
Practice
While theoretical analyses are useful for bounding the expected performance of a variety of lossless source codes, they do not tell the full story about the algorithms considered. Work currently under way considers the design and implementation of practical universal lossless source codes and the comparison of their performance on text data to that of competing schemes on the same data sets.