What flies can teach us about searching the web


Editor's Introduction

A neural algorithm for a fundamental computing problem

annotated by

Have you ever wondered how Google decides what search results to show you? Or how Netflix recommends shows for you to watch? These processes rely on complex computer algorithms, and scientists are constantly working to improve how they function. As it turns out, the answer may come in a surprisingly small package—the fruit fly, that is! In this paper, researchers explore how fruit fly brains identify and respond to smells—and their findings could have huge implications for how we search, sort, and store information in our daily lives.

Paper Details

Original title: A neural algorithm for a fundamental computing problem
Original publication date:
Journal: Science
Reference: Vol. 358, Issue 6364, pp. 793-796
DOI: 10.1126/science.aam9868

Abstract

Similarity search—for example, identifying similar images in a database or similar documents on the web—is a fundamental computing problem faced by large-scale information retrieval systems. We discovered that the fruit fly olfactory circuit solves this problem with a variant of a computer science algorithm (called locality-sensitive hashing). The fly circuit assigns similar neural activity patterns to similar odors, so that behaviors learned from one odor can be applied when a similar odor is experienced. The fly algorithm, however, uses three computational strategies that depart from traditional approaches. These strategies can be translated to improve the performance of computational similarity searches. This perspective helps illuminate the logic supporting an important sensory function and provides a conceptually new algorithm for solving a fundamental computational problem.

Video. The way that fruit flies identify smells that are most similar to ones they've previously encountered could improve computer search algorithms (Courtesy Salk Institute).

 

Report

An essential task of many neural circuits is to generate neural activity patterns in response to input stimuli, so that different inputs can be specifically identified. We studied the circuit used to process odors in the fruit fly olfactory system and uncovered computational strategies for solving a fundamental machine learning problem: approximate similarity (or nearest-neighbors) search.

The fly olfactory circuit generates a “tag” for each odor, which is a set of neurons that fire when that odor is presented (1). This tag is critical for learning behavioral responses to different odors (2). For example, if a reward (e.g., sugar water) or a punishment (e.g., electric shock) is associated with an odor, that odor becomes attractive (a fly will approach the odor) or repulsive (a fly will avoid the odor), respectively. The tags assigned to odors are sparse—only a small fraction of the neurons that receive odor information respond to each odor (3–5)—and nonoverlapping: Tags for two randomly selected odors share few, if any, active neurons, so that different odors can be easily distinguished (1).

The tag for an odor is computed by a three-step procedure (Fig. 1A). The first step involves feedforward connections from odorant receptor neurons (ORNs) in the fly’s nose to projection neurons (PNs) in structures called glomeruli. There are 50 ORN types, each with a different sensitivity and selectivity for different odors. Thus, each input odor has a location in a 50-dimensional space determined by the 50 ORN firing rates. For each odor, the distribution of ORN firing rates across the 50 ORN types is exponential, with a mean that depends on the concentration of the odor (6, 7). For the PNs, this concentration dependence is removed (7, 8); that is, the distribution of firing rates across the 50 PN types is exponential, with close to the same mean for all odors and all odor concentrations (1). Thus, the first step in the circuit essentially “centers the mean”—a standard preprocessing step in many computational pipelines—using a technique called divisive normalization (8). This step is important so that the fly does not mix up odor intensity with odor type.
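
To make the mean-centering idea concrete, here is a minimal Python sketch (assuming NumPy). It illustrates only the effect of fixing the mean firing rate across concentrations, not the biophysical divisive-normalization model of (8); the function and variable names are ours.

```python
import numpy as np

def divisive_normalization(orn_rates, target_mean=1.0):
    """Rescale ORN firing rates to a fixed mean, removing the
    dependence on odor concentration (simplified illustration)."""
    orn_rates = np.asarray(orn_rates, dtype=float)
    return target_mean * orn_rates / orn_rates.mean()

rng = np.random.default_rng(0)
odor_low = rng.exponential(scale=1.0, size=50)   # 50 ORN firing rates for one odor
odor_high = 3.0 * odor_low                       # same odor at 3x the concentration
# After normalization, the two concentrations give the same PN-level pattern.
print(np.allclose(divisive_normalization(odor_low),
                  divisive_normalization(odor_high)))   # True
```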

Figure 1

Fig. 1. Mapping between the fly olfactory circuit and locality-sensitive hashing (LSH). (A) Schematic of the fly olfactory circuit. In step 1, 50 ORNs in the fly’s nose send axons to 50 PNs in the glomeruli; as a result of this projection, each odor is represented by an exponential distribution of firing rates, with the same mean for all odors and all odor concentrations. In step 2, the PNs expand the dimensionality, projecting to 2000 KCs connected by a sparse, binary random projection matrix. In step 3, the KCs receive feedback inhibition from the anterior paired lateral (APL) neuron, which leaves only the top 5% of KCs firing spikes for the odor. This 5% corresponds to the tag (hash) for the odor. (B) Illustrative odor responses. Similar pairs of odors (e.g., methanol and ethanol) are assigned more similar tags than are dissimilar odors. Darker shading denotes higher activity. (C) Differences between conventional LSH and the fly algorithm. In the example, the computational complexity for LSH and the fly is the same. The input dimensionality d = 5. LSH computes m = 3 random projections, each of which requires 10 operations (five multiplications plus five additions). The fly computes m = 15 random projections, each of which requires two addition operations. Thus, both require 30 total operations. x, input feature vector; r, Gaussian random variable; w, bin width constant for discretization.

Panel A

This panel shows a schematic (a symbolic and simplified diagram) of the different steps the fly olfactory circuit takes to identify odors. It also identifies the specific neurons and types of computations involved in the process.

Panel B

This panel shows a schematic of how the top 5% of Kenyon cells (KCs) form odor tags. It also illustrates how tags look more similar for odors with a similar molecular makeup.

Panel C

The panel shows a schematic comparing how the fly and locality-sensitive hashing (LSH) algorithms compute information. It also demonstrates how the authors fixed the computational complexity of the two algorithms so that they used the same number of mathematical operations.

The second step, where the main algorithmic insight begins, involves a 40-fold expansion in the number of neurons: Fifty PNs project to 2000 Kenyon cells (KCs), connected by a sparse, binary random connection matrix (9). Each KC receives and sums the firing rates from about six randomly selected PNs (9). The third step involves a winner-take-all (WTA) circuit in which strong inhibitory feedback comes from a single inhibitory neuron, called APL (anterior paired lateral neuron). As a result, all but the highest-firing 5% of KCs are silenced (1, 3, 4). The firing rates of these remaining 5% correspond to the tag assigned to the input odor.
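
A compact sketch of steps 2 and 3 in Python, assuming NumPy. The function name, random seed, and the exponential toy input are illustrative, not the authors' released code.

```python
import numpy as np

def fly_hash(pn_rates, projection, wta_fraction=0.05):
    """Sketch of steps 2-3: sparse binary random projection from
    50 PNs to 2000 KCs, then winner-take-all keeping the top ~5%."""
    kc_activity = projection @ pn_rates            # step 2: expand 50 -> 2000 dimensions
    k = max(1, int(wta_fraction * kc_activity.size))
    tag = np.zeros_like(kc_activity)
    winners = np.argsort(kc_activity)[-k:]         # step 3: APL-like inhibition (WTA)
    tag[winners] = kc_activity[winners]            # keep the firing rates of the winners only
    return tag

d, m = 50, 2000                                    # 50 PNs, 2000 KCs
rng = np.random.default_rng(0)
projection = (rng.random((m, d)) < 6 / d).astype(float)  # each KC samples ~6 random PNs
odor = rng.exponential(scale=1.0, size=d)          # toy PN firing rates for one odor
tag = fly_hash(odor, projection)
print(int((tag > 0).sum()))                        # ~100 active KCs (5% of 2000)
```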

From a computer science perspective, we view the fly’s circuit as a hash function, whose input is an odor and whose output is a tag (called a hash) for that odor. Although tags should discriminate odors, it is also to the fly’s advantage to associate very similar odors with similar tags (Fig. 1B), so that conditioned responses learned for one odor can be applied when a very similar odor, or a noisy version of the learned odor, is experienced. This led us to conjecture that the fly’s circuit produces tags that are locality-sensitive; that is, the more similar a pair of odors (as defined by the 50 ORN firing rates for that odor), the more similar their assigned tags. Locality-sensitive hash [LSH (10, 11)] functions serve as the foundation for solving numerous similarity search problems in computer science. We translated insights from the fly’s circuit to develop a class of LSH algorithms for efficiently finding approximate nearest neighbors of high-dimensional points.

Imagine that you are provided an image of an elephant and seek to find the 100 images—out of the billions of images on the web—that look most similar to your elephant image. This is called the nearest-neighbors search problem, which is of fundamental importance in information retrieval, data compression, and machine learning (10). Each image is typically represented as a d-dimensional vector of feature values. (Each odor that a fly processes is a 50-dimensional feature vector of firing rates.) A distance metric is used to compute the similarity between two images (feature vectors), and the goal is to efficiently find the nearest neighbors of any query image. If the web contained only a few images, then brute force linear search could easily be used to find the exact nearest neighbors. If the web contained many images, but each image was represented by a low-dimensional vector (e.g., 10 or 20 features), then space-partitioning methods (12) would similarly suffice. However, for large databases with high-dimensional data, neither approach scales (11).
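
For a small database, the exact search mentioned above is just a linear scan over every item; a minimal sketch follows (NumPy assumed, names and data illustrative).

```python
import numpy as np

def brute_force_neighbors(query, database, n_neighbors=100):
    """Exact nearest-neighbor search: rank every item in the database
    by Euclidean distance to the query (feasible only at small scale)."""
    distances = np.linalg.norm(database - query, axis=1)
    return np.argsort(distances)[:n_neighbors]

rng = np.random.default_rng(0)
database = rng.random((10_000, 128))   # e.g., 10,000 feature vectors with d = 128
query = rng.random(128)
print(brute_force_neighbors(query, database, n_neighbors=5))   # indices of the 5 closest items
```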

In many applications, returning an approximate set of nearest neighbors that are “close enough” to the query is adequate, so long as they can be found quickly. This has motivated an approach for finding approximate nearest neighbors by LSH (10). For the fly, as noted, the locality-sensitive property states that two odors that generate similar ORN responses will be represented by two tags that are themselves similar (Fig. 1B). Likewise, for image search, the tag of an elephant image will be more similar to the tag of another elephant image than to the tag of a skyscraper image.

Unlike a traditional (non-LSH) hash function, where the input points are scattered randomly and uniformly over the range, an LSH function provides a distance-preserving embedding of points from d-dimensional space into m-dimensional space (the latter corresponds to the tag). Thus, points that are closer to one another in input space have a higher probability of being assigned the same or a similar tag than points that are far apart. [A formal definition is given in (13).]

To design an LSH function, one common trick is to compute random projections of the input data (10, 11)—that is, to multiply the input feature vector by a random matrix. The Johnson-Lindenstrauss lemma (14, 15) and its many variants (16–18) provide strong theoretical bounds on how well locality is preserved when embedding data from d into m dimensions by using various types of random projections.
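
A simplified sketch of one classical random-projection LSH scheme of this kind, using the floor((r · x)/w) discretization referenced in Fig. 1C (the random offset term used in some variants is omitted; names and parameter values are ours).

```python
import numpy as np

def lsh_hash(x, R, w=4.0):
    """Classical LSH for Euclidean distance: m dense Gaussian random
    projections of the input, each discretized into bins of width w."""
    return np.floor(R @ x / w).astype(int)

d, m = 128, 16                               # embed d-dimensional inputs into m-dimensional hashes
rng = np.random.default_rng(0)
R = rng.standard_normal((m, d))              # dense Gaussian random projection matrix
x = rng.random(d)
x_near = x + 0.01 * rng.standard_normal(d)   # a slightly perturbed copy of x
print(lsh_hash(x, R))
print(lsh_hash(x_near, R))                   # nearby inputs mostly fall into the same bins
```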

The fly also assigns tags to odors through random projections (step 2 in Fig. 1A; 50 PNs → 2000 KCs), which provides a key clue to the function of this part of the circuit. There are, however, three differences between the fly algorithm and conventional LSH algorithms. First, the fly uses sparse, binary random projections, whereas LSH functions typically use dense, Gaussian random projections that require many more mathematical operations to compute. Second, the fly expands the dimensionality of the input after projection (d ≪ m), whereas LSH reduces the dimensionality (d ≫ m). Third, the fly sparsifies the higher-dimensionality representation by a WTA mechanism, whereas LSH preserves a dense representation.

In the supplementary materials (13), we show analytically that sparse, binary random projections of the type in the fly olfactory circuit generate tags that preserve the neighborhood structure of input points. This proves that the fly’s circuit represents a previously unknown LSH family.

We then empirically evaluated the fly algorithm versus traditional LSH (10, 11) on the basis of how precisely each algorithm could identify nearest neighbors of a given query point. To perform a fair comparison, we fixed the computational complexity of both algorithms to be the same (Fig. 1C). That is, the two approaches were fixed to use the same number of mathematical operations to generate a hash of length k (i.e., a vector with k nonzero values) for each input (13).

We compared the two algorithms for finding nearest neighbors in three benchmark data sets: SIFT (d = 128), GLOVE (d = 300), and MNIST (d = 784) (13). SIFT and MNIST both contain vector representations of images used for image similarity search, whereas GLOVE contains vector representations of words used for semantic similarity search. We used a subset of each data set with 10,000 inputs each, in which each input was represented as a feature vector in d-dimensional space. To test performance, we selected 1000 random query inputs from the 10,000 and compared true versus predicted nearest neighbors. That is, for each query, we found the top 2% (200) of its true nearest neighbors in input space, determined on the basis of Euclidean distance between feature vectors. We then found the top 2% of predicted nearest neighbors in m-dimensional hash space, determined on the basis of the Euclidean distance between tags (hashes). We varied the length of the hash (k) and computed the overlap between the ranked lists of true and predicted nearest neighbors by using the mean average precision (19). We averaged the mean average precision over 50 trials, in which, for each trial, the random projection matrices and the queries changed. We isolated each of the three differences between the fly algorithm and LSH to test their individual effect on nearest-neighbors retrieval performance.
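
As a rough illustration of the scoring, here is one standard way to compute average precision for a single query's ranked list; averaging over queries gives the mean average precision. The paper's exact protocol follows (19); this sketch and its names are ours.

```python
import numpy as np

def average_precision(true_neighbors, predicted_ranking):
    """Average precision of one query's predicted ranking against its
    set of true nearest neighbors (higher is better)."""
    true_set = set(true_neighbors)
    hits, precisions = 0, []
    for rank, item in enumerate(predicted_ranking, start=1):
        if item in true_set:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy example: 4 true neighbors, a ranked list of 6 predictions.
print(average_precision([1, 2, 3, 4], [1, 5, 2, 6, 3, 4]))   # ~0.73
```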

Replacing the dense Gaussian random projection of LSH with a sparse binary random projection did not hurt how precisely nearest neighbors could be identified (Fig. 2A). These results support our theoretical calculations that the fly’s random projection is locality-sensitive. Moreover, the sparse, binary random projection achieved a computational savings of a factor of 20 relative to the dense, Gaussian random projection (fig. S1) (13).

Figure 2

Fig. 2. Empirical comparison of different random projection types and tag-selection methods. In all plots, the x axis is the length of the hash, and the y axis is the mean average precision denoting how accurately the true nearest neighbors are found (higher is better). (A) Sparse, binary random projections offer near-identical performance to that of dense, Gaussian random projections, but the former provide a large savings in computation. (B) Expanding the dimensionality (from k to 20k) and sparsifying with winner-take-all (WTA) further boosts performance relative to no expansion. Results are consistent across all three benchmark data sets. Error bars indicate standard deviation over 50 trials.

Panel A

This panel shows how the performance of the LSH algorithm changed for each data set as a function of two factors: the type of random projection used and the length of the hash (tag). The results indicate that the longer the hash, the better the algorithm’s performance. The type of projection, on the other hand, made no difference to performance.

Panel B

This panel shows how the performance of the LSH algorithm changed when the authors increased the number of projections. Solely increasing the dimensionality of the projections from k to 20k did not improve performance in any of the data sets. However, increasing the dimensionality and using a “winner takes all” approach to sparsify the tag (as opposed to randomly picking tag elements) did improve performance in all three data sets.

When expanding the dimensionality, sparsifying the tag using WTA resulted in better performance than using random tag selection (Fig. 2B). WTA selected the top k firing KCs as the tag, unlike random tag selection, which selected k random KCs. For both, we used 20k random projections for the fly to equate the number of mathematical operations used by the fly and LSH (13). For example, for the SIFT data set with hash length k = 4, random selection yielded a 17.7% mean average precision, versus roughly double that (32.4%) using WTA. Thus, selecting the top firing neurons best preserves relative distances between inputs; the increased dimensionality also makes it easier to segregate dissimilar inputs. For random tag selection, we selected k random (but fixed for all inputs) KCs for the tag; hence, its performance is effectively the same as doing k random projections, as in LSH.
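
The two tag-selection strategies can be sketched as follows (illustrative Python, assuming NumPy; this is not the paper's implementation, which is linked in the acknowledgments).

```python
import numpy as np

def select_tag(kc_activity, k, method="wta", fixed_indices=None):
    """Sparsify the expanded KC representation to a hash with k nonzero
    entries, either by winner-take-all (top-k firing KCs) or by keeping
    the same fixed random subset of k KCs for every input."""
    tag = np.zeros_like(kc_activity)
    if method == "wta":
        indices = np.argsort(kc_activity)[-k:]   # highest-firing KCs for this input
    else:
        indices = fixed_indices                  # same k KCs for every input
    tag[indices] = kc_activity[indices]
    return tag

rng = np.random.default_rng(0)
k, m = 4, 20 * 4                                 # hash length k, 20k random projections
fixed_indices = rng.choice(m, size=k, replace=False)
kc_activity = rng.exponential(size=m)            # toy KC activities for one input
print(np.flatnonzero(select_tag(kc_activity, k, "wta")))
print(np.flatnonzero(select_tag(kc_activity, k, "random", fixed_indices)))
```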

With further expansion of the dimensionality (from 20k to 10d KCs, closer to the actual fly’s circuitry), we obtained further gains relative to LSH in identifying nearest neighbors across all data sets and hash lengths (Fig. 3). The gains were highest for very short hash lengths, where there was an almost threefold improvement in mean average precision (e.g., for MNIST with k = 4, 16.0% for LSH, versus 44.8% for the fly algorithm).

Figure 3

Fig. 3. Overall comparison between the fly algorithm and LSH. In all plots, the x axis is the length of the hash, and the y axis is the mean average precision (higher is better). A 10d expansion was used for the fly. Across all three data sets, the fly’s method outperforms LSH, most prominently for short hash lengths. Error bars indicate standard deviation over 50 trials.

Description

Figure 3 compares the performance of the LSH and fly algorithms in each of the three data sets. In all cases, the fly algorithm offers superior performance regardless of hash length, although the difference is greatest for short hash lengths. Also, longer hash lengths are associated with better performance for both algorithms.

We also found similar gains in performance when testing the fly algorithm in higher dimensions and for binary LSH (20) (figs. S2 and S3). Thus, the fly algorithm is scalable and may be useful across other LSH families.

Our work identified a synergy between strategies for similarity matching in the brain (21) and hashing algorithms for nearest-neighbors search in large-scale information retrieval systems. It may also have applications in duplicate detection, clustering, and energy-efficient deep learning (22). There are numerous extensions to LSH (23), including the use of multiple hash tables (11) to boost precision (we used one for both algorithms), the use of multiprobe (24) so that similar tags can be grouped together (which may be easier to implement for the fly algorithm because tags are sparse), various quantization tricks for discretizing hashes (25), and learning [called data-dependent hashing (13)]. There are also methods to speed up the random projection multiplication, both for LSH schemes by fast Johnson-Lindenstrauss transforms (26, 27) and for the fly by fast sparse matrix multiplication. Our goal was to fairly compare two conceptually different approaches for the nearest-neighbors search problem; in practical applications, all of these extensions will need to be ported to the fly algorithm.

Some of the fly algorithm’s strategies have been used before. For example, MinHash (28) and winner-take-all hash (29) both use WTA-like components, though neither proposes expanding the dimensionality; similarly, random projections are used in many LSH families, but none, to our knowledge, use sparse, binary projections. The fly olfactory circuit appears to have evolved to use a distinctive combination of these computational ingredients. The three hallmarks of the fly’s circuit motif may also appear in other brain regions and species (Table 1). Thus, locality-sensitive hashing may be a general principle of computation used in the brain (30).

Table 1

Description

This table shows how the basic structure of the fly olfactory algorithm may also appear in other animals and in other parts of the brain. For each animal and brain region, it provides a breakdown of what cells may potentially be involved in each step of the process.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/358/6364/793/suppl/DC1

Materials and Methods

Supplementary Text

Figs. S1 to S3

References (3149)

http://www.sciencemag.org/about/science-licenses-journal-article-reuse

This is an article distributed under the terms of the Science Journals Default License.

REFERENCES AND NOTES

1. C. F. Stevens, Proc. Natl. Acad. Sci. U.S.A. 112, 9460–9465 (2015).

2. D. Owald, S. Waddell, Curr. Opin. Neurobiol. 35, 178–184 (2015).

3. G. C. Turner, M. Bazhenov, G. Laurent, J. Neurophysiol. 99, 734–746 (2008).

4. A. C. Lin, A. M. Bygrave, A. de Calignon, T. Lee, G. Miesenböck, Nat. Neurosci. 17, 559–568 (2014).

5. M. Papadopoulou, S. Cassenaer, T. Nowotny, G. Laurent, Science 332, 721–725 (2011).

6. E. A. Hallem, J. R. Carlson, Cell 125, 143–160 (2006).

7. C. F. Stevens, Proc. Natl. Acad. Sci. U.S.A. 113, 6737–6742 (2016).

8. S. R. Olsen, V. Bhandawat, R. I. Wilson, Neuron 66, 287–299 (2010).

9. S. J. Caron, V. Ruta, L. F. Abbott, R. Axel, Nature 497, 113–117 (2013).

10. A. Andoni, P. Indyk, Commun. ACM 51, 117 (2008).

11. A. Gionis, P. Indyk, R. Motwani, in VLDB’99, Proceedings of the 25th International Conference on Very Large Data Bases, M. P. Atkinson et al., Eds. (Morgan Kaufmann, 1999), pp. 518–529.

12. H. Samet, Foundations of Multidimensional and Metric Data Structures (Morgan Kaufmann Series in Computer Graphics and Geometric Modeling, Morgan Kaufmann, 2005).

13. Materials and methods are available as supplementary materials.

14. W. Johnson, J. Lindenstrauss, in Conference on Modern Analysis and Probability, R. Beals, A. Beck, A. Bellow, A. Hajian, Eds., vol. 26 of Contemporary Mathematics (American Mathematical Society, 1984), pp. 189–206.

15. S. Dasgupta, A. Gupta, Random Structures Algorithms 22, 60–65 (2003).

16. D. Achlioptas, J. Comput. Syst. Sci. 66, 671–687 (2003).

17. Z. Allen-Zhu, R. Gelashvili, S. Micali, N. Shavit, Proc. Natl. Acad. Sci. U.S.A. 111, 16872–16876 (2014).

18. D. Kane, J. Nelson, J. Assoc. Comput. Mach. 61, 4 (2014).

19. Y. Lin, R. Jin, D. Cai, S. Yan, X. Li, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (IEEE Computer Society, 2013), pp. 446–451.

20. M. S. Charikar, in Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’02 [Association for Computing Machinery (ACM), 2002], pp. 380–388.

21. C. Pehlevan, D. B. Chklovskii, in NIPS’15, Proceedings of the 28th International Conference on Neural Information Processing Systems (MIT Press, 2015), pp. 2269–2277.

22. R. Spring, A. Shrivastava, Scalable and sustainable deep learning via randomized hashing. arXiv:1602.08194 [stat.ML] (26 February 2016).

23. M. Slaney, Y. Lifshits, J. He, Proc. IEEE 100, 2604–2623 (2012).

24. Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li, in VLDB ’07, Proceedings of the 33rd International Conference on Very Large Data Bases (ACM, 2007), pp. 950–961.

25. P. Li, M. Mitzenmacher, A. Shrivastava, in Proceedings of the 31st International Conference on Machine Learning (Proceedings of Machine Learning Research, 2014), pp. 676–684.

26. A. Dasgupta, R. Kumar, T. Sarlos, in KDD ’11, The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2011), pp. 1073–1081.

27. A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, L. Schmidt, in NIPS’15, Proceedings of the 28th International Conference on Neural Information Processing Systems (MIT Press, 2015), pp. 1225–1233.

28. A. Broder, in Proceedings of the Compression and Complexity of Sequences 1997 (IEEE Computer Society, 1997), p. 21.

29. J. Yagnik, D. Strelow, D. A. Ross, R.-s. Lin, in 2011 International Conference on Computer Vision (IEEE Computer Society, 2011), pp. 2431–2438.

30. L. G. Valiant, Curr. Opin. Neurobiol. 25, 15–19 (2014).

For references 31-49, please see the Supplementary Materials section.

ACKNOWLEDGMENTS

For funding support, C.F.S. thanks the NSF (grant EAGER PHY-1444273), and S.N. thanks the Army Research Office (grant DOD W911NF-17-1-0045). All authors thank A. Lang and J. Berkowitz for helpful comments on the manuscript. Code and data are available at https://bitbucket.org/navlakha/flylsh.