Shrinking Vectors: Dimensionality Reduction to Reduce Search Space and Storage
By Vincent Brandon, Data Coordinator
August 26, 2021
At the UDRC, so much is in a name. First name, last name, and date of birth are our minimum demographic set for matching people across datasets. Unfortunately, when you are in the business of names, you are in the business of being mistyped, misspelled, and misrepresented. Our newest system upgrade tackles the indexing of names so we can search millions of people as fast as possible.
Those misspellings and other minor changes in the string that are so common in name data make traditional indexes unsuitable, so we looked to embedding for accurate representations. Unfortunately, our current output vectors are huge. A five-letter name produces a five-hundred-item list of numbers, which seems excessive. When you have tens of thousands of names to index millions of times a minute, getting the vector size down may yield orders of magnitude more performance at a significantly lower cost.
In this post, we examine PCA and count vector representations of names to answer the question, “how low can we go?” Dimensions, that is. There is not a bar or anything.
We want to present names in some standardized form to our PCA transform. In practice, this means transforming arbitrary-length words into fixed-length vectors. Here, we will count how many times a letter appears and assign it to a vector the same length as the letters of the English alphabet.
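The post does not show the counting step itself, but a minimal sketch of the idea looks like this (the function name `name_to_count_vector` is ours, purely for illustration):

```python
from collections import Counter
import string

def name_to_count_vector(name):
    """Turn an arbitrary-length name into a fixed-length vector of 26
    letter counts, one slot per letter of the English alphabet."""
    counts = Counter(c for c in name.lower() if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]

vec = name_to_count_vector("Anna")
# vec[0] counts 'a', vec[13] counts 'n'; every name maps to length 26
```

Non-alphabetic characters (hyphens, apostrophes, spaces) are simply dropped here; a production encoding would decide that deliberately.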
Principal component analysis (PCA) is a technique that tries to squeeze, rotate, and manipulate data into N-m dimensions. Within PCA, N is the number of features in the original data, and m is less than N (we want to go down, but not up, and negative or 0 dimensions do not make a ton of sense). Unfortunately, we cannot visualize 26-dimensional space well. So instead, we will use a property of PCA called Explained Variance.
Explained variance estimates how much of the difference in data we can explain with our reduced number of principal components. We start with 26 unaltered principal components, the letters “a” through “z.” As we begin reducing dimensions, we are not selecting letters to drop; we are squeezing and rotating all the information in all the letters into synthetic components.
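A minimal sketch of reading explained variance off a PCA fit, assuming scikit-learn (the post does not name a library) and random stand-in data in place of real name count vectors:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for real data: 1,000 fake 26-dimensional letter-count vectors
X = rng.integers(0, 3, size=(1000, 26)).astype(float)

# Fit a full 26-component PCA so we can inspect the whole variance curve
pca = PCA(n_components=26)
pca.fit(X)

# Cumulative explained variance: how much of the data's spread the
# first k synthetic components capture, for each k from 1 to 26
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 90% of the variance
k = int(np.searchsorted(cumulative, 0.90)) + 1
```

With real name data, the curve rises much faster than it does for random noise, which is exactly what makes the reduction worthwhile.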
This high level of variance is excellent news! A significant amount of the variance in names can be explained by fewer dimensions than the count of each letter. The UDRC, though, needs to build a search structure from these reduced vectors. We need to ensure that the distance between names remains high; otherwise, search results will yield too many hits. We can estimate how different names are by selecting a random sample and doing pairwise cosine similarities between them. The mean, deviation, and range give us great insights into how “searchable” the data will be. Looking at the mean, we can see how stable the relationships are. Looking at the standard deviation and range, we can see how much spread we get (the more, the better in this case!).
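The pairwise check above can be sketched in a few lines of NumPy; the random sample below stands in for PCA-reduced name vectors:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a random sample of 200 reduced name vectors (10 dims each)
sample = rng.random((200, 10))

# Normalize each row to unit length; the dot product of unit vectors
# is their cosine similarity, so one matrix product gives all pairs
normed = sample / np.linalg.norm(sample, axis=1, keepdims=True)
sims = normed @ normed.T

# Keep only the upper triangle, excluding the diagonal (self-similarity = 1)
pairwise = sims[np.triu_indices_from(sims, k=1)]

mean, std = pairwise.mean(), pairwise.std()
spread = pairwise.max() - pairwise.min()
```

A low mean with a wide spread suggests the reduced space still separates names well; a mean creeping toward 1 warns that searches will return too many near-identical hits.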
In our case, we should maintain a reasonably stratified and searchable space by reducing the number of dimensions via PCA. This kind of data experimentation feeds our insights into building high-performance indexing strategies for string data in our matching pipeline. For example, if we can reduce the vector size by twenty percent, that is twenty percent less data to store and a much smaller tree to search. In practice, we are weighting a location-sensitive array to further help with the stratification of names with different spellings but similar letters. The practice of dimensionality reduction is the same.
Have you experimented with feature selection or autoencoders? What about PCA? Let us know on Twitter @UTDataResearch.
See the code here: