NLP: Indexing and Topics
Today we will talk about how to create an index and how an index makes searching easier. If time permits, we will also look at how this helps us understand topic models.
Basic tabular lookups
The most basic search is a lookup in a table of keys and records like this:
| key | record |
|---|---|
| cat | PET0001 |
| dog | PET0002 |
| mouse | PET0003 |
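A minimal sketch of this kind of lookup, assuming the table is held as a list of (key, record) pairs and scanned row by row. The names `RECORDS` and `lookup` are made up for illustration.

```python
# The table above as (key, record) pairs.
RECORDS = [
    ("cat", "PET0001"),
    ("dog", "PET0002"),
    ("mouse", "PET0003"),
]

def lookup(key, records):
    """Scan every row until the key matches; return None if nothing matches."""
    for record_key, record_id in records:
        if record_key == key:
            return record_id
    return None

print(lookup("dog", RECORDS))   # PET0002
print(lookup("cats", RECORDS))  # None: an exact match fails on the plural
```

The cost of a scan like this grows with the number of rows, which is what the timing estimates are about.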
Time for tabular lookups
- Estimated 500 nanoseconds per record for basic search
- (500 * 1,000,000) nanoseconds = 0.5 seconds
- Current Catalogue of Life has 7,000,000 records = 3.5 seconds
Edit distance
Last time we talked about edit distance. A quick reminder:
cat is 1 edit away from cats
cut is 1 edit away from cat
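As a refresher, here is a sketch of one common definition of edit distance (Levenshtein distance, counting insertions, deletions and substitutions). Whether last session used exactly this variant is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

print(edit_distance("cat", "cats"))  # 1
print(edit_distance("cut", "cat"))   # 1
```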
Edit distance search
- 1000 nanoseconds per edit distance check
- ((1000 + 500) * 7,000,000) nanoseconds = 10.5 seconds
- 10.5 seconds * 2,000,000 items in the Beaty = 21,000,000 seconds, or about 243 days
Can't do this!
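The arithmetic can be checked with a few lines of Python. The per-record timings are the estimates from these notes, not measurements, and the constant and function names are made up for illustration.

```python
NS_PER_SECOND = 1_000_000_000
SECONDS_PER_DAY = 86_400

LOOKUP_NS = 500        # estimated cost of one basic comparison
EDIT_CHECK_NS = 1000   # estimated cost of one edit distance check
CATALOGUE_RECORDS = 7_000_000
BEATY_ITEMS = 2_000_000

def scan_seconds(n_records, ns_per_record):
    """Time to touch every record once, in seconds."""
    return n_records * ns_per_record / NS_PER_SECOND

exact_scan = scan_seconds(CATALOGUE_RECORDS, LOOKUP_NS)
fuzzy_scan = scan_seconds(CATALOGUE_RECORDS, LOOKUP_NS + EDIT_CHECK_NS)
total_days = fuzzy_scan * BEATY_ITEMS / SECONDS_PER_DAY

print(exact_scan)                  # 3.5
print(fuzzy_scan)                  # 10.5
print(f"{total_days:.1f} days")    # 243.1 days
```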
Input cleaning (stemming)
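One approach is to simply clean the input before looking it up. Stemming, for example, reduces "cats" to "cat" so that an exact lookup can match.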
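A deliberately naive sketch of input cleaning: trim whitespace, lowercase, and strip a trailing "s". This is not a real stemming algorithm (real stemmers handle many more suffixes), and `INDEX`, `clean` and `cleaned_lookup` are hypothetical names.

```python
# Exact-match index keyed by the cleaned form of each term.
INDEX = {"cat": "PET0001", "dog": "PET0002", "mouse": "PET0003"}

def clean(term):
    """Normalise a query: trim, lowercase, drop a plural 's'."""
    term = term.strip().lower()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]
    return term

def cleaned_lookup(term, index=INDEX):
    """Clean the input first, then do a single exact lookup."""
    return index.get(clean(term))

print(cleaned_lookup(" Cats"))  # PET0001
print(cleaned_lookup("cut"))    # None: cleaning does not fix misspellings
```

Cleaning keeps the lookup cheap, but it only catches the variations the cleaning rules anticipate.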
Indexing: rank search
A better approach is to pre-compute a distance metric and build an index.
| term | synonym | distance |
|---|---|---|
| kitten | kittie | 1 |
| kitten | cat | 2 |
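One way to hold a table like the one above in memory is a dictionary from each term to its synonyms, sorted by distance ahead of time. This is a sketch under that assumption; `build_index` and `ranked_synonyms` are illustrative names.

```python
from collections import defaultdict

# Rows of (term, synonym, distance), as in the table above.
ROWS = [
    ("kitten", "kittie", 1),
    ("kitten", "cat", 2),
]

def build_index(rows):
    """Pre-compute term -> [(distance, synonym), ...] sorted nearest first."""
    index = defaultdict(list)
    for term, synonym, distance in rows:
        index[term].append((distance, synonym))
    for entries in index.values():
        entries.sort()
    return dict(index)

def ranked_synonyms(index, term, max_distance=None):
    """Return synonyms nearest first, optionally cut off at max_distance."""
    return [
        synonym
        for distance, synonym in index.get(term, [])
        if max_distance is None or distance <= max_distance
    ]

index = build_index(ROWS)
print(ranked_synonyms(index, "kitten"))                  # ['kittie', 'cat']
print(ranked_synonyms(index, "kitten", max_distance=1))  # ['kittie']
```

The expensive distance computation happens once, when the index is built; each search is then a dictionary lookup.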
But how do we create this sense of semantic distance?
Relatedness and search
In the Beaty, we have a few potential distance metrics:
- Taxonomic proximity
- Vernacular name co-occurrence
- Other relations (spatial, temporal, topical, etc.)
Tree-based semantic distance
There is a semantic similarity between cats and dogs. If you have a tree like:
pets -> cats
pets -> dogs
You can say that cats are related to dogs via one parent, pets. Large hand-built dictionaries that relate concepts to one another can give you a distance of this kind.
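A sketch of a tree-based distance for the pets example, assuming the tree is stored as a child-to-parent map and that the distance is the larger number of upward steps either concept needs to reach the nearest shared ancestor. That definition is one reasonable choice, not the only one.

```python
# child -> parent, as in the tree above.
PARENT = {"cats": "pets", "dogs": "pets"}

def ancestors(node, parent):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def tree_distance(a, b, parent):
    """Steps up to the nearest common ancestor, or None if there is none."""
    steps_b = {node: steps for steps, node in enumerate(ancestors(b, parent))}
    for steps_a, node in enumerate(ancestors(a, parent)):
        if node in steps_b:
            return max(steps_a, steps_b[node])
    return None

print(tree_distance("cats", "dogs", PARENT))  # 1
```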
Taxonomic proximity
Allophrynidae Allophryne resplendens is a sibling of Allophrynidae Allophryne ruthveni
- You could define proximity to be the number of steps to a common parent.
- Therefore, this would be a distance of 1.
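For taxonomic names written as "family genus species", the same idea can be sketched by comparing lineages from the top down. The space-separated layout and the function names are assumptions made for this example.

```python
def lineage(name):
    """Split 'Family Genus species' into a tuple of ranks."""
    return tuple(name.split())

def taxonomic_distance(name_a, name_b):
    """Steps from the deeper taxon up to the lowest rank both names share."""
    a, b = lineage(name_a), lineage(name_b)
    shared = 0
    for rank_a, rank_b in zip(a, b):
        if rank_a != rank_b:
            break
        shared += 1
    if shared == 0:
        return None  # no common parent in this lineage
    return max(len(a), len(b)) - shared

print(taxonomic_distance(
    "Allophrynidae Allophryne resplendens",
    "Allophrynidae Allophryne ruthveni",
))  # 1
```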
Vernacular name co-occurrence
The Catalogue of Life gives us a list of vernacular names.
| ID | Lang | Name |
|---|---|---|
| 74YS2 | eng | slime membrane sponge |
| 74YS2 | fra | éponge poisseuse |
| 6SYS3 | deu | Feigenbohrschwamm |
| 6SYS3 | deu | Fleischschwamm |
| 6SYS3 | spa | esponja azul |
| 6SYS3 | eng | flesh sponge |
| 6SYS3 | eng | lobate fig sponge |
Vernacular name co-occurrence
We could define a vernacular distance.
| ID | Lang | Name |
|---|---|---|
| 74YS2 | eng | slime membrane sponge |
| 6SYS3 | eng | lobate fig sponge |
These two names share the key "sponge", which gives:
| ID_1 | ID_2 | Key | Distance |
|---|---|---|---|
| 74YS2 | 6SYS3 | sponge | 1 |
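A sketch of how a row like the one above could be derived: build an inverted index from each word to the IDs whose names contain it, then emit every pair of IDs that share a word. Assigning distance 1 to any shared word is an assumption based on the single example row, and the function name is made up.

```python
from collections import defaultdict
from itertools import combinations

# (ID, Lang, Name) rows from the Catalogue of Life excerpt.
ROWS = [
    ("74YS2", "eng", "slime membrane sponge"),
    ("74YS2", "fra", "éponge poisseuse"),
    ("6SYS3", "deu", "Feigenbohrschwamm"),
    ("6SYS3", "deu", "Fleischschwamm"),
    ("6SYS3", "spa", "esponja azul"),
    ("6SYS3", "eng", "flesh sponge"),
    ("6SYS3", "eng", "lobate fig sponge"),
]

def shared_word_pairs(rows, lang="eng"):
    """Return (ID_1, ID_2, key, distance) for IDs whose names share a word."""
    word_to_ids = defaultdict(list)  # inverted index, IDs in first-seen order
    for taxon_id, row_lang, name in rows:
        if row_lang != lang:
            continue
        for word in name.lower().split():
            if taxon_id not in word_to_ids[word]:
                word_to_ids[word].append(taxon_id)
    pairs = []
    for word, ids in word_to_ids.items():
        for id_1, id_2 in combinations(ids, 2):
            pairs.append((id_1, id_2, word, 1))  # assumed: shared word = distance 1
    return pairs

print(shared_word_pairs(ROWS))  # [('74YS2', '6SYS3', 'sponge', 1)]
```

Only names in the same language are compared here, so "sponge" and "esponja" do not count as a match.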
Beaty's special co-location metric
Let's define the following metrics:
- Collection event distance
- Collector distance
- Remarks distance
Topic models
Topic models are, roughly speaking, automatically generated groupings of similar records based on custom metrics.
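To make "groupings of similar records based on custom metrics" concrete, here is a very rough sketch: a record joins a group when it is within a threshold of any member, and groups it bridges are merged. This is a simple threshold grouping chosen for illustration, not a specific topic modelling algorithm, and the shared-word metric and all names are assumptions.

```python
def shares_word_distance(a, b):
    """Hypothetical custom metric: 1 if two names share a word, else 2."""
    return 1 if set(a.lower().split()) & set(b.lower().split()) else 2

def group_records(records, distance, threshold):
    """Group records that are within `threshold` of some member of the group."""
    groups = []
    for record in records:
        close, far = [], []
        for group in groups:
            if any(distance(record, member) <= threshold for member in group):
                close.append(group)
            else:
                far.append(group)
        # Merge every group the record is close to, then add the record itself.
        merged = [member for group in close for member in group] + [record]
        groups = far + [merged]
    return groups

names = [
    "slime membrane sponge",
    "flesh sponge",
    "lobate fig sponge",
    "esponja azul",
    "Fleischschwamm",
]
for group in group_records(names, shares_word_distance, threshold=1):
    print(group)
```

With this metric the three English sponge names end up in one group, while the Spanish and German names each stay alone.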