In “FileMaker Semantic Search – Part 1: Fundamental Power” we discussed how to configure your Claris FileMaker 2024 app to support semantic search. Here in Part 2 of this three-part blog series we'll dig deeper into the details of this awesome new functionality, including embeddings and cosine similarity, with a demo tutorial file for exploring AI Call Logging.
Cosine Similarity
As part of the Perform Semantic Find script step, you can configure a cosine similarity threshold. E.g., "return records only if their similarity to my search term is .75 or higher". I tend to like values in the .70 to .85 range; lower than .70 tends to return too many results, and higher than .85 tends to return too few results.
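For instance, a semantic find with a .75 threshold might be configured roughly like this in a script (a sketch only: the table and field names are hypothetical, and the exact option labels shown in the script workspace may differ):

Perform Semantic Find [ Query by: Natural language ; PROJECT::embedding ; "hot weather" ; Cosine similarity: .75 ]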
You can use the new CosineSimilarity function to expose to your users how similar each result is to their search term.
CosineSimilarity ( v1 ; v2 )
In the example below, I cache the embedding of the search term "hot weather"; each record then uses the CosineSimilarity function to compare that cached embedding with its own embedding.
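Here's a rough sketch of that pattern, assuming a hypothetical PROJECT::embedding container field. A script caches the embedding of the search term once:

Set Variable [ $$searchEmbedding ; Value: GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "hot weather" ) ]

Each record can then display its similarity via something like an unstored calculation or merge variable:

Round ( CosineSimilarity ( $$searchEmbedding ; PROJECT::embedding ) * 100 ; 0 ) & "% match"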

We can see that "hot weather" is an 81% match with "Giant Umbrella Project", but only a 75% match with "Solar-Powered Ice Cream Truck". Exposing the cosine similarity reinforces to the user that the most relevant result is at the top, and it also shows how quickly the similarity wanes as the user scrolls downward.
Embeddings
CosineSimilarity compares two vector embeddings. A vector embedding is a multi-dimensional, numeric representation of the semantic meaning of a text string, where each semantic ‘dimension’ is a number from -1 to 1 (the closer to 1 the number, the more the text is related to that semantic dimension). This demo has been tested with OpenAI's text-embedding-ada-002 model, which outputs 1,536 dimensions. That's a lot of ways to think of a text string!
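As a quick refresher on the math, CosineSimilarity computes the cosine of the angle between two vectors: the dot product of the vectors divided by the product of their magnitudes.

cosine similarity ( v1 ; v2 ) = ( v1 · v2 ) / ( ‖v1‖ × ‖v2‖ )

The result ranges from -1 (opposite meanings) to 1 (identical meanings). OpenAI notes that its embedding vectors are normalized to a length of 1, so for a model like text-embedding-ada-002 the cosine similarity is effectively just the dot product of the two embeddings.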
Let’s peek at one of those embeddings, via some new functions. First, GetEmbeddingAsText converts a container embedding to text.
GetEmbeddingAsText ( data )
If you had instead stored your embedding as text, and wanted to containerize it, you could use GetEmbeddingAsFile.
GetEmbeddingAsFile ( text {; fileNameWithExtension } )
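For example, to convert a text embedding stored in a hypothetical PROJECT::embedding_text field back into a container value, you might use something like this (the file name is also hypothetical):

Set Field [ PROJECT::embedding ; GetEmbeddingAsFile ( PROJECT::embedding_text ; "embedding.txt" ) ]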
And finally, the GetEmbedding function lets you generate an embedding file (to store in a container), without having to call the Insert Embedding or Insert Embedding in Found Set script steps.
GetEmbedding ( account ; model ; text )
Let's combine two of those functions to see the embedding for the word "strawberry".
GetEmbeddingAsText ( GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "strawberry" ) )
Here's a portion of the monster result:
[-0.013456542000000000156, -0.029335528999999999078, -0.0024224438000000000915, -0.01159312399999999979, -0.01380260600000000025, 0.0080725940000000006441,... -0.015599473000000000852]
Each of those numbers is a dimension. Since my chosen model supports 1,536 dimensions, the full embedding for what I truncated above would include 1,536 numbers.
Example Cosine Similarities
Now we can do fun things like this:
CosineSimilarity (
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "strawberry" ) ;
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "apple" )
) = .83659222844622005422
.84 seems very similar, as we'd guess. But what about this?
CosineSimilarity (
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "strawberry" ) ;
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "up" )
) = .78570410647567556772
Hmm. That makes it seem like "strawberry" is almost as closely related to “up” as it is to “apple”. But what about this:
CosineSimilarity (
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "strawberry" ) ;
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "sweet, red fruit with a green stem and tiny seeds on the outside" )
) = .84403333253313084228
That last example has a higher cosine similarity than the first example comparing “strawberry” and “apple”, which we’d expect, since the two strings more closely describe the same thing. But that’s a narrow range between the third, helpful example (.84) and the second, unhelpful example (.79). We can’t anticipate a user’s search term, nor can we intuitively read our data’s vector embeddings, so how can we choose a cosine similarity threshold that distinguishes good results from poor ones? One method is to not limit the results at all: if you return every result and expose the cosine similarity to the user, the user can judge at what point the results become poor. (In a subsequent section I’ll propose another method for arriving at a good cosine similarity.)
One of the cool things about embeddings is how they can compress a many-word string into a single numerical array. The more text in that string, the more opportunity for a rich embedding. In a traditional FileMaker search (with a single find request), the more verbose the search term, the fewer results will be returned. With a semantic search, the opposite can be true. If you’ll recall from above, searching for “hot weather” returned only 5 results. But searching for “hot weather robotic farming” returns 21 results.

Case Sensitivity
I was surprised to find that the embeddings are case sensitive. Take this example:
CosineSimilarity (
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "hot" ) ;
GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; "HOT" )
) = .89758793488336974242
That means “hot” and “HOT” are only 90% similar (whereas I might have expected them to be 100% similar). But should I be surprised? Case carries meaning. When capitalized, “HOT” might be emphatic, or be an acronym. So there might be extra meaning there that distinguishes it from “hot”.
In one data set, I searched for “the dog is really happy” and found 26 records with at least 85% similarity. When I searched for “the dog is REALLY happy” I found only 16 records. In both cases my top hit was “The dog wagged its tail happily.”, but that result was 91% similar to my first search term, without the capitalization, whereas it was only 89% similar to my second search term, with the capitalization.
In practice this means users might get wildly different results because of how they choose to (or neglect to) include case in their search term. I’m unaccustomed to including case in my own search terms, because doing so requires an extra keystroke. But I think it would be a mistake to programmatically avoid this nuance (e.g., by dropping the case of the search term before getting its embedding); doing so would mean the user could never add case meaning to their search term. I think it’s better to train users on this nuance, so they can leverage this power when appropriate.
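For reference, that case-flattening approach (which I don't recommend) would be as simple as wrapping the search term in Lower before generating its embedding, where $searchTerm is a hypothetical variable holding the user's input:

GetEmbedding ( GLOBAL::llm_account ; GLOBAL::llm_model ; Lower ( $searchTerm ) )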
Sort Order
Sorting the results by relevancy is such an awesome part of semantic search. Users are already trained from other applications to expect this; it saves the user time; and it starts to make more sense to programmatically limit the number of results when we have confidence that the best results are shown.
But how is FileMaker delivering this cool new sort? When we look at the sort order established by the Perform Semantic Find script step, we see that it sorts descending by the embedding field itself.

Remember: this embedding field is a container, and we can't normally sort on container fields. It's as if displaying the container field in the sort order is FileMaker's way of saying "sorted by the cosine similarity of this embedding field and whatever embedding was used in the semantic find".
In that sense, it’s as if the found set is sorted by some unseen, unstored calculation: FileMaker retains in temporary memory the cosine similarity of each record in the found set relative to the embedding of the originating search term.
What happens in a multi-window environment? The sort is scoped to your window: a different window can sort the same records via a different semantic search, producing a different sort order. Very cool.
AI Call Logging
For various reasons, you might want to look behind the curtain of these new script steps, to learn more about what information those steps are passing to and receiving from your selected model. That's where the new Set AI Call Logging script step comes in.
Set AI Call Logging [ ]
When AI call logging is enabled, executing any of these new AI script steps will add to a local log file (you'll see more detail if you choose the Verbose option).
Examining the log while you're developing this new feature might give you insight into how successful your cosine similarity threshold is; e.g., you might notice that the threshold you've set is failing to find any records more often than not.
So what do you do with that information? In this example file, I extend the Perform Semantic Search demo file to include AI call logging.
Tutorial / Demo File
Download Beezwax AI Call Logging.fmp12 (2.4 MB)
- Download includes FileMaker (.fmp12) demo file, LICENSE and README.
- FileMaker 2024 (v21.x) is required to use the demo file.
- For this file and those in Part 1 and Part 3, you’ll need to supply your own AI credentials to use the file (via the gear icon in the top right).
- If you employ credentials that aren’t from OpenAI, you’ll also need to configure the LLM Embedding Model field and execute the embeddings script (to build embeddings compatible with the embedding model you’ve chosen).
Using the Beezwax AI Call Logging file
First, before configuring the AI account, the script will turn on verbose call logging.
Second, after every semantic find is performed, the script checks for an error. If no records were found, the script performs the semantic search again, but with a lower cosine similarity (the script will make 4 attempts). Though we want relevant results, we prefer some results over no results, so let's lower the bar a bit until we get a hit. Since the appropriate cosine similarity threshold might depend on the search term itself, using this iterative approach will help your users find results consistently.
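A minimal sketch of that loop might look something like this, using a hypothetical PROJECT::embedding field and $searchTerm variable, with the Perform Semantic Find options abbreviated (the starting threshold, step size, and attempt count are illustrative only):

Set Error Capture [ On ]
# Start at a strict threshold and loosen it on each failed attempt
Set Variable [ $threshold ; Value: .85 ]
Set Variable [ $attempts ; Value: 0 ]
Loop
  Perform Semantic Find [ Query by: Natural language ; PROJECT::embedding ; $searchTerm ; Cosine similarity: $threshold ]
  # Error 401 = no records match the request
  Exit Loop If [ Get ( LastError ) ≠ 401 or $attempts ≥ 3 ]
  Set Variable [ $threshold ; Value: $threshold - .1 ]
  Set Variable [ $attempts ; Value: $attempts + 1 ]
End Loop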
Third, once the semantic search loop has finished, a subscript harvests the AI call log from the user's machine, archiving the section from the log corresponding to that semantic search loop. With this in place, the developer can gain insight into how the semantic search is performing across the user base (since normally the developer would have access to the call log on their machine only). It's particularly interesting to see the cosine similarity of each record eventually returned.
In this example I searched for “hot weather”. The script first attempted to find results using a .85 cosine similarity, but that resulted in error 401 (no records found). So it then tried again with a .75 cosine similarity, finding 5 results.

Note: the verbose call log will also include the entire embedding of the search term, but for readability I’ve truncated that embedding in the demo file. Notice that the embedding is included only once in the log, even though I performed two semantic searches. When I shifted to this iterative approach (and when I exposed the cosine similarity in the interface) it made sense to store the embedding of the search term so that I didn’t have to continually regenerate that embedding each time I needed to reference it. This matters for performance, but also for financial cost.
Cost
If you configure your app with an account from a commercial LLM provider, there will be an associated cost for each call you make to a model using that account. Each time you create an embedding, the number of “tokens” you send to construct that embedding will contribute to your overall cost, where a token is a word, portion of a word, or even a punctuation mark in your text string.
Getting the embedding of a search term will typically use only a few tokens, but getting the embedding of a large text blob, e.g., that describes an entire record, will use many more tokens. You might have noticed that the AI call log includes the token count of the search term. FileMaker 2024 also gives us a new function to calculate this on our own:
GetTokenCount ( text )
Here are some examples:
GetTokenCount ( "hot weather" ) = 2
GetTokenCount ( "Giant Umbrella Project¶Deploying a network of giant umbrellas in urban areas to provide shade and mitigate the effects of heatwaves." ) = 28
Note that there are only 21 words in that second example, so some words are being broken into more than one token.
So although we want a rich embedding that fully captures the nuances of our data, we might have a financial incentive to be efficient with how we build those embeddings. One way is to avoid unnecessary duplication, as I mentioned above via caching the embedding of the search term. Another way is to be selective with which fields you include in a record’s embedding (LLMs are best suited to text).
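For example, the text used to build each record’s embedding could be limited to the fields that carry semantic meaning, via a calculation along these lines (the PROJECT field names are hypothetical):

PROJECT::name & ¶ & PROJECT::description

That keeps IDs, timestamps, and other fields with little semantic value out of the token count.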
Security
So if the cost of an embedding relates to the tokens your text string is broken into, does that mean your text string (data) is somehow being shared with your LLM provider?
Yes. Yes it is.
And for that reason alone you might decide to (or be required to) leverage a local model, where no data is transmitted beyond your app’s protected ecosystem.
My demos are using test data generated by ChatGPT, so I’ll freely share that back to OpenAI to build embeddings. For real data, you’d have to decide how sensitive the data is, and how well that aligns with the LLM provider’s security policy. Here’s a sample statement from OpenAI’s security policy:
Data submitted through the OpenAI API is not used to train OpenAI models or improve OpenAI's service offering.
https://openai.com/security
That’s not the same as saying “we don’t retain your data at all”, though. So you’ll need to find the right LLM provider and model for your app.
Note that several of FileMaker 2024’s new functions/script steps do not invoke an API call, and so do not have an associated cost/security impact:
- Functions: CosineSimilarity, GetEmbeddingAsFile, GetEmbeddingAsText, GetTokenCount
- Script Step: Perform Semantic Find (when configured to query by vector data)
Perform Semantic Find vs. Perform Find
I love the new semantic search, but it’s a tool in addition to the existing Perform Find, rather than a replacement for it. Whenever possible I try to keep the power of a manual find at my users’ fingertips. I’d prefer Perform Find over Perform Semantic Find in these scenarios:
- Searching a number, time, or timestamp field. A semantic search won’t behave well on these types of fields anyway, but regardless, FileMaker already supports awesome operators on these field types, like ranges and wildcards.
- Searching for literal strings. A semantic search looks for meaning, but often I want to search for a specific pattern, like “is null”, “is not null”, “begins with”, or “contains”. Perform Find makes this easy.
- Searching for an exact match. A semantic search is ideal when we can’t anticipate how the search term or the data are structured. But when a user wants to find people who live in New York, they’re looking for an exact match on a specific field, not for people who are “like New York”.
- Compound find request. In Part 1 I mentioned my semantic search wasn’t quite understanding the nuances of “hot or cold” vs. “hot but not cold”. A compound find request allows a user to structure their search term (see the sketch after this list).
- Extend and Omit steps. At present, Perform Semantic Find supports constraining the found set but not extending it, so we can’t yet combine multiple semantic searches. And there’s no substitute for an omit, allowing the user to massage their found set into exactly what they want.
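As a concrete example of the compound find mentioned above, here’s a minimal sketch of a scripted “hot but not cold” search against a hypothetical PROJECT::description field; the second request is an omit request, removing the “cold” matches:

Enter Find Mode [ Pause: Off ]
Set Field [ PROJECT::description ; "hot" ]
New Record/Request
Set Field [ PROJECT::description ; "cold" ]
Omit Record
Perform Find []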
Now that we’ve covered so many of the details in this new feature, in “FileMaker Semantic Search: Part 3 – Advanced Fun” I’ll move on to some really fun examples.
FileMaker 2024 Semantic Search – Reference Blog Posts
Stained glass art © 2024 Uma Kendra, used with permission.