Generative code: how Frontier solves the LLM Security and Privacy issues4 min read
Reading Time: 3 minutesWhen it comes to generative AI and LLMs, the first question we get is how we approach the security and privacy aspects of Frontier. This is a reasonable question given the copyright issues that many AI tools are plagued with. AI tools, after all, train on publicly available data and so could expose companies to potential copyright liability.
But it’s not just that, companies have invested heavily in their design language and design systems, which they would never want to expose externally and their code base is also a critical asset which they would never want to partake in LLM or AI training.
When designing Frontier, privacy and security were foremost concerns from day one. First, it was clear to us that Frontier users cannot expose their codebase to anyone, including us. That means that much of the data processing had to take place on the user’s device, which is quite difficult given that we are running in a sandbox inside a VSCode Extension. Secondly, we needed to expose the minimum amount of data and design to the cloud. Additionally, any data that needed to be stored, had to be stored in such a way where it could be shared by multiple team members, but should not be stored on the cloud. Finally, none of our models could have any way to train from the user’s design or codebase.
The first part was isolating the Figma designs. By building a simplified data model, built in memory from within VSCode, using the user’s own credentials, we are effectively facilitating an isolated connection between the user and Figma APIs without us in between and without our servers even seeing a copy of the design.
The typical implementation used for generative code generation is to collect the entire code base, break it into segments, encode the segments into embeddings and storing them into a vector database. This approach is effective but won’t work well in our case, since storing this data on our servers would mean we are exposed to the data. In addition, the code base is continually evolving and would need to be reencoded and stored every so often, which would make this process slow and ineffective.
Instead, our approach was to develop an in-memory embedding database, which can be stored and retrieved locally and rebuilds extremely quickly, even on large codebases. In order to secure this data, we store it on the user’s workspace, where it can be included in the git repository and shared between the users, or simply rebuilt per-user.
But this would be useless if we would have to send a large code sample to an LLM for each line of code we generate. Instead, we implemented a local model that runs in VSCode, so when we do need to use an LLM, we share the interface of the components instead of needing code. Users can improve the results by opting in to include some real-world usage examples of how Button is used in the codebase, sharing with the LLM a simplified thin code showing how Button component is used in the code base, but not how Button is implemented or what it actually looks like or does…
By limiting the amount of data and anonymizing it, we can guarantee that the LLM doesn’t get trained or store the user’s code in any way.
But how do we guarantee that data doesn’t get “leaked” from outside sources that the LLM trained on back into the codebase, exposing the company to potential copyright risk? First, we limit the type of code that the LLM can generate to specific component implementations, only after it passes a guard rail system. The LLM Guard rail validates the code makes sense, and can identify hallucinations that might invalidate the code or introduce copyright liability to the code base. If the code passes the guard rail system, we are extremely sure that the results correlate with what the user expects from the component code.
Finally, for full transparency, we store the data in open JSON files inside the .anima folder on your project’s workspace. Different workspaces would have different settings and components. Sharing this information between users can be done through git (or a shared file system of any kind), which eliminates Anima from being exposed to any of the cached data for components, usage or the entire codebase or Figma design data.