Upsert and upsertAll with Spring Data JPA and CockroachDB
JPA is a powerful ORM, but its performance varies with the use case and configuration. Much like JPA's save() method, which handles both INSERT and UPDATE operations, most modern databases provide equivalent functionality at the database layer called UPSERT. Letting the database handle the INSERT-or-UPDATE decision through UPSERT is more reliable and efficient than having JPA fire two queries (a SELECT followed by an INSERT or UPDATE inside save()), especially when your use case revolves around bulk inserts or updates. This post demonstrates how to configure upsert() and upsertAll() methods, analogous to save() and saveAll(), for all the JPA repositories in a project.
Although there are a couple of ways to do this, to keep it generic I'll use reflection on the entity class and its annotations to frame the query ("UPSERT INTO table_name (...) VALUES (...), (...)") and execute it through the EntityManager.
For this demonstration, I used the 'users' table of the 'movr' database on a single-node local CockroachDB cluster.
Let’s get started!
To begin with, create a custom JPA repository interface extending JpaRepository that declares the upsert() and upsertAll() methods. All the repositories defined in the project should extend this custom repository instead of JpaRepository.
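A minimal sketch of such an interface might look like the following. The names here are illustrative (the actual project may use different ones); @NoRepositoryBean keeps Spring Data from trying to create a repository bean for this intermediate interface itself.

```java
import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.repository.NoRepositoryBean;

// Intermediate interface: every repository in the project extends this
// instead of JpaRepository, inheriting upsert() and upsertAll().
@NoRepositoryBean
public interface ExtendedJpaRepository<T, ID> extends JpaRepository<T, ID> {

    // UPSERT a single entity at the database layer.
    T upsert(T entity);

    // UPSERT a batch of entities in one (or a few) native statements.
    List<T> upsertAll(List<T> entities);
}
```

A concrete repository then simply extends it, e.g. `public interface UserRepository extends ExtendedJpaRepository<User, UUID> {}`.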
Now comes the important part: implementing the custom JPA repository. To frame the query, we need three things: the table name, the column names in the table, and the data values for those columns. Fetching the table name is simple with reflection on the entity class, but fetching the column names is more complicated because of superclasses and embedded IDs. To tackle this, I'll use recursion to fetch the required values. A brief explanation of the functions in the custom JPA repository implementation follows.
1. getColumnFields - Fetches all the declared fields in the entity class and its superclasses (if any) that are annotated with Column.class, JoinColumn.class, or EmbeddedId.class.
2. formInitialQuery - Forms the initial query 'UPSERT INTO table_name (column_names) VALUES '.
3. populateTableColumns - Populates the column names in the above initial query. Recursion is used for EmbeddedId.class.
4. populateColumnParams - For a given entity, parameterizes each column name. The output of this function looks something like ":id0,:city0,:name0,:address0,:credit_card0", where 0 is the passed index.
5. populateColumnValues - For each parameterized column from the above function, sets the column value on the native query.
6. upsert and upsertAll - upsert for a single entity, upsertAll for a list of entities. The PostgreSQL wire protocol (which CockroachDB speaks) limits a single statement to 32767 bind variables, so any upsertAll call whose variable count exceeds this limit is split into batches and executed.
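Stripped of the JPA plumbing, the query-building core of steps 1 to 4 can be sketched in plain Java. This is a simplified illustration, not the actual implementation: the real code reads @Column, @JoinColumn, and @EmbeddedId annotations (recursing into embedded IDs) and binds values on the EntityManager's native query, while this sketch just walks the declared fields of a plain class to show how the SQL string is assembled.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class UpsertQueryBuilder {

    // Hypothetical simplified entity; the real code inspects JPA annotations.
    static class User {
        Long id;
        String city;
        String name;
    }

    // Mirrors getColumnFields: walk the class and its superclasses collecting fields.
    static List<Field> getColumnFields(Class<?> type) {
        List<Field> fields = new ArrayList<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                fields.add(f);
            }
        }
        return fields;
    }

    // Mirrors formInitialQuery + populateTableColumns:
    // "UPSERT INTO table_name (col1, col2, ...) VALUES "
    static String formInitialQuery(String tableName, List<Field> fields) {
        String columns = fields.stream()
                .map(Field::getName)
                .collect(Collectors.joining(", "));
        return "UPSERT INTO " + tableName + " (" + columns + ") VALUES ";
    }

    // Mirrors populateColumnParams: one parameter group per entity,
    // suffixed with the entity's index, e.g. "(:id0,:city0,:name0)".
    static String populateColumnParams(List<Field> fields, int index) {
        return fields.stream()
                .map(f -> ":" + f.getName() + index)
                .collect(Collectors.joining(",", "(", ")"));
    }

    public static void main(String[] args) {
        List<Field> fields = getColumnFields(User.class);
        StringBuilder sql = new StringBuilder(formInitialQuery("users", fields));
        for (int i = 0; i < 2; i++) {   // two entities in this batch
            if (i > 0) sql.append(",");
            sql.append(populateColumnParams(fields, i));
        }
        System.out.println(sql);
    }
}
```

In the real implementation, each `:name0`-style parameter is then bound with `query.setParameter(...)` (step 5), and the batch loop is capped so the total bind-variable count stays under the 32767 limit.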
Finally, we have to let Spring know that it should use the custom JPA repository implementation as the repository base class instead of the default one, with the following class-level annotation.
@EnableJpaRepositories(repositoryBaseClass = ExtendedJpaRepositoryImpl.class)
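In a Spring Boot project, this annotation typically sits on the main application (or a configuration) class. A sketch, assuming the implementation class name from above and a hypothetical application class name:

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.data.jpa.repository.config.EnableJpaRepositories;

// Tell Spring Data to build every repository proxy on top of our custom
// base class instead of the default SimpleJpaRepository.
@SpringBootApplication
@EnableJpaRepositories(repositoryBaseClass = ExtendedJpaRepositoryImpl.class)
public class MovrApplication {
    public static void main(String[] args) {
        SpringApplication.run(MovrApplication.class, args);
    }
}
```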
Performance comparison of save/saveAll vs upsert/upsertAll
All the attached tests are for insert operations with a batch_size of 250, run against CockroachDB. I performed the same tests for updates with different batch sizes, and the comparative results were the same. As the tables show, performance is more or less the same for up to 100 entities, but for bulk inserts upsertAll cuts execution time by 60–80% against a JPA entity implementing the Persistable interface, and by almost 99% against the traditional saveAll. Put in terms of speedup, upsertAll is roughly 2–4x faster than the Persistable approach and over 300x faster than the traditional saveAll.
Although the results are compelling, the catch with this implementation is the memory requirement, especially when the application is deployed to the cloud with 1 GB of memory. As the number of entities grows, so does the memory needed to form and hold the large query strings. This can be optimized, but for now it's up to the developer to ensure the application never issues an upsertAll for 50k or 60k records at once.
I encountered similar results with Spring Data JDBC as well, but more on that in a coming post. You can find the complete source code for this post on my GitHub.