The two approaches exhibit two key differences:ġ) Zero-Copy. These communication layers include support for Python built-in types and PyArrow Tables. So we are delighted to announce the recent addition of Scalar Python UDFs to DuckDB.ĭuckDB provides support for two distinct types of Python UDFs, differing in the Python object used for communication between DuckDB’s native data types and the Python process. However, this duck is not limited to just waddling it can also fly. Consequently, the initial focus of extensibility efforts was centered around C++. As a DBMS tailored for analytical tasks, performance is a key consideration, leading to the implementation of its core in C++. Consequently many users cannot quickly implement their UDFs without investing a significant amount of time in learning a low-level language and understanding the internal details of the DBMS.ĭuckDB followed a similar approach. UDFs in higher-level languages like Python incur significant performance costs. High-Performance UDFs are typically only supported in low-level languages. 2) The difficulty of implementation is a common deterrent for users. Therefore, the impact on server stability is not a significant worry. However, when it comes to DuckDB, an embedded database, this concern is mitigated as each analyst runs their own DuckDB process separately. Since UDFs are custom code created by users and executed within the DBMS process, there is a potential risk of crashing the server. 1) There are security concerns associated with UDFs. There are two main reasons users often refrain from implementing UDFs. The sensitive data never leaves the DBMS process. The functions can be utilized in various SQL contexts (e.g., subqueries, join conditions).ģ) Safety. This eliminates the need for passing data through a separate database connector and executing external code. UDFs can be seamlessly integrated into SQL queries, allowing users to leverage the power of SQL to call the functions. The function could be executed using the same execution model (e.g., streaming results, beyond-memory/out-of-core execution) of the DBMS, and without any unnecessary transformations.Ģ) Easy Use. This approach offers several advantages:ġ) Performance. Ideally, this function would be executed directly in the DBMS. For instance, users who frequently need to export private data can benefit from an anonymization function that masks the local part of an email while preserving the domain. User Defined Functions (UDFs) enable users to extend the functionality of a Database Management System (DBMS) to perform domain-specific tasks that are not implemented as built-in functions. By implementing Python UDFs, users can easily expand the functionality of DuckDB while taking advantage of DuckDB’s fast execution model, SQL and data safety. TLDR: DuckDB now supports vectorized Scalar Python User Defined Functions (UDFs). Pedro Holanda, Thijs Bruineman and Phillip Cloud From Waddle to Flying: Quickly expanding DuckDB's functionality with Scalar Python UDFs
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |