DuckDB, the in-process analytical database management system used by Google, Facebook and Airbnb, has released its 0.5.0 iteration.
The brainchild of academics at Amsterdam’s Centrum Wiskunde & Informatica, a research center for mathematics and theoretical computing, DuckDB is embedded in a host process. There is no DBMS server software to install, update or maintain.
For example, the DuckDB Python package can run queries directly against data from the Python Pandas software library without importing or copying data. Written in C++, DuckDB is free and open source under the MIT license.
Consulting and support are provided by DuckDB Labs. Co-founder and CEO Hannes Mühleisen, who also co-wrote the code and maintains the project, told The Register he was inspired by SQLite, the serverless OLTP database engine, and saw the opportunity for a similar approach, but for analytics.
“We were working a lot with data science practitioners and they had all these problems that weren’t theoretical problems in computational research anymore – they were solved a long time ago – but somehow the software just wasn’t there for them. With the commercial software vendors, the technology was in some of those packages, but not accessible or hidden behind many, many layers of corporate bullshit,” he said.
Mühleisen and his co-founder began to realize that an overhaul of the database architecture might be needed for OLAP. “We took this idea of in-process data management systems where the whole database manager runs in the process you’re in – say, Python or even Excel – and redesigned a system to be first in class for OLAP using this approach,” said Mühleisen, who is still a senior researcher at his academic institution.
DuckDB is also often used as part of a larger data management or analytics stack. For example, if someone creates a custom application that collects data, and then wants to create an SQL interface, they may first have to copy the data and move it to another system, which could lead to problems with synchronization, he said. But DuckDB can query third-party datasets as if they were its own data. “You can design this on top of an existing app or data set. And people are doing it,” he said.
The popularity of the system among data tool builders has even prompted its own meme.
The first release came in 2019, and the project has steadily grown in popularity, with users including Google, Facebook, and Airbnb.
This week the project published its 0.5.0 release.
Among the new features, highlights include “out-of-core” processing, which aims to address problems that can arise when in-flight data is larger than memory by offloading intermediate results to disk. The project also added join order optimization, which tackles a common problem in analytic databases. Hyoun Park, CEO and Chief Analyst at Amalgam Insights, said DuckDB’s differentiation comes from the fact that it is a small application that works in code-based processes to quickly analyze large data stores.
“This is increasingly important as workloads are distributed, performance is needed in a variety of analytical use cases, and analytical data continues to double year over year in large organizations,” Park said. “As an open-source database that can be easily integrated into specific analytical tasks, DuckDB is well suited to fill the gaps where traditional monolithic OLAP databases are more rigid, more expensive, or require transfer and duplication efforts to support analytical variety.
“DuckDB can often run queries directly on the data without any middleman processing, which improves processing. From a pure technology standpoint, it’s somewhat similar to Actian Vector, which also takes a columnar vectorized OLAP query approach, although Actian is designed to contribute data rather than working within a specific process or workload.”
But there are clear limits as to when and where the system should and should not be used. Although in some respects it offers a cheap alternative to a data warehouse and can provide every data scientist with a system on their laptop, it does not necessarily replace the enterprise data warehouse systems from companies such as Teradata, Oracle and IBM. The home page clearly states that it should not be used for “large client/server installations for centralized enterprise data warehousing”.
“It’s a matter of priorities for your organization or a data problem. Does it really depend on everyone working on the same data? If so, then maybe it’s not the best solution,” said Mühleisen.
This being the world of open source databases, the project comes with an unusual name. While CockroachDB was named after its supposedly unkillable nature, and MongoDB was a contraction of “humongous”, DuckDB was of course named after Mühleisen’s pet duck Wilbur, who, by the way, has appeared in The Guardian newspaper.
The project is working towards its version 1.0, after which backward-incompatible changes will not appear. “I think we get there with a lot of hard work. We always say by the end of the year, but I’m afraid that won’t happen this year,” Mühleisen said. ®