DuckDB 入門


これは何

  • DuckDB 入門メモ

DuckDB とは

  • DuckDB はオープンソースの DBMS
  • SQLite っぽい感じだが OLAP 向けに設計されている

DuckDB – An in-process SQL OLAP database management system
https://duckdb.org/

duckdb/duckdb: DuckDB is an analytical in-process SQL database management system
https://github.com/duckdb/duckdb

とりあえず動きが見たい

  • web ブラウザで動作する Demo がある

DuckDB Shell
https://shell.duckdb.org/

インストール

  • brew でインストール可能

    brew install duckdb

DuckDB Installation – DuckDB
https://duckdb.org/docs/installation/

% duckdb -help   
Usage: duckdb [OPTIONS] FILENAME [SQL]
FILENAME is the name of an DuckDB database. A new database is created
if the file does not previously exist.
OPTIONS include:
   -append              append the database to the end of the file
   -ascii               set output mode to 'ascii'
   -bail                stop after hitting an error
   -batch               force batch I/O
   -box                 set output mode to 'box'
   -column              set output mode to 'column'
   -cmd COMMAND         run "COMMAND" before reading stdin
   -c COMMAND           run "COMMAND" and exit
   -csv                 set output mode to 'csv'
   -echo                print commands before execution
   -init FILENAME       read/process named file
   -[no]header          turn headers on or off
   -help                show this message
   -html                set output mode to HTML
   -interactive         force interactive I/O
   -json                set output mode to 'json'
   -line                set output mode to 'line'
   -list                set output mode to 'list'
   -markdown            set output mode to 'markdown'
   -newline SEP         set output row separator. Default: '\n'
   -nofollow            refuse to open symbolic links to database files
   -no-stdin            exit after processing options instead of reading stdin
   -nullvalue TEXT      set text string for NULL values. Default ''
   -quote               set output mode to 'quote'
   -readonly            open the database read-only
   -s COMMAND           run "COMMAND" and exit
   -separator SEP       set output column separator. Default: '|'
   -stats               print memory stats before each finalize
   -table               set output mode to 'table'
   -unredacted          allow printing unredacted secrets
   -unsigned            allow loading of unsigned extensions
   -version             show DuckDB version

% duckdb
v1.1.1 af39bd0dcf
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D .help
.bail on|off             Stop after hitting an error.  Default OFF
.binary on|off           Turn binary output on or off.  Default OFF
.cd DIRECTORY            Change the working directory to DIRECTORY
.changes on|off          Show number of rows changed by SQL
.check GLOB              Fail if output since .testcase does not match
.columns                 Column-wise rendering of query results
.constant ?COLOR?        Sets the syntax highlighting color used for constant values
.constantcode ?CODE?     Sets the syntax highlighting terminal code used for constant values
.databases               List names and files of attached databases
.dump ?TABLE?            Render database content as SQL
.echo on|off             Turn command echo on or off
.excel                   Display the output of next command in spreadsheet
.edit                    Opens an external text editor to edit a query.
.exit ?CODE?             Exit this program with return-code CODE
.explain ?on|off|auto?   Change the EXPLAIN formatting mode.  Default: auto
.fullschema ?--indent?   Show schema and the content of sqlite_stat tables
.headers on|off          Turn display of headers on or off
.help ?-all? ?PATTERN?   Show help text for PATTERN
.highlight [on|off]      Toggle syntax highlighting in the shell on/off
.import FILE TABLE       Import data from FILE into TABLE
.indexes ?TABLE?         Show names of indexes
.keyword ?COLOR?         Sets the syntax highlighting color used for keywords
.keywordcode ?CODE?      Sets the syntax highlighting terminal code used for keywords
.log FILE|off            Turn logging on or off.  FILE can be stderr/stdout
.maxrows COUNT           Sets the maximum number of rows for display (default: 40). Only for duckbox mode.
.maxwidth COUNT          Sets the maximum width in characters. 0 defaults to terminal width. Only for duckbox mode.
.mode MODE ?TABLE?       Set output mode
.nullvalue STRING        Use STRING in place of NULL values
.once ?OPTIONS? ?FILE?   Output for the next SQL command only to FILE
.open ?OPTIONS? ?FILE?   Close existing database and reopen FILE
.output ?FILE?           Send output to FILE or stdout if FILE is omitted
.parameter CMD ...       Manage SQL parameter bindings
.print STRING...         Print literal STRING
.prompt MAIN CONTINUE    Replace the standard prompts
.quit                    Exit this program
.read FILE               Read input from FILE
.rows                    Row-wise rendering of query results (default)
.schema ?PATTERN?        Show the CREATE statements matching PATTERN
.separator COL ?ROW?     Change the column and row separators
.sha3sum ...             Compute a SHA3 hash of database content
.shell CMD ARGS...       Run CMD ARGS... in a system shell
.show                    Show the current values for various settings
.system CMD ARGS...      Run CMD ARGS... in a system shell
.tables ?TABLE?          List names of tables matching LIKE pattern TABLE
.testcase NAME           Begin redirecting output to 'testcase-out.txt'
.timer on|off            Turn SQL timer on or off
.width NUM1 NUM2 ...     Set minimum column widths for columnar output

JSON ファイルをインポートする

  • DuckDB は CSV や JSON ファイルをロードしてよしなに使うことが出来る
  • JSON ファイルのドキュメントを読みつつ、動作を確認していく

JSON Loading – DuckDB
https://duckdb.org/docs/data/json/overview

データを用意

todos.json として、以下のようなデータを用意する

[
  {
    "userId": 1,
    "id": 1,
    "title": "Buy groceries",
    "completed": false
  },
  {
    "userId": 1,
    "id": 2,
    "title": "Attend meeting",
    "completed": true
  },
  {
    "userId": 2,
    "id": 3,
    "title": "Write report",
    "completed": false
  },
  {
    "userId": 2,
    "id": 4,
    "title": "Plan trip",
    "completed": true
  }
]

read_json 関数を使用する

duckdb で read_json 関数を使用する

SELECT *
FROM read_json('todos.json',
               format = 'array',
               columns = {userId: 'UBIGINT',
                          id: 'UBIGINT',
                          title: 'VARCHAR',
                          completed: 'BOOLEAN'});

実行結果の例

% duckdb
v1.1.1 af39bd0dcf
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SELECT *
  FROM read_json('todos.json',
                 format = 'array',
                 columns = {userId: 'UBIGINT',
                            id: 'UBIGINT',
                            title: 'VARCHAR',
                            completed: 'BOOLEAN'});
┌────────┬────────┬────────────────┬───────────┐
│ userId │   id   │     title      │ completed │
│ uint64 │ uint64 │    varchar     │  boolean  │
├────────┼────────┼────────────────┼───────────┤
│      11 │ Buy groceries  │ false     │
│      12 │ Attend meeting │ true      │
│      23 │ Write report   │ false     │
│      24 │ Plan trip      │ true      │
└────────┴────────┴────────────────┴───────────┘
D 

stdin でファイルを入力する使用する方法

% cat ./todos.json | duckdb -c "SELECT * FROM read_json('/dev/stdin')"
┌────────┬───────┬────────────────┬───────────┐
│ userId │  id   │     title      │ completed │
│ int64  │ int64 │    varchar     │  boolean  │
├────────┼───────┼────────────────┼───────────┤
│      11 │ Buy groceries  │ false     │
│      12 │ Attend meeting │ true      │
│      23 │ Write report   │ false     │
│      24 │ Plan trip      │ true      │
└────────┴───────┴────────────────┴───────────┘

JSON からテーブルを作成する

JSON ファイルを COPY してテーブルとする

CREATE TABLE todos (userId UBIGINT, id UBIGINT, title VARCHAR, completed BOOLEAN);
COPY todos FROM 'todos.json';

SELECT で JSON ファイルを読み取ってテーブルとする

CREATE TABLE todos AS
    SELECT * FROM 'todos.json';

JSON ファイルに SQL でクエリした結果をテーブルとする

COPY (SELECT * FROM todos) TO 'todos.json';

テーブル情報を確認する

.tables でテーブルリストを確認

D .tables
todos

DESCRIBE, SHOW でテーブルスキーマを確認

D DESCRIBE todos;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ userId      │ BIGINT      │ YES     │         │         │         │
│ id          │ BIGINT      │ YES     │         │         │         │
│ title       │ VARCHAR     │ YES     │         │         │         │
│ completed   │ BOOLEAN     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
D SHOW todos;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ userId      │ BIGINT      │ YES     │         │         │         │
│ id          │ BIGINT      │ YES     │         │         │         │
│ title       │ VARCHAR     │ YES     │         │         │         │
│ completed   │ BOOLEAN     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

S3 Object として保存してある JSON をロードする

# AWS クレデンシャルのロード
D CREATE SECRET mysecret (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

# JSON をロード
D select * from read_json_auto('s3://mybucket/myobject.json');

AWS Extension – DuckDB
https://duckdb.org/docs/extensions/aws.html

S3 API Support – DuckDB
https://duckdb.org/docs/extensions/httpfs/s3api

感想

  • サクッと動かせてよかった
  • DuckDB は JSON に限らず様々なデータをロードできるため、ちょっとしたデータ分析を SQL で行いたい場合に便利に使えそう
  • 個人的には AWS CLI などのツールで巨大な JSON を調べることが多いので、そういう分析に便利に使えたら良さそうだなと思っている

Guides – DuckDB
https://duckdb.org/docs/guides/overview

試した環境

% sw_vers
ProductName:            macOS
ProductVersion:         14.7
BuildVersion:           23H124
% duckdb -version
v1.1.1 af39bd0dcf