Installing ClickHouse

1. Install Docker on the Mac.
2. Point Docker at a registry mirror; http://hub-mirrors.c.163.com is a good choice.
3. Pull the client image:
    docker pull yandex/clickhouse-client
4. Pull the server image:
    docker pull yandex/clickhouse-server
5. Create a dedicated network:
    docker network create ck_net
6. Start a container from the downloaded server image:
    docker run -d --name ck-server --network=ck_net --ulimit nofile=262144:262144 --volume=$HOME/www/clickhouse:/var/lib/clickhouse yandex/clickhouse-server
7. Verify that it started (see also the Docker sketch after step 13):
    ![](https://www.liritian.com/wp-content/uploads/2023/05/391685456754.pic_.jpg)
8. Enter the container: docker exec -it ck-server /bin/bash
9. Start the client: clickhouse-client
10. List the databases: show databases; yep, that's the one, exactly like MySQL.
11. Create a database to play with: create database if not exists jack;
12. Switch to it: use jack; heh, MySQL again.
13. Create a table and take a look:
    CREATE TABLE jack.user
    (
        user_id UInt32,
        message String,
        timestamp DateTime,
        metric Float32
    )
    ENGINE = MergeTree -- note the engine: this is where it differs from MySQL
    PRIMARY KEY (user_id, timestamp)
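A quick aside on steps 3 and 7: the client image pulled in step 3 never actually gets used above (step 9 runs the client bundled inside the server container), and the startup check behind the screenshot is not shown. A minimal sketch of both, assuming the container and network names from steps 5 and 6:

    # confirm the server container is running and peek at its startup log
    docker ps --filter name=ck-server
    docker logs --tail 20 ck-server

    # connect with the standalone client image over the ck_net network;
    # containers on a user-defined network resolve each other by name,
    # so --host ck-server points at the server container
    docker run -it --rm --network ck_net yandex/clickhouse-client --host ck-server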

14. Insert a few rows:
INSERT INTO jack.user (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 )

15. Query it and witness a small miracle:
    select * from user;
    Yeah, you guessed it. Sorting works too:
    select * from user order by timestamp asc (or desc)

16. Now for something a little different: output formats.
    select * from user order by timestamp desc format TabSeparated
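TabSeparated is only one of many built-in formats; as a small sketch, the same query can come back as JSON lines or as a pretty-printed table:

    select * from user order by timestamp desc format JSONEachRow
    select * from user order by timestamp desc format PrettyCompact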

17. Time for something big: the UK property price paid dataset.
    It holds property price data for England and Wales since 1995. Uncompressed it is about 4 GiB, and it needs only around 278 MiB inside ClickHouse. That is well over 20 million rows, so be careful: loading it takes a long time, not least because of the Great Firewall.

  • Create the table:
  • CREATE TABLE uk_price_paid
    (
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
    )
    ENGINE = MergeTree
    ORDER BY (postcode1, postcode2, addr1, addr2);

    • The main event: insert the data.
      INSERT INTO uk_price_paid
      WITH
      splitByChar(' ', postcode) AS p
      SELECT
      toUInt32(price_string) AS price,
      parseDateTimeBestEffortUS(time) AS date,
      p[1] AS postcode1,
      p[2] AS postcode2,
      transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
      b = 'Y' AS is_new,
      transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
      addr1,
      addr2,
      street,
      locality,
      town,
      district,
      county
      FROM url(
      'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv',
      'CSV',
      'uuid_string String,
      price_string String,
      time String,
      postcode String,
      a String,
      b String,
      c String,
      addr1 String,
      addr2 String,
      street String,
      locality String,
      town String,
      district String,
      county String,
      d String,
      e String'
      ) SETTINGS max_http_get_redirects=10;

This streams the data from pp-complete.csv straight into ClickHouse. It takes a very long time... night falls, everyone close your eyes.

  • Still loading…
  • Still loading…
  • Still loading…
  • loading…
  • loading…
  • ing…
  • …

I slept on it, and when I woke up it had finally finished: inserting the 20-odd million rows took close to an hour of waiting. The flowers had long since wilted~
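If the download is painfully slow or keeps breaking, one workaround (a sketch of an alternative, not what I actually did) is to fetch pp-complete.csv onto the host first, for example with wget into $HOME/www/clickhouse/user_files/, which the volume mount from step 6 exposes as the server's default user_files directory, and then run the same INSERT ... SELECT with the file() table function instead of url():

    INSERT INTO uk_price_paid
    WITH splitByChar(' ', postcode) AS p
    SELECT
        toUInt32(price_string) AS price,
        parseDateTimeBestEffortUS(time) AS date,
        p[1] AS postcode1,
        p[2] AS postcode2,
        transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
        b = 'Y' AS is_new,
        transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
        addr1, addr2, street, locality, town, district, county
    FROM file(
        'pp-complete.csv', -- path is resolved relative to the server's user_files directory
        'CSV',
        'uuid_string String, price_string String, time String, postcode String,
         a String, b String, c String, addr1 String, addr2 String, street String,
         locality String, town String, district String, county String, d String, e String'
    );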

  • Quick, let's verify the result
    28 million rows, counted in 0.023s.
    Hee hee, pretty great, right? (The count query is sketched just below.)
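The screenshot behind that number did not survive, but the check is presumably just a plain count (a sketch):

    -- total rows loaded; formatReadableQuantity() just makes the big number easier to read
    SELECT count(), formatReadableQuantity(count()) FROM uk_price_paid;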
  • Check the table's size on disk:
    SELECT formatReadableSize(total_bytes)
    FROM system.tables
    WHERE name = 'uk_price_paid'

About 300 MiB. Looks like the compression is doing its job.

  • A first warm-up: the average price per year
    SELECT
        toYear(date) AS year,
        round(avg(price)) AS price,
        bar(price, 0, 1000000, 80)
    FROM uk_price_paid
    GROUP BY year
    ORDER BY year

It processed 28.21 million rows in 2.571s. I had hoped for under a second, so that is a little disappointing; other articles report second-level results on tens of trillions of rows, so maybe my wife's laptop is just not up to it. Lousy old Mac. Still, it is far better than MySQL: our user table once reached 22 million rows and my application queries could barely get through it.

  • Average price per year in London
    SELECT
        toYear(date) AS year,
        round(avg(price)) AS price,
        bar(price, 0, 2000000, 100)
    FROM uk_price_paid
    WHERE town = 'LONDON'
    GROUP BY year
    ORDER BY year
    ![London house prices per year](https://www.liritian.com/wp-content/uploads/2023/05/551685506521.pic_.jpg)
    Happily this one comes back in about 0.247s.
    Tip: something clearly happened to London prices in 2020! Not that it comes as a surprise...

  • Only buy the most expensive, never merely the most sensible; money is no object. Let's look up the priciest places (London, inevitably) for Auntie Wang:
    SELECT
        town,
        district,
        count() AS c,
        round(avg(price)) AS price,
        bar(price, 0, 5000000, 100)
    FROM uk_price_paid
    WHERE date >= '2020-01-01'
    GROUP BY
        town,
        district
    HAVING c >= 100
    ORDER BY price DESC
    LIMIT 10;
    (Screenshot: the top ten. Only the top ten, because the top-hundred screenshot was too big to capture.)
    Ouch, 0.675s. Not a great showing. Is there a way to make it faster?

  • There is: build a projection.
    Let's create an aggregate projection over the dimensions toYear(date), district and town:
    ALTER TABLE uk_price_paid
    ADD PROJECTION projection_by_year_district_town
    (
        SELECT
            toYear(date),
            district,
            town,
            avg(price),
            sum(price),
            count()
        GROUP BY
            toYear(date),
            district,
            town
    )
    I ran the query again at this point and got 0.746s, which is even slower than before. That is because the projection has not been filled for the data already in the table: without a MATERIALIZE step, ClickHouse only builds the projection for newly inserted rows. So let's populate it for the existing data:

ALTER TABLE uk_price_paid
MATERIALIZE PROJECTION projection_by_year_district_town
SETTINGS mutations_sync = 1

Time for a test: re-run the three queries, the average price per year, the average price per year in London, and the most expensive districts.
The new timings are 0.342s, 0.934s and 0.822s, against 2.571s, 0.247s and 1.010s before, so apart from the London-per-year query the timings all came down a fair bit.
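Rather than relying on wall-clock timings alone, you can check whether the projection was actually used. A hedged sketch: reasonably recent ClickHouse versions record the projections a query touched in system.query_log (older releases may not have the projections column at all):

    SELECT
        event_time,
        query_duration_ms,
        projections,   -- should mention projection_by_year_district_town when it was used
        query
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND query ILIKE '%uk_price_paid%'
    ORDER BY event_time DESC
    LIMIT 10;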
