Playing with Elasticsearch from Python

python

elasticsearch

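Somewhere along the way, Elasticsearch overtook Apache Solr in share as a full-text search server^*1, and lately almost every new deployment I do is Elasticsearch. Elasticsearch does not offer much on its own, but the combination with Logstash, Kibana, and Beats is powerful. Both Solr and Elasticsearch are built on Lucene, so the query syntax is shared, which is a relief.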

† Searching from Python

This time I am developing a batch job in Python, so I will try the Elasticsearch client for Python.

Python Elasticsearch Client — Elasticsearch 8.0.0 documentation

Official low-level client for Elasticsearch. Its goal is to provide common ground for all Elasticsearch-related code in Python; because of this it tries to be opinion-free and very extendable.

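The client library is packaged per Elasticsearch version, so install it with something like pip install elasticsearchX (where X is the version). Paging works in its own peculiar way, so rather than handling it by hand, use helpers.scan() as shown below: as the for loop advances, only as much data as is needed gets fetched, one batch after another.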

from elasticsearch7 import Elasticsearch, helpers

es_hosts = ['elasticsearch:9200/']
es_index = 'index-*'
# Lucene range query covering all of 2020 in JST (inclusive on both ends)
lucene_query = '@timestamp:[2020-01-01T00:00:00.000+09:00 TO 2020-12-31T23:59:59.999+09:00]'

es = Elasticsearch(es_hosts, maxsize=20, send_get_body_as='POST')
# scan() returns a generator; it fetches 1000 hits per request and keeps the scroll alive for 5 minutes
resp = helpers.scan(client=es, index=es_index, q=lucene_query, size=1000, scroll='5m', _source_includes=['some_field'])
for hit in resp:
    print(hit["_source"]['some_field'])
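
For reference, here is a rough sketch of the kind of loop that helpers.scan() saves you from writing by hand. It assumes the client's search(), scroll(), and clear_scroll() methods and is deliberately simplified; the real helper does more (error handling, for example). iter_hits and FakeClient are hypothetical names made up for this illustration, and FakeClient is only an in-memory stand-in so the sketch runs without a server.

```python
def iter_hits(client, index, q, size=1000, scroll="5m"):
    """Yield every hit for a query by following the scroll cursor."""
    # The first request opens the scroll context and returns the first page.
    resp = client.search(index=index, q=q, size=size, scroll=scroll)
    scroll_id = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit
            # Each follow-up request returns the next page and a cursor.
            resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context once iteration ends.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)


class FakeClient:
    """Hypothetical in-memory stand-in (not part of the library)."""

    def __init__(self, docs):
        self._docs = docs
        self._pos = 0
        self._size = 0
        self.cleared = False

    def _page(self):
        # Hand out the next slice of documents in the shape of a search response.
        page = self._docs[self._pos:self._pos + self._size]
        self._pos += len(page)
        return {"_scroll_id": "fake-scroll-id",
                "hits": {"hits": [{"_source": d} for d in page]}}

    def search(self, index, q, size, scroll):
        self._pos, self._size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


fake = FakeClient([{"some_field": i} for i in range(5)])
values = [hit["_source"]["some_field"]
          for hit in iter_hits(fake, "index-*", "*", size=2)]
```

With a real Elasticsearch instance you would pass the actual client in place of FakeClient, but in practice helpers.scan() is the simpler choice.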

As for the query, I originally come from Solr, so the Lucene syntax feels easier to me.
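
If you would rather not type the range bounds by hand, the Lucene query string can be assembled from datetime values. This is just a small sketch using the standard library; year_range_query and lucene_ts are hypothetical helper names, and the +09:00 offset and the @timestamp field are assumptions carried over from the example above.

```python
from datetime import datetime, timedelta, timezone

# Assumed time zone: JST (+09:00), matching the query in the example above.
JST = timezone(timedelta(hours=9))


def lucene_ts(dt):
    """Format an aware datetime with millisecond precision and UTC offset."""
    return dt.isoformat(timespec="milliseconds")


def year_range_query(year, field="@timestamp", tz=JST):
    """Build an inclusive Lucene range query covering one calendar year."""
    start = datetime(year, 1, 1, tzinfo=tz)
    # Last millisecond of the year: one millisecond before next year's start.
    end = datetime(year + 1, 1, 1, tzinfo=tz) - timedelta(milliseconds=1)
    return f"{field}:[{lucene_ts(start)} TO {lucene_ts(end)}]"


lucene_query = year_range_query(2020)
```

The resulting string can be passed as the q argument in the same way as a hand-written query.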