arXiv	https://arxiv.org/abs/2412.03555
Paper license	http://creativecommons.org/licenses/by/4.0/

PaliGemma 2:
A Family of Versatile VLMs for Transfer

Andreas Steiner André Susano Pinto Michael Tschannen Daniel Keysers Xiao Wang Yonatan Bitton Alexey Gritsenko Matthias Minderer Anthony Sherbondy Shangbang Long Siyang Qin Reeve Ingle Emanuele Bugliarello Sahar Kazemzadeh Thomas Mesnard Ibrahim Alabdulmohsin Lucas Beyer Xiaohua Zhai

Abstract

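PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B model up to the 27B model. We train these models at three resolutions (224px², 448px², and 896px²) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models, covering different model sizes and resolutions, allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma, including different OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.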

1 Introduction

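PaliGemma [9] is a 3B vision-language model (VLM) that combines the SigLIP [108] vision encoder with the 2B Gemma language model [21]. It matches the performance of much larger prior VLMs built from a variety of vision encoders and language models. We now upgrade PaliGemma by replacing its language model component with the more recent and more capable language models from the Gemma 2 family [22], producing new PaliGemma 2 base VLMs at three sizes (3B, 10B, 28B) and three resolutions (224px², 448px², 896px²). To equip these VLMs with broad capabilities, we use the same three-stage training recipe as PaliGemma. The resulting models are designed to be fine-tuned. When evaluated on the 30+ transfer tasks considered in [9] (which include common captioning and VQA tasks as well as some video and referring expression tasks), PaliGemma 2 slightly outperforms PaliGemma at the same resolution and model size, and improves substantially at larger model sizes. We release the PaliGemma 2 VLMs as open-weight models that can serve as drop-in replacements for PaliGemma.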

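Having at hand a family of models that are all derived from comparable building blocks and trained according to the same recipe allows us to analyze the effect of model size and resolution on downstream performance in a controlled setting (see Section 4.1). For example, while almost every task benefits from added compute, we identify which transfer tasks benefit more from compute added through increased resolution and which benefit more from compute added through a larger, more capable language model. We also show that larger models tend to have a lower optimal transfer learning rate.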

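We also explore new tasks that were not explored in depth in [9], including text detection and recognition (Section 4.2), table structure recognition (Section 4.3), molecular structure recognition (Section 4.4), optical music score recognition (Section 4.5), long caption generation (Section 4.6), spatial reasoning (Section 4.7), and radiography report generation (Section 4.8). PaliGemma 2 obtains state-of-the-art results on many of these tasks. Finally, we benchmark and analyze low-precision variants of PaliGemma 2 for on-device deployment on CPU (Section 4.9).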

2 Related work

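Over the last few years, VLMs have evolved rapidly from simple dual-encoder (contrastive) [77, 31, 108] or encoder-decoder (captioning) [98, 20, 93, 94] designs to more capable designs that combine a pretrained vision encoder with a pretrained language model [4, 96, 72, 48, 5, 14, 16, 103]. Broadly, three paradigms are used to transfer these models: zero-shot, few-shot, and fine-tuning. Another recent trend is "instruction tuning", which aims to make the models more user-friendly [54, 18].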

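Several previous works [45, 66, 92, 109, 35, 9, 34, 19] have investigated the effect of scaling VLMs along different axes such as training data and compute, resolution, model size, and the quality of components, in particular the vision encoder. However, to our knowledge no prior work jointly studies the effect of image resolution and language model size on transfer via fine-tuning. In particular, prior works that rely on different language model sizes often use models with different architectures and training recipes from different labs, e.g. [92, 35] (with the notable exception of [47]).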

3 Model

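Figure 1: PaliGemma 2 processes a 224px² / 448px² / 896px² image with a SigLIP-400m encoder with patch size 14px², which yields 256 / 1024 / 4096 tokens. After a linear projection, the image tokens are concatenated with the input text tokens, and Gemma 2 autoregressively completes this prefix with an answer.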
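The token counts in the Figure 1 caption follow directly from the patch size. The short sketch below reproduces that arithmetic; the function name `num_image_tokens` is purely illustrative and is not taken from the PaliGemma code base.

```python
def num_image_tokens(resolution_px: int, patch_px: int = 14) -> int:
    """Number of non-overlapping square patches (one token each) in a square image."""
    if resolution_px % patch_px != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    patches_per_side = resolution_px // patch_px
    return patches_per_side * patches_per_side


# The three PaliGemma 2 input resolutions.
for res in (224, 448, 896):
    print(f"{res}px² -> {num_image_tokens(res)} image tokens")
```

Doubling the side length quadruples the number of image tokens that the language model has to process, which is why resolution has such a strong effect on compute.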

				Training cost / example
	Vision Encoder	LLM	Params.	224px²	448px²	896px²
PaliGemma 2 3B		Gemma 2 2B	3.0B	1.0	4.6	$\sim$ 23.5
PaliGemma 2 10B	SigLIP-So400m	Gemma 2 9B	9.7B	3.7	18.3	$\sim$ 67.7
PaliGemma 2 28B		Gemma 2 27B	27.7B	18.9	63.5	$\sim$ 155.6

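Table 1: The vision encoder parameter count is small compared to the LLM, but the compute is dominated by the vision tokens in the LLM. The last three columns show the relative training cost per example (as measured in our pre-training setup). Models are trained on Cloud TPUv5e [24], except the 28B model at 896px², which is trained on TPUv5p, for which we assume a speed-up of $2.3\times$ per chip.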

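We follow exactly the same modeling, training, and data setup as PaliGemma [9] and briefly summarize the most important aspects here. We use the same pretrained SigLIP-So400m vision encoder [108, 3] and map its (sequence of) embeddings to the Gemma 2 input space with a linear projection. The visual embeddings are combined with a text prompt and fed to the Gemma 2 language model (prefill). Predictions are then obtained by autoregressively sampling from the language model (see Figure 1).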

We pretrain PaliGemma 2 in three stages (with stage 0 corresponding to the unimodal pretraining of the components, see [108] and [21]).

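• Stage 1 combines the pretrained SigLIP-So400m and Gemma 2 checkpoints (raw checkpoints, without post-training steps) and trains them jointly on a multimodal task mixture of 1 billion examples designed to enable transfer to a wide range of tasks via fine-tuning. The image resolution is 224px², and no parameters are frozen during this stage.

• Stage 2 first trains for 50 million examples at resolution 448px² and then for 10 million examples at resolution 896px². The task mixture has the same components, but tasks that benefit from high resolution are upweighted and the output sequence length is increased (for example, to promote learning OCR for long sequences of visual text).

• Stage 3 fine-tunes the checkpoints from stage 1 or 2 (depending on the resolution) to the target task. PaliGemma considered a range of academic benchmarks, including some involving multiple images and short videos. We consider the same set of benchmarks here (exploring the same set of hyperparameters from [9, Sec. 3.2.4]). In addition, we explore new applications involving document-related tasks, long caption generation, and medical image understanding.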

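Following [22], we apply logit soft-capping [6] to the attention and output logits in the Gemma 2 component, with the same parameters as [22], in Stages 1 and 2, but not in Stage 3, as this led to worse results for some transfer tasks. Further, we use the Adam optimizer [42] with default hyperparameters throughout, and adjust the learning rate based on the model size in Stages 1 and 2. Specifically, we multiply the learning rate of $2\cdot 10^{-5}$ used in Stages 1 and 2 for PaliGemma by 0.5 for PaliGemma 2 3B and by 0.25 for PaliGemma 2 10B and 28B.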
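As a minimal sketch of the two ingredients described above, the snippet below shows a common tanh-based formulation of logit soft-capping and the size-dependent learning rates. The cap value of 30.0 used in the example is an assumption made for illustration (the text only says the parameters are those of [22]), and the names `soft_cap`, `LR_MULTIPLIER`, and `pretraining_lr` are hypothetical.

```python
import math

def soft_cap(logit: float, cap: float) -> float:
    """Smoothly squash a logit into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

# Learning rate used for PaliGemma in Stages 1 and 2 (from the text).
BASE_LR = 2e-5
# Per-size multipliers for PaliGemma 2 (from the text).
LR_MULTIPLIER = {"3B": 0.5, "10B": 0.25, "28B": 0.25}

def pretraining_lr(model_size: str) -> float:
    """Stage 1/2 learning rate for a given PaliGemma 2 size."""
    return BASE_LR * LR_MULTIPLIER[model_size]

# Small logits pass through almost unchanged; large ones saturate near the cap.
# The cap of 30.0 is an illustrative assumption, not a value stated in the text.
for x in (0.5, 10.0, 100.0):
    print(x, round(soft_cap(x, cap=30.0), 3))

for size in ("3B", "10B", "28B"):
    print(size, pretraining_lr(size))
```

The multipliers give 1e-5 for the 3B model and 5e-6 for the 10B and 28B models.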

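For details on the training data mixture we refer to [9, Sec. 3.2.5] and provide only a brief summary here. The mixture involves captioning, grounded captioning (as in [94]), OCR, different machine-generated visual question answering (VQA) tasks [11, 75], detection [13], and instance segmentation [15]. Many of the corresponding labels are machine generated, mostly relying on publicly available specialist models (see [9, Sec. 3.2.5]). No large commercial VLM is used to generate labels, in contrast to what is common in other open VLMs such as LLaVA [54].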

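Like PaliGemma, we train the PaliGemma 2 models on Cloud TPUv5e Pod slices [24] (except for the 28B model at 896px², which uses TPUv5p) of 256 to 1024 chips, using a fully-sharded data-parallel (FSDP [110, 8]) sharding strategy. PaliGemma 2 3B has roughly the same training cost as PaliGemma (3 days for Stage 1 using 256 chips). The cost of the other variants and resolutions can be inferred from Table 1. It is worth noting that increasing the resolution incurs an additional cost similar to that of increasing the language model size.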
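To illustrate how costs can be inferred from Table 1, the sketch below converts the relative per-example costs into rough chip-day estimates. It reads the 3B model at 224px² as the unit cost (1.0) and assumes that cost scales linearly with the Table 1 factor and with the number of examples; both the linear scaling and the helper name `chip_days` are assumptions made for illustration. Note that the 896px² entry for the 28B model was measured on different hardware, so its estimate is particularly rough.

```python
# Relative training cost per example, read from Table 1
# (unit: one example for PaliGemma 2 3B at 224px²).
RELATIVE_COST = {
    ("3B", 224): 1.0, ("3B", 448): 4.6, ("3B", 896): 23.5,
    ("10B", 224): 3.7, ("10B", 448): 18.3, ("10B", 896): 67.7,
    ("28B", 224): 18.9, ("28B", 448): 63.5, ("28B", 896): 155.6,
}

# Example counts per stage, as stated in Section 3.
STAGE1_EXAMPLES = 1_000_000_000
STAGE2_EXAMPLES = {448: 50_000_000, 896: 10_000_000}

# Stated in the text: Stage 1 of the 3B model takes 3 days on 256 chips.
STAGE1_3B_CHIP_DAYS = 3 * 256

def chip_days(model: str, resolution: int, examples: int) -> float:
    """Rough TPUv5e chip-days, assuming linear scaling (an assumption)."""
    relative = RELATIVE_COST[(model, resolution)] * examples / STAGE1_EXAMPLES
    return relative * STAGE1_3B_CHIP_DAYS

for model in ("3B", "10B", "28B"):
    stage1 = chip_days(model, 224, STAGE1_EXAMPLES)
    stage2 = sum(chip_days(model, res, n) for res, n in STAGE2_EXAMPLES.items())
    print(f"{model}: Stage 1 ~{stage1:.0f} chip-days, Stage 2 ~{stage2:.0f} chip-days")
```

Under these assumptions, each of the two Stage 2 phases of the 3B model costs a little under a quarter of its Stage 1, even though Stage 2 sees far fewer examples.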

4 Experiments

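In addition to the broad range of transfer tasks considered in [9], we also consider new tasks involving text detection and recognition (Section 4.2), table structure recognition (Section 4.3), molecular structure recognition (Section 4.4), optical music score recognition (Section 4.5), long caption generation (Section 4.6), spatial reasoning (Section 4.7), and radiography report generation (Section 4.8).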

We provide examples for each new task in Appendix A and transfer details in Appendix B.

4.1 Investigating model size and resolution

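To study the effect of model size and resolution on task performance, we fine-tune the three model variants (3B, 10B, and 28B) at two resolutions (224px² and 448px²) on the 30+ academic benchmarks used by [9]. These benchmarks cover a broad range of captioning, VQA, and referring segmentation tasks on natural images, documents, infographics, and videos. We reuse the optimal hyperparameters from the earlier PaliGemma work and only sweep the learning rate $\{0.03,0.06,0.1,0.3,0.6,1.0,3.0\}\cdot 10^{-5}$ for every model size. Since the earlier work used the same hyperparameters for 224px² and 448px² for most tasks, we only sweep at 224px² resolution and reuse the selection for both resolutions. For each model size and task we select the best learning rate based on the respective validation split, then retrain the model and report test metrics. Complete results are available in Table 13.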
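The selection procedure described above can be sketched as follows. The validation scores in the example are made-up placeholders, not results from the paper, and `select_learning_rate` is a hypothetical helper that assumes a higher score is better.

```python
# Learning-rate grid from the text: {0.03, ..., 3.0} * 1e-5.
LR_GRID = [m * 1e-5 for m in (0.03, 0.06, 0.1, 0.3, 0.6, 1.0, 3.0)]

def select_learning_rate(validation_scores: dict) -> float:
    """Return the learning rate with the best validation score (higher is better)."""
    return max(validation_scores, key=validation_scores.get)

# Hypothetical validation scores for one (model size, task) pair at 224px².
fake_scores = dict(zip(LR_GRID, (70.1, 71.4, 72.0, 72.6, 72.2, 71.0, 65.3)))

best_lr = select_learning_rate(fake_scores)
print(f"selected learning rate: {best_lr:.1e}")
# The selected value would then be reused for the 448px² run of the same
# model size and task, followed by retraining and evaluation on the test split.
```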

4.1.1 Effect on task performance

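Increasing the image resolution and increasing the LM size both increase the FLOPs spent on prediction (and training, see Table 1) of a PaliGemma 2 model. We therefore generally expect most tasks to benefit from both changes. On the other hand, some tasks might benefit mainly from more detail in the input (higher resolution), while others might benefit mainly from the improved language understanding and increased world knowledge of a larger LM. To get a more fine-grained understanding of these aspects, Figure 3 visualizes the relative improvement in transfer metrics when equipping PaliGemma 2 3B (224px²) with either the bigger 9B LM while keeping the resolution ($3.7\times$ more FLOPs), or keeping the model size but increasing the resolution to 448px² ($4.6\times$ more FLOPs).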

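As expected, most tasks benefit similarly from increased resolution and increased model size (green markers). There is a group of tasks (yellow markers) focused on text, document, screen, and chart understanding that benefit mainly from increased resolution. The images in the corresponding benchmarks often have a native resolution substantially larger than 224px², which is consistent with this observation. Another group of tasks (blue markers) benefits mostly from an increase in LM size. Some of these tasks involve multilingual data (XM3600 (avg35)) or require advanced visual reasoning (AI2D, CountBenchQA, NLVR2).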

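Figure 4 provides additional detail on the scaling behavior as a function of resolution and model size. Compared to increasing the model size from 3B to 10B, increasing it further to 28B often leads to only moderate improvements, or no improvement at all. Using the largest PaliGemma 2 can thus be useful when one wants the best possible performance and has no compute or latency constraints. A factor possibly related to the relatively worse transferability of PaliGemma 2 28B is that the underlying Gemma 2 27B model is trained from scratch, whereas the 2B and 9B models are distilled [22, Sec. 6.1].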

4.1.2 Model size and transfer learning rate

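Figure 5 visualizes the (normalized) task performance as a function of the transfer learning rate. As a general trend, we observe that the optimal learning rate for larger models tends to be lower than for smaller models (diagonal patterns in the heat map). We therefore recommend sweeping smaller learning rates when increasing model size. Additionally, we find that the new PaliGemma 2 3B generally has a smaller optimal transfer learning rate than PaliGemma.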

4.1.3 Using Gemma 2 instead of Gemma 1

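We also compare with PaliGemma in Table E. At the same resolution and model size (i.e. 3B), PaliGemma 2 models perform slightly better than the corresponding PaliGemma models. Averaged over the 30+ academic benchmarks, the scores are 0.65 higher at 224px² and 0.85 higher at 448px².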

4.2 Text detection and recognition

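We apply PaliGemma 2 to advanced OCR, which involves localizing and recognizing individual words in images. Specifically, the outputs are pairs of {transcription, bounding box}. Following the HierText competition [57], we use word-level precision, recall, and F1 as metrics. A word result is considered a true positive if its IoU with the ground-truth bounding box is at least 0.5 and the transcription matches the ground truth. Note that the HierText protocol does not normalize letter case or punctuation and does not filter based on text length; predictions are compared directly against the ground truth.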
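The matching rule above can be sketched as follows. This is a simplified illustration rather than the official HierText evaluation code: it assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` and a greedy one-to-one matching, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def word_prf(predictions, ground_truth, iou_threshold=0.5):
    """Word-level precision, recall, F1 for lists of (transcription, box) pairs."""
    unmatched = list(ground_truth)
    true_positives = 0
    for text, box in predictions:
        for i, (gt_text, gt_box) in enumerate(unmatched):
            # Exact string comparison: no case or punctuation normalization.
            if text == gt_text and iou(box, gt_box) >= iou_threshold:
                true_positives += 1
                del unmatched[i]  # each ground-truth word is matched at most once
                break
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gt = [("Hello", (0, 0, 10, 10)), ("World", (20, 0, 30, 10))]
pred = [("Hello", (0, 0, 10, 9)), ("world", (20, 0, 30, 10))]
print(word_prf(pred, gt))  # "world" differs in case, so it does not count
```

Because letter case is not normalized, the second prediction is counted as a false positive even though its box is a perfect match.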

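We fine-tune PaliGemma 2 on a mixture of the train splits of ICDAR'15 [36], Total-Text [17], MLT17 and MLT19 [68], HierText [56], TextOCR [84], and IntelOCR [44], and evaluate on the ICDAR'15 and Total-Text test sets, which are the most commonly used OCR benchmarks. Table 2 shows the results: PaliGemma 2 3B at 896px² outperforms the state-of-the-art HTS [58]. We emphasize that this result is obtained simply by fine-tuning a general-purpose VLM, without relying on the task-specific architecture components that are ubiquitous in the OCR literature. This highlights PaliGemma 2's versatile interface and shows the benefit of OCR-related pretraining in Stages 2 and 3. We also tried reducing the resolution, which led to substantially lower prediction quality, while increasing the model size did not lead to improvements.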

	ICDAR’15 Incidental			Total-Text
	P	R	F1	P	R	F1
HTS	81.9	68.4	74.5	75.7	69.4	72.4
PaliGemma 2 3B 896px²	81.9	70.7	75.9	73.8	74.5	74.2

Table 2: Text detection and recognition performance: the 896px² PaliGemma 2 model outperforms the state-of-the-art model HTS [58] on ICDAR'15 Incidental and Total-Text, under the evaluation protocol of HierText [57].

	FinTabNet				PubTabNet
	S-TEDS	TEDS	GriTS-Top	GriTS-Con	S-TEDS	TEDS	GriTS-Top	GriTS-Con
SOTA	98.9	98.2	99.0	98.6	97.9	96.9	-	-
PaliGemma 2 3B 896px²	99.2	98.9	99.4	99.2	97.6	97.3	98.0	97.8

Table 3: PaliGemma 2 results for table structure recognition on FinTabNet [111] and PubTabNet [112], compared to the state of the art. The reference metrics are from [28, 86, 60, 38].

4.3 Table structure recognition

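The goal of table structure recognition is to extract table text content, the corresponding bounding box coordinates, and the table structure in HTML format from document images. To transfer PaliGemma 2 to this task we fine-tune on the training splits of two popular datasets: PubTabNet [112], containing 516k images of tabular data from the PubMed Central Open Access Subset (commercial use collection), and FinTabNet [111], consisting of 113k financial report tables from annual reports of S&P 500 companies. We remove examples with obviously corrupted ground truth (e.g. a bounding box extending outside the image frame) from the training data and additionally apply the refinements from [86] to FinTabNet. Images are resized to the target resolution while preserving the aspect ratio, and padded to a square size to match the target input resolution.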
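The resize-and-pad step can be sketched with plain arithmetic, as below. The text does not say where the padding is placed, so this sketch assumes it is added on the right and bottom, which keeps box coordinates a pure rescaling; the function names are hypothetical.

```python
def resize_and_pad_geometry(width: int, height: int, target: int):
    """Return (new_width, new_height, pad_right, pad_bottom) for a square target.

    The longer side is scaled to `target`; the aspect ratio is preserved.
    Padding on the right/bottom is an assumption made for this sketch.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h, target - new_w, target - new_h

def rescale_box(box, width: int, height: int, target: int):
    """Map a box (x_min, y_min, x_max, y_max) into the resized image."""
    scale = target / max(width, height)
    return tuple(coord * scale for coord in box)

# A 1792x896 table image mapped to the 896px² input resolution.
print(resize_and_pad_geometry(1792, 896, 896))           # geometry of the resized image
print(rescale_box((100, 50, 300, 150), 1792, 896, 896))  # cell box in the new frame
```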

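We assess model quality with Tree Edit Distance Similarity (TEDS) [112] and Grid Table Similarity (GriTS) [85], two families of metrics that measure cell text content, cell topology/structure, and bounding box quality. PaliGemma 2 sets a new state of the art for most of these metrics (Table 3). We further tried increasing the model size, which did not yield additional benefits, and using a lower image resolution, which led to a small regression in quality.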

4.4 Molecular structure recognition

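We explore PaliGemma 2 for molecular structure recognition, the task of inferring the molecule graph structure (represented as a SMILES string [99]) from molecular drawings. As training data we use 1 million molecules from the PubChem dataset [41], rendered using the Indigo toolkit [71] and augmented with a variety of drawing styles and random perturbations, following MolScribe [76]. We then evaluate on the same evaluation set as [76], consisting of 5.7k synthetic molecule images rendered with the ChemDraw library. We use the exact match percentage as the metric, shown in Table 4. PaliGemma 2 outperforms the state-of-the-art MolScribe when using 448px² resolution; further increasing the resolution did not lead to a higher exact match percentage.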

4.5 Optical music score recognition

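We apply PaliGemma 2 to optical music score recognition: translating images of single-line pianoform scores into their digital score representation in the **kern format (https://www.humdrum.org/rep/kern/). The **kern representation encodes pitch and duration along with other common score-related information such as articulation and barlines.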

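We use the GrandStaff dataset [79], which contains 53.7k images, and employ the official train, validation, and test splits. During training we use both the original images and synthetically augmented versions. Evaluation is done on the original images without distortion. The metrics are the same as in [80] and are based on the normalized mean edit distance. More specifically, the Character Error Rate (CER) counts errors at the character level, the Symbol Error Rate (SER) measures errors at the symbol level (combining multiple characters), and the Line Error Rate (LER) is based on full lines in the **kern encoding.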
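The three error rates differ only in the unit over which the edit distance is computed. The sketch below illustrates this with a plain Levenshtein distance; splitting symbols on whitespace and normalizing by the total reference length are assumptions of this sketch, and the exact tokenization and normalization in [80] may differ.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(refs, hyps, tokenize) -> float:
    """Total edit distance divided by total reference length, in percent."""
    distance = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    length = sum(len(tokenize(r)) for r in refs)
    return 100.0 * distance / length

def cer(refs, hyps): return error_rate(refs, hyps, list)            # characters
def ser(refs, hyps): return error_rate(refs, hyps, str.split)       # whitespace-separated symbols
def ler(refs, hyps): return error_rate(refs, hyps, str.splitlines)  # full lines

# Toy two-line example with a single wrong pitch in the last symbol.
reference = ["4c\t4e\n4d\t4f"]
prediction = ["4c\t4e\n4d\t4g"]
print(cer(reference, prediction), ser(reference, prediction), ler(reference, prediction))
```

A single wrong character affects one character, one symbol, and one whole line, so the three rates grow in that order on the toy example.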

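The results are shown in Table 5 along with those of the current state-of-the-art method [80]. The error rates decrease with increasing resolution, with the best error rates obtained at 896px² resolution. Increasing the model size from 3B to 10B did not lead to further error reduction.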

4.6 Generating long, fine-grained captions

	Full Match $\uparrow$
MolScribe [76]	93.8
PaliGemma 2 10B 448px²	94.8

Table 4: PaliGemma 2 performance for molecule structure recognition on ChemDraw data [76].

	CER $\downarrow$	SER $\downarrow$	LER $\downarrow$
Sheet Music Tr. [80]	3.9	5.1	13.1
PaliGemma 2 3B 896px²	1.6	2.3	6.7

Table 5: PaliGemma 2 performance for music score recognition on the GrandStaff data set [80]. Character Error Rate (CER), Symbol Error Rate (SER), and Line Error Rate (LER) in [%].

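Generating long image captions with fine-grained detail has many use cases in multimodal learning, for example to train text-to-image generation models with good controllability [105, 7]. To adapt PaliGemma 2 for this task we fine-tune on the DOCCI (Descriptions of Connected and Contrasting Images) [69] data set, which contains 15,000 images with detailed human-annotated English descriptions with an average length of 7.1 sentences (639 characters, 136 words). The descriptions cover object spatial relations, object counting, text rendering, world knowledge, and more.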

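We first fine-tune PaliGemma 2 on DOCCI's train split, exploring the hyperparameter range suggested in [9, Sec. 3.2.4]. We select the most performant models by perplexity scores based on the test split, and generate image captions on the 100-image qual_dev split with a maximum decoding length of 192 tokens. We then conduct human evaluations assessing whether each generated sentence is factually aligned with (entailed by) the image content (see Appendix B.5 for details on the evaluation protocol). Based on these evaluations we select the most factually aligned models and retrain them on the union of the train and test splits, followed by another round of human evaluation (on the qual_dev split). The results, shown in Table 6, indicate that the fine-tuned PaliGemma 2 model produces more factually aligned sentences than many popular VLMs, which are often instruction-tuned on high-quality captioning sets $10$ to $100\times$ larger than the one used for PaliGemma 2. Unsurprisingly, we observe that increasing model size and resolution both improve factual alignment.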

	#par.	#char.	#sent.	NES $\downarrow$
MiniGPT-4	7B	484	5.6	52.3
mPLUG-Owl2	8B	459	4.4	48.4
InstructBLIP	7B	510	4.0	42.6
LLaVA-1.5	7B	395	4.2	40.6
VILA	7B	871	8.6	28.6
PaliGemma	3B	535	8.9	34.3
PaLI-5B	5B	1065	11.3	32.9
PaliGemma 2 448px²	3B	529	7.7	28.4
PaliGemma 2 448px²	10B	521	7.5	20.3

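Table 6: PaliGemma 2 results for long captioning on the DOCCI data [69]. Pali* models are fine-tuned on DOCCI at 448px²; the other baselines are instruction-tuned on a broad range of tasks. We report the average prediction length in characters and sentences, and the percentage of Non-Entailment Sentences (NES), which measures factual inaccuracies.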

4.7 Spatial reasoning

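VLMs like PaliGemma 2 obtain strong performance in vision-language tasks that involve localizing objects, such as referring expression comprehension and segmentation [15, 104, 94, 9]. These tasks and the associated benchmarks often rely on machine-generated annotations and are blind to complex failure modes, e.g. those involving negations.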

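The Visual Spatial Reasoning (VSR) benchmark [53] is designed to overcome these issues, and we use it here to assess the spatial reasoning capabilities of PaliGemma 2. It is formulated as a classification task in which a model needs to determine whether a statement about the spatial relationship of objects in the image is correct. To use PaliGemma 2's flexible text interface we frame this benchmark as a QA task with True / False answers. The results in Table 7 show that PaliGemma 2 outperforms previous fine-tuned models, and that fine-tuning also provides a significant improvement over InstructBLIP [18], a strong zero-shot model from the literature. We observe significant benefits from larger model size, indicating benefits from improved language understanding, whereas going beyond a resolution of 224px² did not lead to improvements.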

	zs. split	rand. split
Human [53]	95.4
InstructBLIP (zs.) [18]	65.6	-
LXMERT [89]	70.1	61.2
PaliGemma 2 3B 224px²	74.8	81.6
PaliGemma 2 10B 224px²	79.8	86.8

Table 7: PaliGemma 2 accuracy on VSR [53] on the zero-shot and random test splits. We show a fine-tuned (LXMERT) and a zero-shot (InstructBLIP) baseline from the literature.

4.8 Radiography report generation

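To explore the capabilities of PaliGemma 2 models in the medical domain, we apply them to automatic chest X-ray report generation, which can be cast as a (long) captioning task on X-ray images. We fine-tune PaliGemma 2 on the MIMIC-CXR dataset [33, 23], which contains 377k images (originating from 228k radiographic studies at the Beth Israel Deaconess Medical Center in Boston) with free-text radiology reports. We use the same train, validation, and test splits as [90]. To improve quality, we use an LLM (Gemini 1.5 Pro) to remove mentions of prior X-rays, as the model does not have access to those.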

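We measure the RadGraph F1 score [30], which is the F1 score between the entities extracted with RadGraph from the reference report and from the generated one. RadGraph takes into account the absence or presence of findings in the report, as well as their relationships to image features. Results are reported on test data held out during training and tuning.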
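To make the idea of an entity-level F1 concrete, the sketch below computes F1 between two sets of extracted items. It is a deliberate simplification: the real metric relies on the RadGraph model to extract entities and relations from the reports, which is not reproduced here, and the `(entity, status)` tuples and the name `entity_f1` are hypothetical stand-ins.

```python
def entity_f1(reference, generated) -> float:
    """F1 between two collections of hashable entity descriptions."""
    reference, generated = set(reference), set(generated)
    if not reference and not generated:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(reference & generated)
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Hypothetical entities, as if already extracted from two reports.
reference_entities = {("effusion", "present"), ("pneumothorax", "absent"),
                      ("cardiomegaly", "present")}
generated_entities = {("effusion", "present"), ("pneumothorax", "absent"),
                      ("edema", "present")}
print(round(entity_f1(reference_entities, generated_entities), 3))
```

Keeping the presence or absence status inside each tuple means that a report stating the opposite finding is penalized rather than rewarded for mentioning the same term.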

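Table 8 shows the performance of PaliGemma 2 models along with baselines from the literature. PaliGemma 2 obtains a state-of-the-art RadGraph score. Increasing resolution and model size both lead to modest improvements.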

4.9 CPU inference and quantization

	C $\uparrow$	B $\uparrow$	R $\uparrow$	F1 $\uparrow$
Flamingo-CXR [90]	13.8	10.1	29.7	20.5
Med-Gemini-2D [102]	17.5	20.5	28.3	24.4
PaliGemma 2 3B 896px²	19.9	14.6	31.9	28.8
PaliGemma 2 10B 896px²	17.4	15.0	32.4	29.5

Table 8: PaliGemma 2 performance for radiography report generation on the MIMIC-CXR data [33, 23]. We report CIDEr (C), BLEU4 (B), Rouge-L (R), and RadGraph F1 scores [%] [30] (a clinical metric).

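In some cases we may want to run inference of PaliGemma 2 on devices without accelerators. We are interested in the resulting runtimes and quality when running inference on CPUs, and briefly present here experiments using the gemma.cpp framework (https://github.com/google/gemma.cpp). gemma.cpp is a lightweight, portable C++ inference engine that supports 8-bit switched-floating-point quantization (alternative options for CPU inference include llama.cpp (https://github.com/ggerganov/llama.cpp) and XNNPack (https://github.com/google/XNNPACK)).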

		Walltime [s]			Tokens/sec
Processor	Threads	ViT	Prefill	Extend	Prefill	Extend
Apple M1 Max	4+1	1.60	8.2	0.90	32	12
Apple M3 Pro	7+1	0.80	4.4	0.50	59	22
AMD Milan	8+1	0.82	4.9	0.64	53	17
AMD Milan	32+1	0.39	1.8	0.34	144	32
AMD Genoa	8+1	0.36	1.8	0.29	147	37
AMD Genoa	32+1	0.17	0.8	0.27	323	41

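Table 9: CPU-only inference speed measurements with the gemma.cpp-based implementation on different architectures. Inference of fine-tuned PaliGemma 2 3B (224px²) with greedy decoding. Prefill is done with 260 tokens and is followed by 11 calls to extend during decoding.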

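To evaluate CPU-only inference speed, we run PaliGemma 2 inference with gemma.cpp on four different architectures. We use a checkpoint of PaliGemma 2 3B (224px²) fine-tuned on COCOcap and the example image for PaliGemma in gemma.cpp. The prompt "describe this image" results in a prefill length of $256+4=260$ tokens (for image + text). The output response "A large building with two towers on the water" consists of 11 tokens. All runs used batch size 1. The results are presented in Table 9 and give an overview of what can be expected on different processors (for this particular setting).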
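The throughput columns of Table 9 are related to the wall-time columns by simple division, as the sketch below shows for the Apple M1 Max row. Because the published wall times are rounded, recomputed throughputs can differ slightly from the table for some rows; the helper name `tokens_per_second` is illustrative.

```python
PREFILL_TOKENS = 256 + 4   # image tokens + prompt tokens, from the text
GENERATED_TOKENS = 11      # length of the output response, from the text

def tokens_per_second(num_tokens: int, walltime_s: float) -> float:
    """Throughput for a phase that processes num_tokens in walltime_s seconds."""
    return num_tokens / walltime_s

# Apple M1 Max row of Table 9: prefill 8.2 s, extend 0.90 s.
prefill_tps = tokens_per_second(PREFILL_TOKENS, 8.2)
extend_tps = tokens_per_second(GENERATED_TOKENS, 0.90)
print(f"prefill: {prefill_tps:.1f} tokens/s, extend: {extend_tps:.1f} tokens/s")
```

Prefill processes all 260 tokens in one batch, which is why its throughput is much higher than that of token-by-token decoding.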

	COCOcap	TextCaps	AI2D	OKVQA	DocVQA(val)
Jax, F32, 12.1GB	140.0	126.3	75.4	64.0	39.8
gemma.cpp, quantized, 4.0GB	139.8	126.6	75.6	64.1	39.8
relative metric values [%]	99.9	100.2	100.1	100.1	99.9

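Table 10: Quality comparison between Jax/f32 inference on TPU and quantized gemma.cpp-based inference on CPU. Inference of one fine-tuned PaliGemma 2 3B (224px²) run. The noticeable differences to the Jax version in Table 13 are the result of using greedy decoding for COCOcap and TextCaps. Relative numbers are based on metric values before rounding to one decimal.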

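From evaluations of PaliGemma [9] we already know that going from 32-bit floating point (f32) to 16-bit (bf16) weights is possible without a loss of quality. Here we compare to the gemma.cpp mixed quantization. Table 10 shows a quality comparison for five of the fine-tuning datasets (chosen to cover a variety of tasks). We fine-tuned PaliGemma 2 3B (224px²) once for each of these five datasets. (The noticeable differences to the Jax version in Table 13 are the result of using greedy decoding for COCOcap and TextCaps.) We then evaluated the resulting checkpoints both in Jax and in gemma.cpp after quantization. The relative quality after quantization shows no practical quality difference.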
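As a quick check of how the relative values in Table 10 are formed, the sketch below recomputes the ratio for COCOcap and the size reduction of the checkpoint. It uses the rounded values printed in the table, whereas the paper computes the ratios before rounding, so recomputed values for other columns may differ in the last digit; the helper name `relative_percent` is illustrative.

```python
def relative_percent(quantized: float, reference: float) -> float:
    """Quantized metric as a percentage of the f32 reference metric."""
    return 100.0 * quantized / reference

# COCOcap row entries of Table 10 (rounded values as printed).
print(round(relative_percent(139.8, 140.0), 1))

# Checkpoint sizes from the row labels: 12.1GB (f32) vs 4.0GB (quantized).
print(round(12.1 / 4.0, 2), "x smaller")
```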

5 Conclusion

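With PaliGemma 2 we present a new family of open-weight models spanning a broad range of model sizes and input resolutions. PaliGemma 2 obtains strong transfer performance across a broad range of captioning, VQA, and video tasks. In particular, the newly added larger variants lead to significant improvements compared to PaliGemma for users with a larger compute budget. Furthermore, we show that PaliGemma 2 excels in applications beyond those considered in PaliGemma, including domains such as music, molecules, and medical imaging.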


References

Acharya et al. [2019] M. Acharya, K. Kafle, and C. Kanan. TallyQA: Answering complex counting questions. In AAAI, 2019.
Agrawal et al. [2019] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. NoCaps: Novel object captioning at scale. In ICCV, 2019.
Alabdulmohsin et al. [2023] I. Alabdulmohsin, X. Zhai, A. Kolesnikov, and L. Beyer. Getting vit in shape: Scaling laws for compute-optimal model design. In NeurIPS, 2023.
Alayrac et al. [2022] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
Bai et al. [2023] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023.
Bello et al. [2016] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. arXiv:1611.09940, 2016.
Betker et al. [2023] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Technical Report, 2023.
Beyer et al. [2022] L. Beyer, X. Zhai, and A. Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022.
Beyer et al. [2024] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcerver, P. Papalampidi, O. Henaff, X. Xiong, R. Soricut, J. Harmsen, and X. Zhai. PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726, 2024.
Biten et al. [2019] A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, C. Jawahar, E. Valveny, and D. Karatzas. Scene text visual question answering. In ICCV, Oct. 2019.
Changpinyo et al. [2022] S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, and R. Soricut. All you may need for VQA are image captions. In NAACL, 2022.
Chen and Dolan [2011] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
Chen et al. [2022a] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. E. Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022a.
Chen et al. [2022b] X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, A. Kolesnikov, J. Puigcerver, N. Ding, K. Rong, H. Akbari, G. Mishra, L. Xue, A. Thapliyal, J. Bradbury, W. Kuo, M. Seyedhosseini, C. Jia, B. K. Ayan, C. Riquelme, A. Steiner, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI: A jointly-scaled multilingual language-image model. arXiv:2209.06794, 2022b.
Chen et al. [2023] X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J. Wu, P. Voigtlaender, B. Mustafa, S. Goodman, I. Alabdulmohsin, P. Padlewski, D. Salz, X. Xiong, D. Vlasic, F. Pavetic, K. Rong, T. Yu, D. Keysers, X. Zhai, and R. Soricut. PaLI-3 vision language models: Smaller, faster, stronger. arXiv:2310.09199, 2023.
Chen et al. [2024] X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay, S. Shakeri, M. Dehghani, D. Salz, M. Lucic, M. Tschannen, A. Nagrani, H. Hu, M. Joshi, B. Pang, C. Montgomery, P. Pietrzyk, M. Ritter, A. J. Piergiovanni, M. Minderer, F. Pavetic, A. Waters, G. Li, I. Alabdulmohsin, L. Beyer, J. Amelot, K. Lee, A. P. Steiner, Y. Li, D. Keysers, A. Arnab, Y. Xu, K. Rong, A. Kolesnikov, M. Seyedhosseini, A. Angelova, X. Zhai, N. Houlsby, and R. Soricut. PaLI-X: On scaling up a multilingual vision and language model. In CVPR, 2024.
Ch’ng and Chan [2017] C. K. Ch’ng and C. S. Chan. Total-Text: A comprehensive dataset for scene text detection and recognition. In ICDAR, 2017.
Dai et al. [2023] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023.
Deitke et al. [2024] M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini, et al. Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv:2409.17146, 2024.
Desai and Johnson [2021] K. Desai and J. Johnson. Virtex: Learning visual representations from textual annotations. In CVPR, 2021.
Gemma Team [2024a] Gemma Team. Gemma: Open models based on gemini research and technology. arXiv:2403.08295, 2024a.
Gemma Team [2024b] Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024b.
Goldberger et al. [2000] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23), 2000.
Google Cloud [20xx] Google Cloud. Introduction to Cloud TPU. https://cloud.google.com/tpu/docs/intro-to-tpu, 20xx. Accessed: 2024-07-04.
Goyal et al. [2017] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.
Gurari et al. [2018] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz Grand Challenge: Answering visual questions from blind people. In CVPR, 2018.
Hsu et al. [2021] T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. arXiv:2110.11624, 2021.
Huang et al. [2023] Y. Huang, N. Lu, D. Chen, Y. Li, Z. Xie, S. Zhu, L. Gao, and W. Peng. Improving table structure recognition with visual-alignment sequential coordinate modeling. In CVPR, 2023.
Hudson and Manning [2019] D. Hudson and C. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR, 2019.
Jain et al. [2022] S. Jain, A. Agrawal, A. Saporta, S. Truong, T. Bui, P. Chambon, Y. Zhang, M. P. Lungren, A. Y. Ng, C. Langlotz, et al. RadGraph: Extracting clinical entities and relations from radiology reports. In NeurIPS Datasets and Benchmarks Track, 2022.
Jia et al. [2021] C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
Jocher et al. [2023] G. Jocher, J. Qiu, and A. Chaurasia. Ultralytics YOLO, 2023. URL https://github.com/ultralytics/ultralytics.
Johnson et al. [2019] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-Y. Deng, R. G. Mark, and S. Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019.
Kar et al. [2024] O. F. Kar, A. Tonioni, P. Poklukar, A. Kulshrestha, A. Zamir, and F. Tombari. BRAVE: Broadening the visual encoding of vision-language models. arXiv:2404.07204, 2024.
Karamcheti et al. [2024] S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic VLMs: Investigating the design space of visually-conditioned language models. arXiv:2402.07865, 2024.
Karatzas et al. [2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, 2015.
Karkkainen and Joo [2021] K. Karkkainen and J. Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In WACV, 2021.
Kawakatsu [2024] T. Kawakatsu. Multi-cell decoder and mutual learning for table structure and character recognition. In ICDAR, 2024.
Kazemzadeh et al. [2014] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, Oct. 2014.
Kembhavi et al. [2016] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. In ECCV, 2016.
Kim et al. [2016] S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, et al. Pubchem substance and compound databases. Nucleic acids research, 44(D1):D1202–D1213, 2016.
Kingma and Ba [2017] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2017.
Krishna et al. [2017] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In ICCV, 2017.
Krylov et al. [2021] I. Krylov, S. Nosov, and V. Sovrasov. Open images v5 text annotation and yet another mask text spotter. In ACCV, 2021.
Laurençon et al. [2024] H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models? arXiv:2405.02246, 2024.
Lees et al. [2022] A. Lees, V. Q. Tran, Y. Tay, J. Sorensen, J. Gupta, D. Metzler, and L. Vasserman. A new generation of perspective API: Efficient multilingual character-level transformers. arXiv:2202.11176, 2022.
Li et al. [2024] B. Li, H. Zhang, K. Zhang, D. Guo, Y. Zhang, R. Zhang, F. Li, Z. Liu, and C. Li. LLaVA-NeXT: What else influences visual instruction tuning beyond data?, May 2024. URL https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/.
Li et al. [2023] J. Li, D. Li, S. Savarese, and S. C. H. Hoi. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
Li et al. [2020] Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget Captioning: Generating natural language description for mobile user interface elements. In EMNLP, 2020.
Li et al. [2022] Y. Li, H. Mao, R. Girshick, and K. He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022.
Lin et al. [2014] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv:1405.0312, 2014.
Liu et al. [2021] F. Liu, E. Bugliarello, E. M. Ponti, S. Reddy, N. Collier, and D. Elliott. Visually grounded reasoning across languages and cultures. In EMNLP, Nov. 2021.
Liu et al. [2023a] F. Liu, G. E. T. Emerson, and N. Collier. Visual spatial reasoning. TACL, 11:635–651, 2023a.
Liu et al. [2023b] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023b.
Lobry et al. [2020] S. Lobry, D. Marcos, J. Murray, and D. Tuia. RSVQA: Visual question answering for remote sensing data. IEEE Trans. on Geoscience and Remote Sensing, 58(12), Dec. 2020.
Long et al. [2022] S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. Towards end-to-end unified scene text detection and layout analysis. In CVPR, 2022.
Long et al. [2023] S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. ICDAR 2023 competition on hierarchical text detection and recognition. In ICDAR, 2023.
Long et al. [2024] S. Long, S. Qin, Y. Fujii, A. Bissacco, and M. Raptis. Hierarchical text spotter for joint text spotting and layout analysis. In WACV, 2024.
Lu et al. [2022] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022.
Ly and Takasu [2023] N. T. Ly and A. Takasu. An end-to-end multi-task learning model for image-based table recognition. arXiv:2303.08648, 2023.
Mao et al. [2016] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
Marino et al. [2019] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
Masry et al. [2022] A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In ACL, May 2022.
Mathew et al. [2020] M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. arXiv:2007.00398, 2020.
Mathew et al. [2022] M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar. InfographicVQA. In WACV, 2022.
McKinzie et al. [2024] B. McKinzie, Z. Gan, J. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers, A. Belyi, H. Zhang, K. Singh, D. Kang, A. Jain, H. Hè, M. Schwarzer, T. Gunter, X. Kong, A. Zhang, J. Wang, C. Wang, N. Du, T. Lei, S. Wiseman, G. Yin, M. Lee, Z. Wang, R. Pang, P. Grasch, A. Toshev, and Y. Yang. MM1: methods, analysis & insights from multimodal LLM pre-training. arXiv:2403.09611, 2024.
Mishra et al. [2019] A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
Nayef et al. [2017] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon, et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT. In ICDAR, 2017.
Onoe et al. [2024] Y. Onoe, S. Rane, Z. Berger, Y. Bitton, J. Cho, R. Garg, A. Ku, Z. Parekh, J. Pont-Tuset, G. Tanzer, S. Wang, and J. Baldridge. DOCCI: Descriptions of Connected and Contrasting Images. In ECCV, 2024.
Pang [2024] H. Pang. YOLO-DocLayNet, Jan. 2024. URL https://github.com/ppaanngggg/yolo-doclaynet.
Pavlov et al. [2011] D. Pavlov, M. Rybalkin, B. Karulin, M. Kozhevnikov, A. Savelyev, and A. Churinov. Indigo: Universal cheminformatics API. Journal of Cheminformatics, 3(Suppl 1):P4, 2011.
Peng et al. [2023] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023.
Pfeiffer et al. [2022] J. Pfeiffer, G. Geigle, A. Kamath, J.-M. Steitz, S. Roth, I. Vulić, and I. Gurevych. xGQA: Cross-lingual visual question answering. In ACL, 2022.
Pfitzmann et al. [2022] B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar. DocLayNet: A large human-annotated dataset for document-layout segmentation. In SIGKDD, 2022.
Piergiovanni et al. [2022] A. Piergiovanni, W. Kuo, and A. Angelova. Pre-training image-language transformers for open-vocabulary tasks. arXiv:2209.04372, 2022.
Qian et al. [2023] Y. Qian, J. Guo, Z. Tu, Z. Li, C. W. Coley, and R. Barzilay. MolScribe: Robust molecular structure recognition with image-to-graph generation. J. Chem. Inf. Model., 63(7), 2023.
Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.
Rashkin et al. [2023] H. Rashkin, V. Nikolaev, M. Lamm, L. Aroyo, M. Collins, D. Das, S. Petrov, G. S. Tomar, I. Turc, and D. Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, 2023.
Ríos-Vila et al. [2023] A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza. End-to-end optical music recognition for pianoform sheet music. IJDAR, 26(3):347–362, 2023.
Ríos-Vila et al. [2024] A. Ríos-Vila, J. Calvo-Zaragoza, and T. Paquet. Sheet Music Transformer: End-to-end optical music recognition beyond monophonic transcription. In ICDAR, 2024.
Schwenk et al. [2022] D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. arXiv:2206.01718, 2022.
Sidorov et al. [2020] O. Sidorov, R. Hu, M. Rohrbach, and A. Singh. TextCaps: A dataset for image captioning with reading comprehension. In ECCV, 2020.
Singh et al. [2019] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Parikh, and M. Rohrbach. Towards VQA models that can read. In CVPR, 2019.
Singh et al. [2021] A. Singh, G. Pang, M. Toh, J. Huang, W. Galuba, and T. Hassner. TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text. In CVPR, 2021.
Smock et al. [2022] B. Smock, R. Pesala, and R. Abraham. GriTS: Grid table similarity metric for table structure recognition. arXiv:2203.12555, 2022.
Smock et al. [2023] B. Smock, R. Pesala, and R. Abraham. Aligning benchmark datasets for table structure recognition. In ICDAR, 2023.
Suhr et al. [2019] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi. A corpus for reasoning about natural language grounded in photographs. In ACL, 2019.
Susano Pinto et al. [2023] A. Susano Pinto, A. Kolesnikov, Y. Shi, L. Beyer, and X. Zhai. Tuning computer vision models with task rewards. In ICML, 2023.
Tan and Bansal [2019] H. Tan and M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, 2019.
Tanno et al. [2024] R. Tanno, D. Barrett, A. Sellergren, S. Ghaisas, S. Dathathri, A. See, J. Welbl, K. Singhal, S. Azizi, T. Tu, M. Schaekermann, R. May, R. Lee, S. Man, Z. Ahmed, S. Mahdavi, Y. Matias, J. Barral, A. Eslami, D. Belgrave, V. Natarajan, S. Shetty, P. Kohli, P.-S. Huang, A. Karthikesalingam, and I. Ktena. Collaboration between clinicians and vision–language models in radiology report generation. Nature Medicine, 2024.
Thapliyal et al. [2022] A. V. Thapliyal, J. Pont Tuset, X. Chen, and R. Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In EMNLP, 2022.
Tong et al. [2024] S. Tong, E. Brown, P. Wu, S. Woo, M. Middepogu, S. C. Akula, J. Yang, S. Yang, A. Iyer, X. Pan, A. Wang, R. Fergus, Y. LeCun, and S. Xie. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv:2406.16860, 2024.
Tschannen et al. [2023] M. Tschannen, M. Kumar, A. Steiner, X. Zhai, N. Houlsby, and L. Beyer. Image captioners are scalable vision learners too. In NeurIPS, 2023.
Wan et al. [2024] B. Wan, M. Tschannen, Y. Xian, F. Pavetic, I. Alabdulmohsin, X. Wang, A. S. Pinto, A. Steiner, L. Beyer, and X. Zhai. LocCa: Visual pretraining with location-aware captioners. In NeurIPS, 2024.
Wang et al. [2021] B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2Words: Automatic mobile UI summarization with multimodal learning. In Symposium on User Interface Software and Technology, 2021.
Wang et al. [2022a] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. Wang. GIT: A generative image-to-text transformer for vision and language. TMLR, 2022a.
Wang et al. [2019] X. Wang, J. Wu, J. Chen, L. Li, Y.-F. Wang, and W. Y. Wang. VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019.
Wang et al. [2022b] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao. SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022b.
Weininger [1988] D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
Xu et al. [2017] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017.
Xu et al. [2016] J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
Yang et al. [2024] L. Yang, S. Xu, A. Sellergren, T. Kohlberger, Y. Zhou, I. Ktena, A. Kiraly, F. Ahmed, F. Hormozdiari, T. Jaroensri, E. Wang, E. Wulczyn, F. Jamil, T. Guidroz, C. Lau, S. Qiao, Y. Liu, A. Goel, K. Park, A. Agharwal, N. George, Y. Wang, R. Tanno, D. G. T. Barrett, W.-H. Weng, S. S. Mahdavi, K. Saab, T. Tu, S. R. Kalidindi, M. Etemadi, J. Cuadros, G. Sorensen, Y. Matias, K. Chou, G. Corrado, J. Barral, S. Shetty, D. Fleet, S. M. A. Eslami, D. Tse, S. Prabhakara, C. McLean, D. Steiner, R. Pilgrim, C. Kelly, S. Azizi, and D. Golden. Advancing multimodal medical capabilities of Gemini. arXiv:2405.03162, 2024.
Ye et al. [2024] Q. Ye, H. Xu, J. Ye, M. Yan, A. Hu, H. Liu, Q. Qian, J. Zhang, and F. Huang. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. In CVPR, 2024.
You et al. [2024] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024.
Yu et al. [2022] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. TMLR, 2022.
Yu et al. [2016] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.
Yu et al. [2019] Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019.
Zhai et al. [2023] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
Zhang et al. [2024] H. Zhang, M. Gao, Z. Gan, P. Dufter, N. Wenzel, F. Huang, D. Shah, X. Du, B. Zhang, Y. Li, et al. MM1.5: Methods, analysis & insights from multimodal LLM fine-tuning. arXiv:2409.20566, 2024.
Zhao et al. [2023] Y. Zhao, A. Gu, R. Varma, L. Luo, C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel. VLDB, 2023.
Zheng et al. [2021] X. Zheng, D. Burdick, L. Popa, P. Zhong, and N. X. R. Wang. Global Table Extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In WACV, 2021.
Zhong et al. [2020] X. Zhong, E. ShafieiBavani, and A. Jimeno Yepes. Image-based table recognition: Data, model, and evaluation. In ECCV, 2020.

Contributions and Acknowledgments

Model development contributors

Core Contributors

Andreas Steiner
André Susano Pinto
Michael Tschannen

Contributors

Daniel Keysers
Xiao Wang
Yonatan Bitton
Alexey Gritsenko
Matthias Minderer
Anthony Sherbondy
Shangbang Long
Siyang Qin
Reeve Ingle
Emanuele Bugliarello
Sahar Kazemzadeh
Thomas Mesnard
Ibrahim Alabdulmohsin
Lucas Beyer
Xiaohua Zhai

Lead

Andreas Steiner

Acknowledgments

Jan Wassenberg
Basil Mustafa

Model release contributors
and general support

Gemma Model

Tris Warkentin
Alek Andreev
Armand Joulin
Victor Cotruta
Sanah Choudhry
Nathan Byrd

Open Models Success

Luiz Gustavo Martins
Kat Black
Phil Culliton
Chris Perry
D. Sculley
Sara Smoot

Marketing

Glenn Cameron
Natalie Dao

Kaggle

D. Sculley
Nilay Chauhan
Brenda Flynn
Kinjal Parekh

Developer Relations

Jetha Chan
Joe Fernandez
Ju-yeong Ji

Keras

Divyashree Sreepathihalli
Hongyu Chiu

Vertex AI

Keelin McDonell

Ethics and Safety

Antonia Paterson
Pankil Botadra

Hugging Face Partners

Merve Noyan
Pedro Cuenca
Pablo Montalvo

Nvidia Partners

Dong Meng
Manoj Kilaru
Shyamala Prayaga
Ryan Timbrook
Anna Warno

Ollama Partners

Michael Chiang
Jeffrey Morgan

Executive Sponsors

Raia Hadsell
Joelle Barral
Jeremiah Harmsen
Mat Velloso
Allen Hutchison

Appendix A Tasks

This section provides one training example for each of the transfer tasks that were added in PaliGemma 2, in addition to the tasks considered in [9].

	May 31, 2015
In millions	Gross unrealized losses	Fair Value	Gross unrealized losses	Fair Value	Gross unrealized losses	Fair Value
Type of issue:
General obligation municipal bonds	$(3.8)	$535.1	$(0.5)	$26.3	$(4.3)	$561.4
Revenue municipal bonds	$(3.2)	361.6	–	–	(3.2)	361.6
Total	$(7.0)	$896.7	$(0.5)	$26.3	$(7.5)	$923.0

Model	Caption
InstructBLIP	The image features a brown horse standing in a dirt area, with a saddle on its back. The horse appears to be well-trained, as it is standing still and paying attention to its surroundings. There are several people visible in the scene, including one person standing near the horse’s left side, another person standing near the horse’s right side, and a third person standing further away from the horse. In addition to the people, there are two cars parked in the background, one on the left side and another on the right side of the image.
mPLUG-Owl2	In this image we can see a horse on the ground. In the background there is a shed and we can see trees.
MiniGPT-4	This image is of a horse standing in the dirt near a building. The horse has a saddle on its back and is wearing a number 5 on its forehead. There is a barn or stable visible in the background, and some trees and shrubs in the foreground.
LLaVA-1.5	The image features a brown horse standing in a dirt field. The horse is wearing a saddle and bridle, indicating that it is ready for riding. The saddle is adorned with stars, adding a unique touch to the horse’s appearance. The horse appears to be calmly standing in the field, possibly waiting for its rider or simply enjoying the outdoors.
VILA	In the image, a majestic brown horse stands in a fenced area, its coat gleaming under the sunlight. The horse is adorned with a black saddle and bridle, ready for a ride. The number 55 is prominently displayed on its side, perhaps indicating its racing number or a unique identifier. The horse’s tail swishes gently, adding a sense of movement to the scene. The horse is positioned in front of a stone wall, which provides a rustic backdrop to the image. The wall’s texture contrasts with the smoothness of the horse’s coat, creating a visually interesting composition. The ground beneath the horse is covered in dirt, suggesting that this might be a training or riding area. The fence in the background further reinforces this idea, providing a safe and controlled environment for the horse. Overall, this image captures a moment of calm before what could be an exciting ride, with the horse poised and ready for action.
PaLI-5B (FT on DOCCI)	A side view of a brown horse that is standing on a dirt ground. The horse is facing to the right and has a black saddle on it. The horse has a white star on its face. The horse has a bridle around its neck. The horse is standing on a dirt ground that is covered in white rocks. There is a stone wall behind the horse. There is a metal shed behind the horse. There are trees behind the shed.
PaliGemma 3B 448px²	A side view of a brown horse that is standing on a dirt ground. The horse is facing to the left and has a black saddle on its back. The saddle is black and has a brown leather strap on it. The horse has a black rope around its neck. The horse has a white star hanging from its bridle. The horse has a white number "55" written on its back. The horse has white hooves. The horse has a black rope wrapped around its legs. The horse has a brown rope wrapped around its head. The horse has a black rope wrapped around its tail. The horse is standing on a dirt ground that has small white rocks on it. Behind the horse is a stone wall that is made up of different sized rocks. Behind the stone wall is a metal gate that is being held up by metal beams. Behind the gate is a tree line that is made up of green trees.
PaliGemma 2 3B 448px²	A brown horse is standing in a dirt area with rocks scattered around. The horse has a black saddle on its back and a white star painted on its chest. The number "55" is painted on the side of the horse in white. A stone building is behind the horse. A metal structure is in the background of the image with a wooden roof over it. Trees are in the background of the image as well.
PaliGemma 2 10B 448px²	A brown horse is standing in a dirt area with small rocks. The horse has a black saddle on its back and a white star painted on its side. The horse has "55" written on its back in white. There is a pile of horse manure in front of the horse. There is a stone wall behind the horse. There is a wooden structure with a metal roof behind the stone wall. There are trees in the background.

Appendix B Transfer and evaluation details

B.1 Text detection and recognition

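In all experiments, we fine-tune for 15,000 steps with a batch size of 256 on 256 TPU-v5e chips. The maximum sequence length is set to 2048. We experiment with learning rates $\{0.01,0.05,0.1,0.5,1.0\}\cdot 10^{-4}$ and find that $10^{-5}$ gives the best results. We also find that label smoothing of 0.1 improves the results. The best results are obtained at a resolution of 896px².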

B.2 Table Structure Recognition

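We use the same transfer setup and hyperparameter range as for text recognition, described in Sec. B.1, except that we set the maximum output length to 4096 and do not use label smoothing. The optimal fine-tuning learning rate is $10^{-4}$.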

Preprocessing

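Cropped table input images are padded to a square shape with white pixels and resized to the target image resolution. Bounding boxes of non-empty table cells are encoded using four PaliGemma location tokens of the form <locDDDD>, where DDDD encodes a quantized image location in the range 0000 to 1023. Boxes are specified using a special coords="<locXMIN><locYMIN><locXMAX><locYMAX>" attribute of the table cell <td> HTML tags. Training examples with an invalid table structure or overlapping cell bounding boxes are skipped. For FinTabNet training examples, additional corrections are applied to the cell bounding box annotations and cell text annotations using information from the source PDFs, following an approach similar to [86]. As is common in the literature [38], no filtering is applied to the test splits on which we report results.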
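
The box encoding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names `loc_token` and `cell_coords_attr`, the placement of the image at the top-left of the white square, the xmin, ymin, xmax, ymax token order, and the rounding rule are all assumptions made for the example.

```python
def loc_token(value: float, side: float, bins: int = 1024) -> str:
    """Quantize a pixel coordinate on a square canvas of edge `side` to <locDDDD>."""
    # Assumed rounding rule: scale to [0, bins - 1] and round to the nearest bin.
    index = round(value / side * (bins - 1))
    index = min(max(index, 0), bins - 1)
    return f"<loc{index:04d}>"


def cell_coords_attr(box, width: int, height: int) -> str:
    """Build the coords attribute for one non-empty table cell.

    `box` is (xmin, ymin, xmax, ymax) in pixels of the cropped table image.
    """
    # Padding to a square: the longer edge defines the canvas size.
    # Assumption: the image sits at the top-left, so no offset is added.
    side = max(width, height)
    tokens = [loc_token(v, side) for v in box]
    return 'coords="' + "".join(tokens) + '"'


# An 800x200 table crop with one cell spanning the whole image.
print(cell_coords_attr((0, 0, 800, 200), width=800, height=200))
# coords="<loc0000><loc0000><loc1023><loc0256>"
```

Because the coordinates are normalized by the padded square, the later resize to the target resolution does not change the tokens.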

B.3 Molecule structure recognition

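In all experiments, we fine-tune the pretrained checkpoint for 30,000 steps with a batch size of 256 on 256 TPU-v5e chips. The learning rate is set to $10^{-4}$, label smoothing to 0.1, and the maximum output length to 256. Images are padded to a square shape with white pixels and resized to the target image resolution.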

B.4 Optical music score recognition

We follow the training setup described in Sec. B.3, except that we set the maximum output length to 1024.

B.5 Generating long, fine-grained captions (DOCCI)

We rely on the transfer protocol and hyperparameters suggested in [9, Sec. 3.2.4].

Human evaluation protocol

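To evaluate the factual grounding of the generated captions, we conduct a human evaluation that assesses the relationship between each sentence and the corresponding image. Raters are presented with a highlighted sentence and asked, "What is the relationship of the highlighted sentence with respect to the image?" They then choose from four options: "Entailment", "Neutral", "Contradiction", and "Nothing to assess". These categories are adapted from the framework of [78] for evaluating the factual alignment between text and visual content. For example, the sentence "The pig has black, rounded hooves on its front and back feet and a pink nose" (Fig. 12) would be rated "Contradiction", because the image clearly shows pink hooves. Figure 1 shows the annotation interface. Each sentence was rated by five individuals, and the majority agreement was used as the rating result. The overall binary agreement is 0.8407, which is the proportion of cases in which all raters agree on the "Entailment" category. We refer to both "Contradiction" and "Neutral" as "Non-entailment". Examples of human evaluation results are shown in Table 4. We use the proportion of "Non-entailment" sentences to select the most factually aligned models.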
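
The aggregation described above (five ratings per sentence, majority vote, then the share of non-entailment sentences) can be sketched as below. The label strings, the function names, and the tie-breaking rule are assumptions for illustration; the text does not specify how ties are resolved.

```python
from collections import Counter

NON_ENTAILMENT = {"contradiction", "neutral"}


def majority_label(ratings):
    """Return the most frequent label among the ratings of one sentence."""
    # Assumption: on ties, the label that was seen first wins
    # (this is the ordering Counter.most_common uses for equal counts).
    return Counter(ratings).most_common(1)[0][0]


def non_entailment_rate(sentence_ratings):
    """Percentage of sentences whose majority label is contradiction or neutral."""
    labels = [majority_label(ratings) for ratings in sentence_ratings]
    flagged = sum(label in NON_ENTAILMENT for label in labels)
    return 100.0 * flagged / len(labels)


# Hypothetical ratings for four sentences, five raters each.
ratings = [
    ["entailment"] * 5,
    ["contradiction", "contradiction", "contradiction", "neutral", "entailment"],
    ["neutral", "neutral", "neutral", "entailment", "entailment"],
    ["entailment", "entailment", "entailment", "neutral", "nothing to assess"],
]
print(non_entailment_rate(ratings))  # 50.0
```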

B.6 Spatial reasoning

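We fine-tune the pretrained checkpoint with a batch size of 1024 on 64 TPU-v5e chips. The maximum output length is set to 18, which covers the training target outputs. We explore learning rates in $\{0.1,0.2,1.0,3.0\}\cdot 10^{-6}$, weight decay in $\{0.1,0.3,1.0\}\cdot 10^{-6}$, dropout probability in $\{0.0,0.1,0.2\}$, and the number of epochs in $\{1,3,5,10,15,30\}$.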
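
To give a sense of the size of this search space, the sketch below enumerates the listed values as a full Cartesian grid. Whether the sweep was actually run as a full grid is an assumption; the text only states the ranges that were explored, and the dictionary keys are illustrative names.

```python
from itertools import product

learning_rates = [x * 1e-6 for x in (0.1, 0.2, 1.0, 3.0)]
weight_decays = [x * 1e-6 for x in (0.1, 0.3, 1.0)]
dropouts = [0.0, 0.1, 0.2]
epochs = [1, 3, 5, 10, 15, 30]

# One configuration per combination of the four hyperparameters.
grid = [
    {"lr": lr, "weight_decay": wd, "dropout": p, "epochs": n}
    for lr, wd, p, n in product(learning_rates, weight_decays, dropouts, epochs)
]
print(len(grid))  # 216
```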

B.7 Radiography report generation

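Reports in the MIMIC-CXR dataset [33, 23] typically have the format INDICATION: .... FINDINGS: {...}. IMPRESSION: {...}, where the indication explains why the chest X-ray was ordered and provides clinical context for the radiologist, the findings enumerate salient features of the image, and the impression summarizes the radiologist's interpretation of the findings.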

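We train on the full reports and, at prediction time, mimic the clinical workflow by providing the indication as a prefix to the model. The model then predicts the findings and impression sections.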
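
The prefix/suffix split described above can be illustrated with a short sketch. The function name `split_report` and the example report text are hypothetical (the text is made up for illustration and is not taken from MIMIC-CXR); the sketch only assumes that section headers appear as `FINDINGS:` and `IMPRESSION:`.

```python
import re


def split_report(report: str):
    """Split a report into (prefix, suffix) at the first FINDINGS or IMPRESSION header."""
    match = re.search(r"\b(FINDINGS|IMPRESSION):", report)
    if match is None:
        raise ValueError("report has no FINDINGS or IMPRESSION section")
    prefix = report[: match.start()].strip()  # given to the model at prediction time
    suffix = report[match.start():].strip()   # predicted by the model
    return prefix, suffix


example = (
    "INDICATION: Shortness of breath. "
    "FINDINGS: The lungs are clear. "
    "IMPRESSION: No acute cardiopulmonary process."
)
prefix, suffix = split_report(example)
print(prefix)  # INDICATION: Shortness of breath.
print(suffix)  # FINDINGS: The lungs are clear. IMPRESSION: No acute cardiopulmonary process.
```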

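After an initial exploration based on PaliGemma 2 at 448px² resolution, we find that fine-tuning for 8 epochs with a learning rate of $5\cdot 10^{-6}$, without label smoothing, dropout, or weight decay, leads to good results when combined with greedy decoding. We fix these settings and sweep the learning rate again for higher resolutions and model sizes, considering learning rates in $\{0.03,0.1,0.3,1.0,5.0\}\cdot 10^{-4}$.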

Appendix C Object detection

	224px²			448px²			896px²
	PG1 3B	PG2 3B	PG2 10B	PG1 3B	PG2 3B	PG2 10B	PG1 3B	PG2 3B	PG2 10B
COCO	28.7	30.4	30.3	37.0	38.5	39.2	41.1	42.3	43.6
DocLayNet	50.8	46.7	50.4	64.1	62.5	63.5	66.5	66.1	66.0

Table 11: Mean average precision (mAP) after transfer to detection tasks. PG1 and PG2 refer to PaliGemma [9] and PaliGemma 2, respectively.

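Object detection has been used as a pretraining task in all members of the PaLI and PaliGemma families and improves downstream performance across a wide range of tasks [14]. In transfers, PaliGemma performs at or close to the state of the art on localization tasks such as referring expression comprehension and segmentation. This raises the question of how well PaliGemma performs on classical object detection tasks. We tested this by transferring PaliGemma to MS COCO [51] and to the DocLayNet document layout detection benchmark [74].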

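For both tasks, we use a transfer strategy inspired by pix2seq's sequence augmentation approach [13]. We use the prefix "detect all classes\n". In the suffix (target sequence), we first provide box coordinates and class names for all annotated objects, in random order. The suffix is then filled up to the maximum sequence length with noise boxes, where each noise box consists of random coordinates and a dedicated <noise> token in place of the class name. During training, no loss is applied to the coordinate tokens of the noise boxes, while the <noise> class tokens receive a loss as usual. This augmentation trains the model to output a larger number of boxes. In addition, it provides a mechanism for the model to express its confidence that a prediction represents a real object, in the form of the probability assigned to the <noise> token. During inference, the <noise> and <EOS> tokens are excluded from sampling. The likelihood of the class tokens is used as a confidence score.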
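
A minimal sketch of how such a target sequence and its loss mask could be assembled is shown below. It is not the authors' code. Several simplifications are assumptions: each class name is treated as a single token, the maximum sequence length is expressed as a number of boxes (`max_boxes`), coordinates are already quantized to 0..1023, and boxes are written in ymin, xmin, ymax, xmax order.

```python
import random

NOISE_TOKEN = "<noise>"  # dedicated class token used for padding boxes


def loc(value: int) -> str:
    """Format a quantized coordinate (0..1023) as a location token."""
    return f"<loc{value:04d}>"


def build_suffix(objects, max_boxes, rng):
    """Return (tokens, loss_mask) for one training example.

    objects: list of ((ymin, xmin, ymax, xmax), class_name) pairs.
    """
    objects = list(objects)[:max_boxes]
    rng.shuffle(objects)  # annotated objects appear in random order
    tokens, loss_mask = [], []
    for box, class_name in objects:
        tokens += [loc(v) for v in box] + [class_name]
        loss_mask += [1, 1, 1, 1, 1]  # real boxes: loss on coordinates and class
    for _ in range(max_boxes - len(objects)):
        ymin, ymax = sorted(rng.randrange(1024) for _ in range(2))
        xmin, xmax = sorted(rng.randrange(1024) for _ in range(2))
        tokens += [loc(v) for v in (ymin, xmin, ymax, xmax)] + [NOISE_TOKEN]
        loss_mask += [0, 0, 0, 0, 1]  # noise boxes: loss only on the <noise> token
    return tokens, loss_mask


tokens, loss_mask = build_suffix(
    [((10, 20, 300, 400), "cat")], max_boxes=3, rng=random.Random(0)
)
print(len(tokens), sum(loss_mask))  # 15 7
```

With one annotated object and room for three boxes, two noise boxes are appended; only their `<noise>` tokens contribute to the loss.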

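For COCO, we train for 50 epochs. Results are provided in Table 11. As expected, performance strongly depends on resolution. We also observe small but consistent improvements from better language models. Performance at 896px² is roughly on par with prior sequence-based approaches [13], but lags behind specialized detection architectures such as ViTDet [50].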

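For DocLayNet, we follow the same sequence augmentation approach and train for 50 epochs. Results are similar to COCO in that performance increases with resolution and Gemma 2 model size, although Gemma 1 performs on par with Gemma 2 on this task (Table 11). As for COCO, specialized detectors perform better on this task (e.g. YOLOv11 [32] reaches 79.5 mAP [70]).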

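These results show that, in contrast to many other tasks, classical detection poses a challenge to general-purpose VLMs such as PaliGemma. We hypothesize that the limiting factor is not the model's intrinsic object understanding, since it performs well on visual question answering and referring expression comprehension tasks. Instead, performance may be limited by a mismatch between the average precision metric, which rewards large numbers of predictions and accurate confidence scores, and the language modeling objective. Fine-tuning with a task-specific reward [88] could address this limitation, but it is beyond the scope of the simple transfer approach we propose for PaliGemma.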

Appendix D Ethics and Safety

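In addition to quality-related metrics, we also evaluate the new PaliGemma 2 VLMs with respect to a number of categories relevant to ethics and safety. These evaluations include prompts covering child safety, content safety, and representational harms, following the approach used in Gemma 2 [22], but with image captioning and visual question answering (VQA) setups.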

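In addition, we follow the setup used in [15] and use the Perspective API [46] with a threshold $>0.8$ to detect the presence of toxicity, profanity, and other potential issues in the image captions generated by the PaliGemma 2 VLMs for images sourced from the FairFace dataset [37]. We report the maximum and median values observed across subgroups for each of the perceived gender, ethnicity, and age attributes. Table 12 shows the overall results. Overall, we observe a low level of toxicity, profanity, and other issues across all slices and models. In addition, all PaliGemma 2 models perform comparably.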

Metric	Perceived Gender			Ethnicity			Age Group
	3B	10B	28B	3B	10B	28B	3B	10B	28B
	Maximum
Toxicity	0.14	0.15	0.19	0.29	0.39	0.39	0.26	0.18	0.32
Identity Attack	0.04	0.02	0.02	0.13	0.06	0.06	0.06	0.03	0.06
Insult	0.17	0.25	0.17	0.37	0.52	0.52	0.27	0.39	0.24
Threat	0.55	0.43	0.57	0.83	0.48	0.48	0.64	0.43	0.64
Profanity	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
	Median
Toxicity	0.13	0.10	0.18	0.07	0.07	0.14	0.12	0.08	0.12
Identity Attack	0.02	0.01	0.02	0.00	0.00	0.00	0.00	0.00	0.00
Insult	0.15	0.23	0.14	0.14	0.17	0.13	0.09	0.18	0.16
Threat	0.35	0.27	0.41	0.28	0.19	0.42	0.27	0.31	0.40
Profanity	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00

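Table 12: Safety statistics for captions generated by the PaliGemma 2 VLMs on FairFace [37] using the Perspective API [46]. Numbers indicate the fraction of instances with thresholds $\geq 0.8$ in [%], i.e. a value of e.g. 0.09 means 0.09%.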
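
The statistics in Table 12 can be read as a two-step aggregation: a per-subgroup percentage of flagged captions, followed by the maximum and median over subgroups. The sketch below illustrates this under stated assumptions: the per-caption attribute scores are taken as already computed (no API call is shown), the subgroup names and scores are hypothetical, and the comparison uses $\geq 0.8$ as in the table caption.

```python
from statistics import median


def percent_flagged(scores, threshold=0.8):
    """Percentage of captions whose attribute score reaches the threshold."""
    return 100.0 * sum(score >= threshold for score in scores) / len(scores)


def summarize(scores_by_subgroup, threshold=0.8):
    """Maximum and median flagged percentage across subgroups."""
    rates = [percent_flagged(s, threshold) for s in scores_by_subgroup.values()]
    return {"maximum": max(rates), "median": median(rates)}


# Hypothetical toxicity scores for three subgroups of one attribute.
scores_by_subgroup = {
    "group_a": [0.10, 0.90, 0.20, 0.30],
    "group_b": [0.10, 0.10, 0.10, 0.10],
    "group_c": [0.85, 0.95, 0.10, 0.20],
}
print(summarize(scores_by_subgroup))  # {'maximum': 50.0, 'median': 25.0}
```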

Appendix E Detailed results

	224px²			448px²
	3B	10B	28B	3B	10B	28B
AI2D [40]	74.7 ( $\pm 0.5$ )	83.1 ( $\pm 0.4$ )	83.2 ( $\pm 0.7$ )	76.0 ( $\pm 0.2$ )	84.4 ( $\pm 0.4$ )	84.6 ( $\pm 0.4$ )
AOKVQA-DA (val) [81]	64.2 ( $\pm 0.5$ )	68.9 ( $\pm 0.3$ )	70.2 ( $\pm 0.2$ )	67.9 ( $\pm 0.3$ )	70.8 ( $\pm 0.5$ )	71.2 ( $\pm 0.2$ )
AOKVQA-MC (val) [81]	79.7 ( $\pm 1.0$ )	83.7 ( $\pm 1.1$ )	84.7 ( $\pm 0.8$ )	82.5 ( $\pm 0.4$ )	85.9 ( $\pm 0.2$ )	87.0 ( $\pm 0.3$ )
ActivityNet-CAP [43]	34.2 ( $\pm 0.3$ )	35.9 ( $\pm 0.5$ )	-	-	-	-
ActivityNet-QA [107]	51.3 ( $\pm 0.2$ )	53.2 ( $\pm 0.4$ )	-	-	-	-
COCO-35L (avg34) [91]	113.9 ( $\pm 0.2$ )	115.8 ( $\pm 0.0$ )	116.5 ( $\pm 0.1$ )	115.8 ( $\pm 0.3$ )	117.2 ( $\pm 0.1$ )	117.2 ( $\pm 0.1$ )
COCO-35L (en) [91]	138.4 ( $\pm 0.2$ )	140.8 ( $\pm 0.3$ )	142.4 ( $\pm 0.4$ )	140.4 ( $\pm 0.4$ )	142.4 ( $\pm 0.4$ )	142.3 ( $\pm 0.8$ )
COCOcap [51]	141.3 ( $\pm 0.5$ )	143.7 ( $\pm 0.2$ )	144.0 ( $\pm 0.3$ )	143.4 ( $\pm 0.4$ )	145.0 ( $\pm 0.3$ )	145.2 ( $\pm 0.4$ )
ChartQA (aug) [63]	74.4 ( $\pm 0.7$ )	74.2 ( $\pm 0.8$ )	68.9 ( $\pm 0.6$ )	89.2 ( $\pm 0.4$ )	90.1 ( $\pm 0.5$ )	85.1 ( $\pm 0.2$ )
ChartQA (human) [63]	42.0 ( $\pm 0.3$ )	48.4 ( $\pm 1.1$ )	46.8 ( $\pm 0.6$ )	54.0 ( $\pm 0.6$ )	66.4 ( $\pm 0.5$ )	61.3 ( $\pm 0.6$ )
CountBenchQA [9]	81.0 ( $\pm 1.0$ )	84.0 ( $\pm 1.4$ )	86.4 ( $\pm 1.6$ )	82.0 ( $\pm 1.2$ )	85.3 ( $\pm 1.7$ )	87.4 ( $\pm 1.0$ )
DocVQA (val) [64]	39.9 ( $\pm 0.3$ )	43.9 ( $\pm 0.6$ )	44.9 ( $\pm 0.4$ )	73.6 ( $\pm 0.3$ )	76.6 ( $\pm 0.5$ )	76.1 ( $\pm 0.4$ )
GQA [29]	66.2 ( $\pm 0.3$ )	67.2 ( $\pm 0.2$ )	67.3 ( $\pm 0.2$ )	68.1 ( $\pm 0.2$ )	68.3 ( $\pm 0.3$ )	68.3 ( $\pm 0.1$ )
InfoVQA (val) [65]	25.2 ( $\pm 0.2$ )	33.6 ( $\pm 0.2$ )	36.4 ( $\pm 0.1$ )	37.5 ( $\pm 0.3$ )	47.8 ( $\pm 0.2$ )	46.7 ( $\pm 0.4$ )
MARVL (avg5) [52]	83.5 ( $\pm 0.2$ )	89.5 ( $\pm 0.2$ )	90.6 ( $\pm 0.2$ )	82.7 ( $\pm 0.3$ )	89.1 ( $\pm 0.0$ )	89.7 ( $\pm 0.1$ )
MSRVTT-CAP [101]	68.5 ( $\pm 1.3$ )	72.1 ( $\pm 0.5$ )	-	-	-	-
MSRVTT-QA [100]	50.5 ( $\pm 0.1$ )	51.9 ( $\pm 0.1$ )	-	-	-	-
MSVD-QA [12]	61.1 ( $\pm 0.2$ )	62.5 ( $\pm 0.2$ )	-	-	-	-
NLVR2 [87]	91.4 ( $\pm 0.1$ )	93.9 ( $\pm 0.2$ )	94.2 ( $\pm 0.1$ )	91.6 ( $\pm 0.2$ )	93.7 ( $\pm 0.2$ )	94.1 ( $\pm 0.2$ )
NoCaps [2]	123.1 ( $\pm 0.3$ )	126.3 ( $\pm 0.4$ )	127.1 ( $\pm 0.3$ )	123.5 ( $\pm 0.3$ )	126.9 ( $\pm 0.1$ )	127.0 ( $\pm 0.2$ )
OCR-VQA [67]	73.4 ( $\pm 0.0$ )	74.7 ( $\pm 0.1$ )	75.3 ( $\pm 0.2$ )	75.7 ( $\pm 0.1$ )	76.3 ( $\pm 0.1$ )	76.6 ( $\pm 0.1$ )
OKVQA [62]	64.2 ( $\pm 0.1$ )	68.0 ( $\pm 0.1$ )	71.2 ( $\pm 0.2$ )	64.1 ( $\pm 0.4$ )	68.6 ( $\pm 0.5$ )	70.6 ( $\pm 0.2$ )
RSVQA-hr (test) [55]	92.7 ( $\pm 0.1$ )	92.6 ( $\pm 0.0$ )	92.7 ( $\pm 0.0$ )	92.8 ( $\pm 0.0$ )	92.8 ( $\pm 0.1$ )	92.8 ( $\pm 0.1$ )
RSVQA-hr (test2) [55]	90.9 ( $\pm 0.1$ )	90.8 ( $\pm 0.1$ )	90.9 ( $\pm 0.1$ )	90.7 ( $\pm 0.2$ )	90.7 ( $\pm 0.2$ )	90.8 ( $\pm 0.1$ )
RSVQA-lr [55]	93.0 ( $\pm 0.4$ )	92.8 ( $\pm 0.6$ )	93.5 ( $\pm 0.2$ )	92.7 ( $\pm 0.8$ )	93.1 ( $\pm 0.6$ )	93.7 ( $\pm 0.4$ )
RefCOCO (testA) [106]	75.7 ( $\pm 0.2$ )	77.2 ( $\pm 0.1$ )	76.8 ( $\pm 0.1$ )	78.6 ( $\pm 0.3$ )	79.7 ( $\pm 0.1$ )	79.3 ( $\pm 0.1$ )
RefCOCO (testB) [106]	71.0 ( $\pm 0.3$ )	74.2 ( $\pm 0.3$ )	73.9 ( $\pm 0.1$ )	73.5 ( $\pm 0.1$ )	76.2 ( $\pm 0.3$ )	74.8 ( $\pm 0.1$ )
RefCOCO (val) [106]	73.4 ( $\pm 0.1$ )	75.9 ( $\pm 0.1$ )	75.0 ( $\pm 0.0$ )	76.3 ( $\pm 0.1$ )	78.2 ( $\pm 0.1$ )	77.3 ( $\pm 0.1$ )
RefCOCO+ (testA) [39]	72.7 ( $\pm 0.2$ )	74.7 ( $\pm 0.2$ )	73.6 ( $\pm 0.2$ )	76.1 ( $\pm 0.2$ )	77.7 ( $\pm 0.2$ )	76.6 ( $\pm 0.1$ )
RefCOCO+ (testB) [39]	64.2 ( $\pm 0.2$ )	68.4 ( $\pm 0.3$ )	67.1 ( $\pm 0.1$ )	67.0 ( $\pm 0.3$ )	71.1 ( $\pm 0.2$ )	68.6 ( $\pm 0.1$ )
RefCOCO+ (val) [39]	68.6 ( $\pm 0.1$ )	72.0 ( $\pm 0.2$ )	70.3 ( $\pm 0.2$ )	72.1 ( $\pm 0.3$ )	74.4 ( $\pm 0.1$ )	72.8 ( $\pm 0.1$ )
RefCOCOg (test) [61]	69.0 ( $\pm 0.2$ )	71.9 ( $\pm 0.1$ )	70.7 ( $\pm 0.1$ )	72.7 ( $\pm 0.1$ )	74.8 ( $\pm 0.1$ )	73.7 ( $\pm 0.1$ )
RefCOCOg (val) [61]	68.3 ( $\pm 0.3$ )	71.4 ( $\pm 0.2$ )	70.5 ( $\pm 0.1$ )	72.3 ( $\pm 0.2$ )	74.4 ( $\pm 0.1$ )	73.0 ( $\pm 0.1$ )
ST-VQA (val) [10]	61.9 ( $\pm 0.1$ )	64.3 ( $\pm 0.4$ )	65.1 ( $\pm 0.4$ )	80.5 ( $\pm 0.1$ )	82.0 ( $\pm 0.3$ )	81.8 ( $\pm 0.1$ )
SciCap [27]	165.1 ( $\pm 0.5$ )	159.5 ( $\pm 0.7$ )	156.9 ( $\pm 1.0$ )	183.3 ( $\pm 0.7$ )	177.2 ( $\pm 0.3$ )	172.7 ( $\pm 1.5$ )
ScienceQA [59]	96.1 ( $\pm 0.3$ )	98.2 ( $\pm 0.2$ )	98.2 ( $\pm 0.2$ )	96.2 ( $\pm 0.2$ )	98.5 ( $\pm 0.2$ )	98.6 ( $\pm 0.2$ )
Screen2Words [95]	113.3 ( $\pm 0.8$ )	117.8 ( $\pm 0.7$ )	122.8 ( $\pm 0.5$ )	114.0 ( $\pm 0.5$ )	119.1 ( $\pm 1.9$ )	123.4 ( $\pm 0.8$ )
TallyQA (complex) [1]	70.3 ( $\pm 0.3$ )	73.4 ( $\pm 0.1$ )	74.2 ( $\pm 0.1$ )	73.6 ( $\pm 0.2$ )	76.7 ( $\pm 0.3$ )	76.8 ( $\pm 0.2$ )
TallyQA (simple) [1]	81.8 ( $\pm 0.1$ )	83.2 ( $\pm 0.1$ )	83.4 ( $\pm 0.1$ )	85.3 ( $\pm 0.1$ )	86.2 ( $\pm 0.1$ )	85.7 ( $\pm 0.1$ )
TextCaps [82]	127.5 ( $\pm 0.3$ )	137.9 ( $\pm 0.3$ )	139.9 ( $\pm 0.4$ )	152.1 ( $\pm 0.3$ )	157.7 ( $\pm 0.7$ )	153.6 ( $\pm 0.5$ )
TextVQA (val) [83]	59.6 ( $\pm 0.3$ )	64.0 ( $\pm 0.3$ )	64.7 ( $\pm 0.2$ )	75.2 ( $\pm 0.2$ )	76.6 ( $\pm 0.1$ )	76.2 ( $\pm 0.1$ )
VATEX [97]	80.8 ( $\pm 0.4$ )	82.7 ( $\pm 0.5$ )	-	-	-	-
VQAv2 (minival) [25]	83.0 ( $\pm 0.2$ )	84.3 ( $\pm 0.2$ )	84.5 ( $\pm 0.1$ )	84.8 ( $\pm 0.2$ )	85.8 ( $\pm 0.1$ )	85.8 ( $\pm 0.2$ )
VizWizVQA (val) [26]	76.4 ( $\pm 0.4$ )	78.1 ( $\pm 0.4$ )	78.7 ( $\pm 0.2$ )	77.5 ( $\pm 0.2$ )	78.6 ( $\pm 0.4$ )	78.9 ( $\pm 0.5$ )
WidgetCap [49]	138.1 ( $\pm 0.7$ )	139.8 ( $\pm 1.0$ )	138.8 ( $\pm 0.8$ )	151.4 ( $\pm 0.8$ )	151.9 ( $\pm 0.4$ )	148.9 ( $\pm 0.7$ )
XM3600 (avg35) [91]	42.8 ( $\pm 0.1$ )	44.5 ( $\pm 0.1$ )	45.2 ( $\pm 0.1$ )	43.2 ( $\pm 0.1$ )	44.6 ( $\pm 0.1$ )	45.2 ( $\pm 0.1$ )
XM3600 (en) [91]	79.8 ( $\pm 0.7$ )	80.7 ( $\pm 0.3$ )	81.0 ( $\pm 0.9$ )	80.3 ( $\pm 0.8$ )	81.5 ( $\pm 0.4$ )	81.0 ( $\pm 0.2$ )
xGQA (avg7) [73]	58.6 ( $\pm 0.2$ )	61.4 ( $\pm 0.1$ )	61.1 ( $\pm 0.1$ )	60.4 ( $\pm 0.2$ )	62.6 ( $\pm 0.2$ )	62.1 ( $\pm 0.3$ )

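Table 13: Mean and standard deviation over 5 fine-tuning runs of the PaliGemma 2 3B, 10B, and 28B models at 224px² and 448px² resolutions on over 30 academic tasks from [9]. Task splits, preprocessing, metrics, and hyperparameters follow the 224px² versions from previous work. Only the learning rate is selected per model size, based on the validation splits.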

Task	Model	3e-7	6e-7	1e-6	3e-6	6e-6	1e-5	3e-5
AI2D (minival)	3B	61.8	67.6	70.6	75.0	76.9	75.1	68.8
	10B	80.0	82.9	85.3	84.4	82.9	82.1	69.2
	28B	81.9	82.3	83.2	85.9	85.0	83.4	75.7
AOKVQA-DA (val)	3B	59.3	62.9	64.0	64.6	63.6	59.3	52.8
	10B	67.7	68.6	68.8	66.6	64.6	57.3	50.5
	28B	69.7	70.2	69.8	69.0	66.3	60.8	51.1
AOKVQA-MC (val)	3B	76.9	78.7	79.4	80.8	77.2	76.9	63.8
	10B	83.8	83.3	83.3	82.7	79.4	75.5	56.1
	28B	83.3	84.0	85.1	82.5	82.4	78.2	58.4
ActivityNet-CAP (minival)	3B	26.1	28.5	28.5	30.6	30.0	30.6	29.8
	10B	28.6	31.4	30.8	31.6	30.0	31.1	28.6
ActivityNet-QA (minival)	3B	43.3	46.8	49.4	52.6	53.8	53.5	52.0
	10B	49.9	52.2	53.9	55.0	55.3	54.6	51.2
COCO-35L (avg34)	3B	110.1	111.8	113.6	113.9	113.6	113.2	111.7
	10B	115.4	115.8	115.2	113.6	112.9	112.2	111.7
	28B	116.7	116.6	115.4	114.0	112.1	111.2	109.6
COCO-35L (en)	3B	137.9	138.6	139.1	138.4	137.6	136.5	133.8
	10B	140.6	140.3	139.6	137.3	135.5	133.8	132.5
	28B	142.5	141.3	140.4	137.7	134.5	133.2	129.9
COCOcap (minival)	3B	146.3	146.7	145.4	147.2	147.1	147.0	142.0
	10B	148.3	149.4	148.2	148.3	147.0	146.5	143.6
	28B	148.8	149.5	149.2	149.5	148.2	145.3	145.7
ChartQA (aug) (minival)	3B	60.8	64.3	66.0	69.7	69.5	68.4	63.6
	10B	69.0	68.6	71.1	69.5	69.9	68.4	60.4
	28B	66.8	63.4	65.2	66.7	66.0	64.1	55.9
ChartQA (human) (minival)	3B	41.4	42.8	42.7	44.1	43.2	42.9	35.4
	10B	50.9	50.8	50.8	49.2	47.0	44.5	34.6
	28B	48.3	46.9	47.7	46.5	45.3	41.8	33.8
CountBenchQA	3B	82.7	82.9	82.0	79.0	82.0	78.0	70.4
	10B	88.2	84.7	85.1	82.9	81.4	78.2	65.7
	28B	87.8	88.4	88.4	88.6	86.7	83.3	69.6
DocVQA (val)	3B	37.8	37.9	37.3	39.4	40.2	38.7	32.5
	10B	42.4	40.9	42.2	44.1	41.4	39.8	29.6
	28B	42.7	42.1	43.1	45.2	42.1	40.5	30.9
GQA (minival)	3B	70.9	72.2	72.9	73.9	73.9	73.8	72.4
	10B	73.6	74.3	74.7	74.4	74.4	74.2	71.5
	28B	73.7	73.9	74.7	74.8	74.6	74.1	72.3
InfoVQA (val)	3B	21.6	22.9	23.8	25.4	25.2	25.1	22.3
	10B	33.4	33.5	33.2	33.2	32.2	29.8	21.7
	28B	36.9	36.6	36.3	36.2	35.5	34.1	25.4
MARVL (avg5)	3B	69.9	73.4	77.1	81.2	83.0	82.4	69.9
	10B	86.5	88.2	89.2	89.4	89.1	87.4	67.6
	28B	86.7	88.5	89.5	90.3	90.8	89.2	76.2
MSRVTT-CAP (minival)	3B	62.8	66.1	67.8	67.6	72.6	74.0	68.3
	10B	70.4	71.5	75.3	74.0	66.2	69.4	67.2
MSRVTT-QA (minival)	3B	44.1	47.0	48.5	51.1	52.0	51.2	49.9
	10B	49.3	51.2	51.9	53.2	53.1	52.1	49.7
MSVD-QA (minival)	3B	55.2	57.8	60.7	63.3	63.1	61.3	57.0
	10B	61.1	63.9	65.4	64.2	63.2	63.0	56.3
NLVR2 (minival)	3B	82.5	86.2	88.2	90.4	90.9	90.2	85.9
	10B	91.8	93.0	93.3	93.3	92.5	91.7	86.1
	28B	92.2	92.8	93.6	93.7	93.7	92.2	88.0
NoCaps	3B	123.3	123.6	124.0	123.4	122.5	120.5	112.3
	10B	126.7	126.1	126.0	125.2	122.1	120.5	111.5
	28B	127.5	127.5	126.5	124.0	123.0	120.3	113.0
OCR-VQA (minival)	3B	72.6	73.1	73.4	73.4	73.2	72.9	70.6
	10B	74.7	74.5	74.3	73.9	73.5	73.0	70.6
	28B	75.5	75.5	75.2	74.8	73.9	72.5	71.0
OKVQA (minival)	3B	49.4	52.3	54.3	57.6	56.2	52.9	47.2
	10B	57.8	60.5	61.3	60.8	58.7	55.6	44.1
	28B	64.6	64.4	65.4	63.8	60.6	56.8	46.4
RSVQA-hr (minival)	3B	92.8	93.2	93.3	93.0	93.3	93.4	93.3
	10B	93.3	93.2	93.1	93.0	93.4	93.3	89.4
	28B	93.1	93.4	93.3	93.3	93.3	93.3	92.9
RSVQA-lr (minival)	3B	90.7	92.4	92.7	93.3	92.1	92.2	92.3
	10B	92.3	92.7	92.0	91.7	91.8	92.8	92.0
	28B	91.8	92.1	92.4	92.7	92.9	92.9	92.3
RefCOCO (testA)	3B	73.1	74.5	75.3	75.5	75.8	75.8	74.1
	10B	76.7	76.9	77.1	77.2	77.1	76.1	71.6
	28B	76.2	76.7	76.8	76.8	76.6	75.5	71.6
RefCOCO (testB)	3B	68.0	70.1	70.8	71.2	70.8	70.9	69.7
	10B	73.8	74.3	74.3	74.2	73.4	73.4	68.6
	28B	73.0	73.9	73.8	72.8	73.1	72.0	68.4
RefCOCO (val)	3B	70.4	72.1	73.0	73.2	73.3	73.4	71.6
	10B	75.1	75.6	75.8	76.1	75.6	74.9	70.6
	28B	74.6	75.0	75.2	74.8	74.6	74.0	69.9
RefCOCO+ (testA)	3B	67.6	70.1	70.8	71.8	72.2	72.7	71.0
	10B	72.9	73.5	74.0	75.0	74.9	74.2	69.0
	28B	72.7	73.4	73.4	74.0	74.3	72.9	69.3
RefCOCO+ (testB)	3B	55.3	58.6	60.5	62.9	63.2	64.6	63.8
	10B	66.0	67.1	67.3	68.4	68.2	67.9	62.6
	28B	65.3	66.4	67.1	67.5	67.8	67.0	62.7
RefCOCO+ (val)	3B	61.3	64.2	65.8	67.0	67.9	68.6	67.5
	10B	69.8	70.8	71.1	72.0	71.8	71.3	66.5
	28B	69.0	70.0	70.4	70.8	71.0	70.4	65.7
RefCOCOg (test)	3B	65.5	67.2	68.4	68.7	68.9	69.0	67.2
	10B	70.9	71.6	71.6	71.7	71.3	70.4	65.2
	28B	69.9	70.5	70.8	70.7	70.6	69.7	64.9
RefCOCOg (val)	3B	65.2	67.0	67.8	68.0	68.0	68.2	66.1
	10B	70.8	71.4	71.4	71.4	71.0	70.0	64.9
	28B	69.9	70.4	70.2	70.2	70.1	69.2	64.0
ST-VQA (val)	3B	56.1	58.8	60.4	61.5	62.3	61.2	57.0
	10B	60.9	62.9	63.8	64.0	63.9	61.2	54.8
	28B	63.0	64.4	65.2	65.5	64.3	62.6	55.7
SciCap (minival)	3B	55.2	67.4	76.9	109.4	130.3	138.8	148.1
	10B	78.6	92.5	106.2	128.1	136.9	143.2	143.8
	28B	80.3	94.7	104.0	125.9	136.2	140.1	141.7
ScienceQA (minival)	3B	87.7	92.1	94.5	95.1	95.2	94.3	91.4
	10B	96.9	97.1	97.6	97.6	97.1	96.2	93.7
	28B	96.8	97.1	97.4	97.2	96.8	96.1	94.2
Screen2Words (minival)	3B	95.1	104.2	109.0	109.3	113.2	112.5	110.1
	10B	110.9	115.4	118.2	118.1	114.7	113.0	110.0
	28B	113.0	119.5	120.4	118.8	116.2	114.2	106.3
TallyQA (complex)	3B	66.6	67.8	68.6	70.0	70.0	70.5	66.7
	10B	72.0	72.5	73.4	73.5	72.7	72.0	65.8
	28B	73.1	73.5	73.9	74.8	73.8	73.0	68.1
TallyQA (simple)	3B	80.4	81.1	81.3	81.8	81.9	81.5	79.1
	10B	83.0	83.3	83.1	83.2	82.7	82.1	79.1
	28B	82.9	83.3	83.3	83.5	83.0	82.2	79.7
TextCaps (minival)	3B	122.8	131.9	136.5	136.2	133.6	132.8	126.0
	10B	140.3	145.3	145.4	145.4	144.2	141.0	125.8
	28B	150.9	149.0	150.2	145.5	144.0	142.1	126.2
TextVQA (val)	3B	57.6	58.7	59.3	59.6	59.4	58.0	51.1
	10B	63.4	64.1	63.9	63.2	61.6	58.1	48.3
	28B	64.5	64.7	65.3	64.8	63.3	59.3	49.9
VATEX (minival)	3B	84.4	87.2	89.8	90.7	90.2	90.2	86.3
	10B	91.4	93.2	93.4	93.7	90.4	89.9	84.5
VQAv2 (minival)	3B	80.9	81.5	82.1	82.7	82.4	81.9	79.6
	10B	83.8	84.1	84.3	83.7	83.1	82.0	79.4
	28B	83.8	84.1	84.1	83.8	82.8	82.0	79.7
VizWizVQA (val)	3B	72.5	74.2	74.8	76.4	76.6	76.7	74.0
	10B	76.1	77.1	77.8	78.0	77.3	77.2	73.3
	28B	76.3	77.6	78.2	78.8	77.8	76.7	72.5
WidgetCap (minival)	3B	137.0	141.9	141.8	142.3	141.7	140.6	129.7
	10B	146.3	148.4	150.9	148.2	144.5	140.8	133.3
	28B	144.0	147.6	145.9	147.0	144.1	143.0	133.0
XM3600 (avg35)	3B	44.2	43.9	43.7	42.7	41.7	40.8	37.8
	10B	45.0	44.5	43.9	42.1	40.7	39.3	36.8
	28B	45.2	44.6	44.0	42.3	41.1	39.1	35.8
XM3600 (en)	3B	83.7	83.1	82.2	79.1	78.3	76.9	70.9
	10B	82.5	80.6	78.6	75.0	73.0	72.0	69.9
	28B	80.9	79.8	79.4	76.4	73.6	71.3	66.1
xGQA (avg7)	3B	51.7	54.0	55.3	58.0	58.7	57.8	49.1
	10B	58.5	60.5	61.4	61.3	61.8	60.2	38.0
	28B	58.8	59.2	60.8	62.3	61.9	61.7	49.4

Table 14: Sweep of learning rates across tasks and model sizes at 224px² resolution. Although we report numbers for all metrics, learning rates were selected based on the validation split and not on the zero-shot numbers.
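
As a worked example of how a learning rate might be picked from such a sweep, the sketch below takes the AOKVQA-DA (val) rows of Table 14 and returns the best-scoring learning rate per model size. The function name `best_learning_rate` and the first-wins tie-breaking are assumptions for illustration; the numbers are copied from the table.

```python
LEARNING_RATES = [3e-7, 6e-7, 1e-6, 3e-6, 6e-6, 1e-5, 3e-5]

# AOKVQA-DA (val) scores from the learning-rate sweep, one list per model size.
SCORES = {
    "3B": [59.3, 62.9, 64.0, 64.6, 63.6, 59.3, 52.8],
    "10B": [67.7, 68.6, 68.8, 66.6, 64.6, 57.3, 50.5],
    "28B": [69.7, 70.2, 69.8, 69.0, 66.3, 60.8, 51.1],
}


def best_learning_rate(scores, learning_rates=LEARNING_RATES):
    """Return the learning rate with the highest validation score (first one on ties)."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return learning_rates[best_index]


for model, scores in SCORES.items():
    print(model, best_learning_rate(scores))
# 3B 3e-06
# 10B 1e-06
# 28B 6e-07
```

On this task the selected learning rate decreases as the model grows, in line with the tendency for larger models to prefer lower transfer learning rates.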

	224px²		448px²
Task	PG1	PG2	PG1	PG2
AI2D	72.1	74.7 ( $+2.6$ )	73.3	76.0 ( $+2.7$ )
AOKVQA-DA (val)	61.1	64.2 ( $+3.1$ )	65.7	67.9 ( $+2.2$ )
AOKVQA-MC (val)	78.5	79.7 ( $+1.2$ )	80.3	82.5 ( $+2.2$ )
ActivityNet-CAP	34.6	34.2 ( $-0.4$ )	-	-
ActivityNet-QA	50.8	51.3 ( $+0.5$ )	-	-
COCO-35L (avg34)	113.7	113.9 ( $+0.2$ )	115.8	115.8 ( $+0.0$ )
COCO-35L (en)	139.2	138.4 ( $-0.8$ )	141.2	140.4 ( $-0.8$ )
COCOcap	141.9	141.3 ( $-0.6$ )	144.6	143.4 ( $-1.2$ )
ChartQA (aug)	74.2	74.4 ( $+0.2$ )	88.5	89.2 ( $+0.7$ )
ChartQA (human)	40.0	42.0 ( $+2.0$ )	54.2	54.0 ( $-0.2$ )
CountBenchQA	81.9	81.0 ( $-0.9$ )	83.1	82.0 ( $-1.1$ )
DocVQA (val)	37.8	39.9 ( $+2.1$ )	74.1	73.6 ( $-0.5$ )
GQA	65.6	66.2 ( $+0.6$ )	67.0	68.1 ( $+1.1$ )
InfoVQA (val)	25.5	25.2 ( $-0.3$ )	37.0	37.5 ( $+0.5$ )
MARVL (avg5)	80.6	83.5 ( $+2.9$ )	76.8	82.7 ( $+5.9$ )
MSRVTT-CAP	70.5	68.5 ( $-2.0$ )	-	-
MSRVTT-QA	50.1	50.5 ( $+0.4$ )	-	-
MSVD-QA	60.2	61.1 ( $+0.9$ )	-	-
NLVR2	90.0	91.4 ( $+1.4$ )	88.9	91.6 ( $+2.7$ )
NoCaps	121.7	123.1 ( $+1.4$ )	123.6	123.5 ( $-0.1$ )
OCR-VQA	72.3	73.4 ( $+1.1$ )	74.6	75.7 ( $+1.1$ )
OKVQA	63.5	64.2 ( $+0.7$ )	63.2	64.1 ( $+0.9$ )
RSVQA-hr (test)	92.6	92.7 ( $+0.1$ )	92.8	92.8 ( $+0.0$ )
RSVQA-hr (test2)	90.6	90.9 ( $+0.3$ )	90.5	90.7 ( $+0.2$ )
RSVQA-lr	92.6	93.0 ( $+0.4$ )	93.1	92.7 ( $-0.4$ )
RefCOCO (testA)	75.7	75.7 ( $+0.0$ )	77.9	78.6 ( $+0.7$ )
RefCOCO (testB)	70.7	71.0 ( $+0.3$ )	72.4	73.5 ( $+1.1$ )
RefCOCO (val)	73.4	73.4 ( $+0.0$ )	75.6	76.3 ( $+0.7$ )
RefCOCO+ (testA)	71.9	72.7 ( $+0.8$ )	74.2	76.1 ( $+1.9$ )
RefCOCO+ (testB)	64.5	64.2 ( $-0.3$ )	64.5	67.0 ( $+2.5$ )
RefCOCO+ (val)	68.3	68.6 ( $+0.3$ )	69.8	72.1 ( $+2.3$ )
RefCOCOg (test)	68.2	69.0 ( $+0.8$ )	71.0	72.7 ( $+1.7$ )
RefCOCOg (val)	67.7	68.3 ( $+0.6$ )	70.1	72.3 ( $+2.2$ )
ST-VQA (val)	61.6	61.9 ( $+0.3$ )	79.7	80.5 ( $+0.8$ )
SciCap	162.3	165.1 ( $+2.8$ )	181.5	183.3 ( $+1.8$ )
ScienceQA	95.4	96.1 ( $+0.7$ )	95.9	96.2 ( $+0.3$ )
Screen2Words	117.6	113.3 ( $-4.3$ )	119.6	114.0 ( $-5.6$ )
TallyQA (complex)	69.6	70.3 ( $+0.7$ )	72.3	73.6 ( $+1.3$ )
TallyQA (simple)	81.7	81.8 ( $+0.1$ )	84.9	85.3 ( $+0.4$ )
TextCaps	127.5	127.5 ( $+0.0$ )	153.9	152.1 ( $-1.8$ )
TextVQA (val)	59.0	59.6 ( $+0.6$ )	74.6	75.2 ( $+0.6$ )
VATEX	79.7	80.8 ( $+1.1$ )	-	-
VQAv2 (minival)	82.1	83.0 ( $+0.9$ )	84.6	84.8 ( $+0.2$ )
VizWizVQA (val)	73.7	76.4 ( $+2.7$ )	75.5	77.5 ( $+2.0$ )
WidgetCap	136.1	138.1 ( $+2.0$ )	148.4	151.4 ( $+3.0$ )
XM3600 (avg35)	41.9	42.8 ( $+0.9$ )	42.4	43.2 ( $+0.8$ )
XM3600 (en)	78.0	79.8 ( $+1.8$ )	80.0	80.3 ( $+0.3$ )
xGQA (avg7)	57.3	58.6 ( $+1.3$ )	57.9	60.4 ( $+2.5$ )

Table 15: Comparison of PaliGemma 3B and PaliGemma 2 3B at 224px² and 448px² resolutions. PG1 and PG2 refer to PaliGemma [9] and PaliGemma 2, respectively.
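
The parenthesized values in Table 15 appear to be the difference PG2 minus PG1 at the same resolution. The short sketch below recomputes that difference for three rows copied from the 224px² columns. The names `rows_224` and `delta` are illustrative only and do not come from the paper.

```python
# Scores copied from three rows of Table 15 (224px² columns): (PG1, PG2).
rows_224 = {
    "AI2D": (72.1, 74.7),
    "COCOcap": (141.9, 141.3),
    "Screen2Words": (117.6, 113.3),
}


def delta(pg1: float, pg2: float) -> float:
    """Return PG2 - PG1, rounded to one decimal place as in the table."""
    return round(pg2 - pg1, 1)


# Hypothetical helper output: task name -> signed difference.
deltas = {task: delta(pg1, pg2) for task, (pg1, pg2) in rows_224.items()}

for task, d in deltas.items():
    # The explicit sign mirrors the "(+2.6)" / "(-0.6)" notation of the table.
    print(f"{task}: {d:+.1f}")
```

For these three rows the recomputed differences match the table entries (+2.6, -0.6 and -4.3). Rows whose 448px² cells are "-" have no PG1 or PG2 score at that resolution, so no difference is defined for them.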

Indication	Radiologist report	PaliGemma 2 3B 896px² prediction
INDICATION: Woman with cardiomyopathy and cdiff with acute desaturation and dyspnea // PE, pulmonary edema, vs aspiration PE, pulmonary edema, vs aspiration.	IMPRESSION: Enlargement of the cardiac silhouette with pulmonary edema. Bilateral pleural effusions, more prominent on the left.	FINDINGS: There is substantial enlargement of the cardiac silhouette with pulmonary edema. Retrocardiac opacification is consistent with volume loss in the left lower lobe and pleural effusion. In the appropriate clinical setting, superimposed pneumonia would have to be considered.

	May 31, 2015
	Securities in an unrealized loss position for less than twelve months		Securities in an unrealized loss position for more than twelve months		Total
In millions	Gross unrealized losses	Fair value	Gross unrealized losses	Fair Value	Gross unrealized losses	Fair Value
Type of issue:
General obligation municipal bonds	$(3.8)	$355.1	$(0.5)	$26.3	$(4.3)	$561.4
Revenue municipal bonds	$(3.2)	361.6	–	–	(3.2)	361.6
Total	$(7.0)	$896.7	$(0.5)	$26.3	$(7.5)	$923.0
