Red Hat AI Inference Server Overview and Installation Guide (Linux)
“Anyone who’s actually tried to stand up an LLM server knows this: tokens are cheap, but infrastructure is not.” That’s exactly the gap Red Hat AI Inference Server is trying to fill. It gives you enterprise-grade operations, while still feeling, from a developer’s point of view, like a neatly packaged “vLLM-in-a-box.”
1. What is Red Hat AI Inference Server?
Red Hat AI Inference Server is an enterprise inference server designed to serve a wide range of LLM and generative models quickly and cost‑effectively across hybrid cloud environments.
It uses vLLM as its core engine and bundles it with LLM Compressor, a curated and validated model repository, and broad AI accelerator support into a single product.
- Provides a consistent inference environment across the hybrid cloud (on‑premises, public cloud, and edge).
- Acts as a unified inference layer supporting various accelerators (NVIDIA, AMD, Intel, IBM, cloud provider GPUs, and more).
- Exposes OpenAI‑compatible APIs to drastically reduce the integration effort for existing applications.
2. Key Features
2.1 High‑performance inference powered by vLLM
- vLLM core engine: Targets several‑times higher token throughput than traditional serving approaches using techniques like PagedAttention and continuous batching.
- Multi‑GPU and large context support: Handles large models and long contexts efficiently using tensor and pipeline parallelism.
- Efficient memory management: Optimizes KV cache handling to reduce GPU memory usage and increase throughput.
2.2 Enterprise‑grade packaging
- Hardened vLLM distribution: Red Hat ships tested and validated vLLM images as a supported package.
- Validated model repository: Provides optimized and verified models under the Red Hat AI organization on Hugging Face so you can serve them out of the box.
- Integrated LLM Compressor: Supports quantization and compression workflows to shrink model size and speed up inference while largely preserving accuracy.
2.3 Flexible deployment and APIs
- “Deploy anywhere”: Runs on RHEL / RHEL AI / OpenShift, and within policy limits can also be deployed on other Linux and Kubernetes platforms.
- OpenAI‑compatible HTTP API: Designed so that most existing client libraries just work, minimizing application changes.
- Enterprise operations: Built to integrate with Red Hat’s ecosystem for monitoring, logging, security, and upgrades.
3. Inference Server vs vLLM: What’s the difference?
The easiest way to think about it is the relationship between an engine (vLLM) and the car built around it (Inference Server). vLLM is the open‑source inference engine; Red Hat AI Inference Server is the commercial, operations‑ready product that ships with that engine under the hood.
3.1 Conceptual differences
| Item | Red Hat AI Inference Server | vLLM (open source) |
|---|---|---|
| Nature | Commercial enterprise product | Open‑source inference engine |
| Core engine | Bundles vLLM internally | vLLM itself |
| Purpose | Unified inference platform for hybrid cloud | High‑performance LLM serving engine |
| Support | Red Hat support, security, hardening | Community/self‑support |
3.2 Functional and operational differences
| Aspect | Red Hat AI Inference Server | vLLM (standalone) |
|---|---|---|
| Installation | Red Hat‑provided container images and docs, optimized for RHEL/OpenShift | Flexible: PyPI, Docker, source builds, etc. |
| Model repository | Curated catalog of Red Hat‑validated, optimized models | You pick and validate models yourself from Hugging Face, etc. |
| Optimization tools | Integrated with LLM Compressor, turnkey quantization/compression workflows | You must integrate separate quantization tools manually |
| Supported platforms | RHEL AI, OpenShift; other Linux in a 3rd‑party support scope | Runs almost anywhere, but operations are entirely on you |
| Security & updates | Red Hat advisories, patches, lifecycle management | You track releases and manage updates on your own |
In short, once you move beyond a “quick PoC with plain vLLM” and want a standard AI inference platform for your organization, Red Hat AI Inference Server starts to make a lot more sense.
4. Installing on Linux: Step‑by‑step
The following walkthrough shows how to spin up the Red Hat AI Inference Server container on a single Linux server with an NVIDIA GPU. It’s written with RHEL 9 in mind, but if your container runtime and GPU stack are correctly configured, similar steps will work on other compatible distributions (keeping in mind they may not be officially supported).
4.1 Prerequisites
- OS and privileges
  - RHEL 9.x family (or a compatible Linux), with a user that has sudo privileges.
- Red Hat account and subscription
  - A Red Hat account with access to `registry.redhat.io` and an active Inference Server subscription.
- GPU and drivers
  - A data center‑class NVIDIA GPU (for example, A100 or L40S) with compatible drivers installed.
  - Verify that the system recognizes the GPU using `nvidia-smi`.
- Container runtime
  - Podman (recommended by Red Hat) or Docker installed.
- Hugging Face token
  - A Hugging Face account and access token if you plan to pull private models or integrate with the HF Hub.
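Before moving on, it can help to sanity‑check the prerequisites from a script rather than by eye. The following is a minimal sketch (the `preflight` helper is hypothetical, not part of the product; swap `podman` for `docker` if that is your runtime):

```python
import shutil

def preflight(commands=("podman", "nvidia-smi")):
    """Report which prerequisite CLI tools are available on PATH."""
    return {cmd: shutil.which(cmd) is not None for cmd in commands}

if __name__ == "__main__":
    for cmd, found in preflight().items():
        print(f"{cmd}: {'found' if found else 'MISSING'}")
```

If anything reports MISSING, finish the corresponding setup step below before pulling images.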
4.2 Install Podman and the GPU stack (example)
```shell
# Install Podman (RHEL 9 example)
sudo dnf install -y podman

# After installing the NVIDIA driver, set up NVIDIA Container Toolkit
# (follow the official guide for your specific distribution and GPU)
nvidia-smi  # Check that it prints normal GPU information
```
How you install and configure the NVIDIA Container Toolkit depends on your GPU and distribution, so be sure to follow the official NVIDIA documentation for your environment.
4.3 Log in to the Red Hat registry
```shell
# Log in to the Red Hat registry
podman login registry.redhat.io
# Use your Red Hat Customer Portal credentials for username/password
```
You must have the correct subscription in place to pull the Inference Server images.
4.4 Pull the Inference Server container image
Images are split by version and accelerator type. For example, an NVIDIA CUDA‑based vLLM image looks like this:
```shell
# Example: vLLM + CUDA + RHEL 9 Inference Server image (adjust version as needed)
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
```
For other hardware (IBM Spyre, AMD ROCm, etc.), check the official Getting Started documentation for the exact image name corresponding to your accelerator.
4.5 Prepare SELinux and volumes
If SELinux is enabled, you need to allow device access and volume mounts accordingly.
```shell
# Example: create local directories for models and logs
sudo mkdir -p /opt/rhaiis/models
sudo mkdir -p /opt/rhaiis/logs
sudo chown -R $USER:$USER /opt/rhaiis

# Adjust SELinux context or policies as required by your environment.
# For quick testing, you can disable labels for the container:
# use --security-opt=label=disable when running the container.
```
4.6 Set environment variables (models, tokens, etc.)
Set environment variables so the server can pull models from the HF Hub or use local paths.
```shell
export HUGGING_FACE_HUB_TOKEN="hf_xxx"            # If needed
export RHAIIS_MODEL_ID="granite-3.3-8b-instruct"  # Example model ID
export RHAIIS_PORT=8000
```
The model ID can be one of the validated models provided by Red Hat AI or any compatible model name from Hugging Face.
4.7 Run the Inference Server container (NVIDIA GPU example)
Here’s an example of running the vLLM‑based Inference Server on a single server with NVIDIA GPUs:
```shell
podman run --rm -d \
  --name rhaiis-vllm \
  --gpus all \
  -p ${RHAIIS_PORT}:8000 \
  -e HF_TOKEN="${HUGGING_FACE_HUB_TOKEN}" \
  -e MODEL_ID="${RHAIIS_MODEL_ID}" \
  -v /opt/rhaiis/models:/models \
  -v /opt/rhaiis/logs:/logs \
  --security-opt=label=disable \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
```
- The `MODEL_ID` determines which model is loaded by default; some images may expect slightly different configuration, so always check the sample commands in the official docs.
- Depending on your hardware, you may need additional flags like `--device` or `--group-add` (for example, `--device=/dev/vfio` for IBM Spyre).
You can monitor model loading and server status via container logs:
```shell
podman logs -f rhaiis-vllm
```
5. Quick tests after installation
Inference Server typically exposes an OpenAI‑compatible HTTP API, which means you can test it easily with curl.
5.1 Health check
```shell
curl http://localhost:${RHAIIS_PORT}/health
```
- If you receive a 200 OK or a health status JSON, the server itself is up and running (the exact endpoint may vary by image/version, so double‑check the docs if needed).
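Model loading can take a while, so rather than running curl by hand in a loop, you can script the readiness check. A minimal polling sketch using only the Python standard library (the `wait_for_health` helper is hypothetical; the `/health` path and port match the examples above but may vary by image/version):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url, timeout=60, interval=2):
    """Poll a health endpoint until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False

# Example (assumes the container from section 4.7 is running):
# wait_for_health("http://localhost:8000/health", timeout=120)
```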
5.2 Chat completion test (OpenAI‑style)
```shell
curl http://localhost:${RHAIIS_PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"${RHAIIS_MODEL_ID}"'",
    "messages": [
      {"role": "user", "content": "Hi, which model is responding to me right now?"}
    ],
    "max_tokens": 64,
    "temperature": 0.2
  }'
```
- If everything is working correctly, the JSON response will contain a `choices[0].message.content` field with the model's reply.
- Because the API is OpenAI‑style, you can plug it into existing Python `openai`/`requests` code, JavaScript `fetch`, and most other OpenAI‑compatible clients with minimal changes.
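To make the OpenAI‑style shape concrete, here is a sketch of building the same request body and pulling the reply out of a response in plain Python (the helper names are hypothetical and the model ID is the example from section 4.6; only the standard library is used):

```python
import json

def build_chat_request(model, user_message, max_tokens=64, temperature=0.2):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def extract_reply(response):
    """Pull the assistant's text out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]

payload = build_chat_request("granite-3.3-8b-instruct", "Hi, which model is responding?")
body = json.dumps(payload)  # this is what you POST to /v1/chat/completions
```

With the official `openai` Python client, the equivalent is pointing `base_url` at `http://localhost:8000/v1` and calling `chat.completions.create`; the payload shape is the same.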
5.3 Simple performance sanity checks
- Send a short prompt multiple times in a row and roughly estimate tokens per second from the responses.
- Gradually increase concurrent requests and observe how latency and throughput change; this is where you’ll start to feel the impact of vLLM’s continuous batching.
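A rough tokens‑per‑second figure can be computed from the `usage` block that OpenAI‑style responses typically include. A sketch, assuming each response carries `usage.completion_tokens` and that you time each request yourself (the helper is hypothetical, not part of the product):

```python
def tokens_per_second(responses_with_times):
    """Estimate throughput from (response_dict, elapsed_seconds) pairs.

    Each response is expected to carry an OpenAI-style usage block
    with a completion_tokens count.
    """
    total_tokens = sum(r["usage"]["completion_tokens"] for r, _ in responses_with_times)
    total_time = sum(elapsed for _, elapsed in responses_with_times)
    return total_tokens / total_time if total_time > 0 else 0.0

# Synthetic example: three responses of 64 tokens, each taking 2 seconds.
sample = [({"usage": {"completion_tokens": 64}}, 2.0)] * 3
print(f"{tokens_per_second(sample):.1f} tokens/s")  # → 32.0 tokens/s
```

For sequential requests this approximates per‑request speed; once you add concurrency, measure wall‑clock time for the whole batch instead of summing per‑request times to see the batching effect.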
6. Wrapping up
Once you’ve deployed an LLM at least once, you quickly realize there’s a big gap between “getting a model to respond” and “running it as a real service.” Red Hat AI Inference Server is best thought of as the enterprise‑grade operations layer that bridges that gap.
If you’ve already been running raw vLLM yourself, this is a good time to try the “packaged vLLM plus an operations‑ready platform” experience. Today we focused on a single‑server setup, but the natural next step is to take it onto OpenShift, add scaling and monitoring, and turn it into a full‑blown AI platform.