Red Hat AI Inference Server Overview and Installation Guide (Linux)
“Anyone who’s actually tried to stand up an LLM server knows this: tokens are cheap, but infrastructure is not.” That’s exactly the gap Red Hat AI Inference Server is trying to fill. It gives you enterprise-grade operations, while still feeling, from a developer’s point of view, like a neatly packaged “vLLM-in-a-box.”
1. What is Red Hat AI Inference Server?
Red Hat AI Inference Server is an enterprise inference server designed to serve a wide range of LLM and generative models quickly and cost‑effectively across hybrid cloud environments.
It uses vLLM as its core engine and bundles it with LLM Compressor, a curated and validated model repository, and broad AI accelerator support into a single product.
- Provides a consistent inference environment across the hybrid cloud (on‑premises, public cloud, and edge).
- Acts as a unified inference layer supporting various accelerators (NVIDIA, AMD, Intel, IBM, cloud provider GPUs, and more).
- Exposes OpenAI‑compatible APIs to drastically reduce the integration effort for existing applications.
2. Key Features
2.1 High‑performance inference powered by vLLM
- vLLM core engine: Targets several‑times higher token throughput than traditional serving approaches using techniques like PagedAttention and continuous batching.
- Multi‑GPU and large context support: Handles large models and long contexts efficiently using tensor and pipeline parallelism.
- Efficient memory management: Optimizes KV cache handling to reduce GPU memory usage and increase throughput.
2.2 Enterprise‑grade packaging
- Hardened vLLM distribution: Red Hat ships tested and validated vLLM images as a supported package.
- Validated model repository: Provides optimized and verified models under the Red Hat AI organization on Hugging Face so you can serve them out of the box.
- Integrated LLM Compressor: Supports quantization and compression workflows to shrink model size and speed up inference while largely preserving accuracy.
2.3 Flexible deployment and APIs
- “Deploy anywhere”: Runs on RHEL / RHEL AI / OpenShift, and within policy limits can also be deployed on other Linux and Kubernetes platforms.
- OpenAI‑compatible HTTP API: Designed so that most existing client libraries just work, minimizing application changes.
- Enterprise operations: Built to integrate with Red Hat’s ecosystem for monitoring, logging, security, and upgrades.
3. Inference Server vs vLLM: What’s the difference?
The easiest way to think about it is the relationship between an engine (vLLM) and the car built around it (Inference Server). vLLM is the open‑source inference engine; Red Hat AI Inference Server is the commercial, operations‑ready product that ships with that engine under the hood.
3.1 Conceptual differences
| Item | Red Hat AI Inference Server | vLLM (open source) |
|---|---|---|
| Nature | Commercial enterprise product | Open‑source inference engine |
| Core engine | Bundles vLLM internally | vLLM itself |
| Purpose | Unified inference platform for hybrid cloud | High‑performance LLM serving engine |
| Support | Red Hat support, security, hardening | Community/self‑support |
3.2 Functional and operational differences
| Aspect | Red Hat AI Inference Server | vLLM (standalone) |
|---|---|---|
| Installation | Red Hat‑provided container images and docs, optimized for RHEL/OpenShift | Flexible: PyPI, Docker, source builds, etc. |
| Model repository | Curated catalog of Red Hat‑validated, optimized models | You pick and validate models yourself from Hugging Face, etc. |
| Optimization tools | Integrated with LLM Compressor, turnkey quantization/compression workflows | You must integrate separate quantization tools manually |
| Supported platforms | RHEL AI, OpenShift; other Linux in a 3rd‑party support scope | Runs almost anywhere, but operations are entirely on you |
| Security & updates | Red Hat advisories, patches, lifecycle management | You track releases and manage updates on your own |
In short, once you move beyond a “quick PoC with plain vLLM” and want a standard AI inference platform for your organization, Red Hat AI Inference Server starts to make a lot more sense.
4. Installing on Linux: Step‑by‑step
The following walkthrough shows how to spin up the Red Hat AI Inference Server container on a single Linux server with an NVIDIA GPU. It’s written with RHEL 9 in mind, but if your container runtime and GPU stack are correctly configured, similar steps will work on other compatible distributions (keeping in mind they may not be officially supported).
4.1 Prerequisites
- OS and privileges
  - RHEL 9.x family (or a compatible Linux), with a user that has sudo privileges.
- Red Hat account and subscription
  - A Red Hat account with access to `registry.redhat.io` and an active Inference Server subscription.
- GPU and drivers
  - A data center‑class NVIDIA GPU (for example, A100 or L40S) with compatible drivers installed.
  - Verify that the system recognizes the GPU using `nvidia-smi`.
- Container runtime
  - Podman (recommended by Red Hat) or Docker installed.
- Hugging Face token
  - A Hugging Face account and access token if you plan to pull private models or integrate with the HF Hub.
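Before moving on, it can help to sanity‑check the prerequisites from a script rather than by eye. The following is a minimal sketch (the `preflight` helper is hypothetical, not part of the product; swap `podman` for `docker` if that is your runtime):

```python
import shutil

def preflight(commands=("podman", "nvidia-smi")):
    """Report which prerequisite CLI tools are available on PATH."""
    return {cmd: shutil.which(cmd) is not None for cmd in commands}

if __name__ == "__main__":
    for cmd, found in preflight().items():
        print(f"{cmd}: {'found' if found else 'MISSING'}")
```

If anything reports MISSING, finish the corresponding setup step below before pulling images.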
4.2 Install Podman and the GPU stack (example)
```shell
# Install Podman (RHEL 9 example)
sudo dnf install -y podman

# After installing the NVIDIA driver, set up NVIDIA Container Toolkit
# (follow the official guide for your specific distribution and GPU)
nvidia-smi  # Check that it prints normal GPU information
```
How you install and configure the NVIDIA Container Toolkit depends on your GPU and distribution, so be sure to follow the official NVIDIA documentation for your environment.
4.3 Log in to the Red Hat registry
```shell
# Log in to the Red Hat registry
podman login registry.redhat.io
# Use your Red Hat Customer Portal credentials for username/password
```
You must have the correct subscription in place to pull the Inference Server images.
4.4 Pull the Inference Server container image
Images are split by version and accelerator type. For example, an NVIDIA CUDA‑based vLLM image looks like this:
```shell
# Example: vLLM + CUDA + RHEL 9 Inference Server image (adjust version as needed)
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
```
For other hardware (IBM Spyre, AMD ROCm, etc.), check the official Getting Started documentation for the exact image name corresponding to your accelerator.
4.5 Prepare SELinux and volumes
If SELinux is enabled, you need to allow device access and volume mounts accordingly.
```shell
# Example: create local directories for models and logs
sudo mkdir -p /opt/rhaiis/models
sudo mkdir -p /opt/rhaiis/logs
sudo chown -R $USER:$USER /opt/rhaiis

# Adjust SELinux context or policies as required by your environment.
# For quick testing, you can disable labels for the container:
# use --security-opt=label=disable when running the container.
```
4.6 Set environment variables (models, tokens, etc.)
Set environment variables so the server can pull models from the HF Hub or use local paths.
```shell
export HUGGING_FACE_HUB_TOKEN="hf_xxx"            # If needed
export RHAIIS_MODEL_ID="granite-3.3-8b-instruct"  # Example model ID
export RHAIIS_PORT=8000
```
The model ID can be one of the validated models provided by Red Hat AI or any compatible model name from Hugging Face.
4.7 Run the Inference Server container (NVIDIA GPU example)
Here’s an example of running the vLLM‑based Inference Server on a single server with NVIDIA GPUs:
```shell
podman run --rm -d \
  --name rhaiis-vllm \
  --gpus all \
  -p ${RHAIIS_PORT}:8000 \
  -e HF_TOKEN="${HUGGING_FACE_HUB_TOKEN}" \
  -e MODEL_ID="${RHAIIS_MODEL_ID}" \
  -v /opt/rhaiis/models:/models \
  -v /opt/rhaiis/logs:/logs \
  --security-opt=label=disable \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
```
- The `MODEL_ID` determines which model is loaded by default; some images may expect slightly different configuration, so always check the sample commands in the official docs.
- Depending on your hardware, you may need additional flags like `--device` or `--group-add` (for example, `--device=/dev/vfio` for IBM Spyre).
You can monitor model loading and server status via container logs:
```shell
podman logs -f rhaiis-vllm
```
5. Quick tests after installation
Inference Server typically exposes an OpenAI‑compatible HTTP API, which means you can test it easily with curl.
5.1 Health check
```shell
curl http://localhost:${RHAIIS_PORT}/health
```
- If you receive a 200 OK or a health status JSON, the server itself is up and running (the exact endpoint may vary by image/version, so double‑check the docs if needed).
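Model loading can take a while, so rather than running curl by hand in a loop, you can script the readiness check. A minimal polling sketch using only the Python standard library (the `wait_for_health` helper is hypothetical; the `/health` path and port match the examples above but may vary by image/version):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url, timeout=60, interval=2):
    """Poll a health endpoint until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False

# Example (assumes the container from section 4.7 is running):
# wait_for_health("http://localhost:8000/health", timeout=120)
```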
5.2 Chat completion test (OpenAI‑style)
```shell
curl http://localhost:${RHAIIS_PORT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"${RHAIIS_MODEL_ID}"'",
    "messages": [
      {"role": "user", "content": "Hi, which model is responding to me right now?"}
    ],
    "max_tokens": 64,
    "temperature": 0.2
  }'
```
- If everything is working correctly, the JSON response will contain a `choices[0].message.content` field with the model's reply.
- Because the API is OpenAI‑style, you can plug it into existing Python `openai`/`requests` code, JavaScript `fetch`, and most other OpenAI‑compatible clients with minimal changes.
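To make the OpenAI‑style shape concrete, here is a sketch of building the same request body and pulling the reply out of a response in plain Python (the helper names are hypothetical and the model ID is the example from section 4.6; only the standard library is used):

```python
import json

def build_chat_request(model, user_message, max_tokens=64, temperature=0.2):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def extract_reply(response):
    """Pull the assistant's text out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]

payload = build_chat_request("granite-3.3-8b-instruct", "Hi, which model is responding?")
body = json.dumps(payload)  # this is what you POST to /v1/chat/completions
```

With the official `openai` Python client, the equivalent is pointing `base_url` at `http://localhost:8000/v1` and calling `chat.completions.create`; the payload shape is the same.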
5.3 Simple performance sanity checks
- Send a short prompt multiple times in a row and roughly estimate tokens per second from the responses.
- Gradually increase concurrent requests and observe how latency and throughput change; this is where you’ll start to feel the impact of vLLM’s continuous batching.
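A rough tokens‑per‑second figure can be computed from the `usage` block that OpenAI‑style responses typically include. A sketch, assuming each response carries `usage.completion_tokens` and that you time each request yourself (the helper is hypothetical, not part of the product):

```python
def tokens_per_second(responses_with_times):
    """Estimate throughput from (response_dict, elapsed_seconds) pairs.

    Each response is expected to carry an OpenAI-style usage block
    with a completion_tokens count.
    """
    total_tokens = sum(r["usage"]["completion_tokens"] for r, _ in responses_with_times)
    total_time = sum(elapsed for _, elapsed in responses_with_times)
    return total_tokens / total_time if total_time > 0 else 0.0

# Synthetic example: three responses of 64 tokens, each taking 2 seconds.
sample = [({"usage": {"completion_tokens": 64}}, 2.0)] * 3
print(f"{tokens_per_second(sample):.1f} tokens/s")  # → 32.0 tokens/s
```

For sequential requests this approximates per‑request speed; once you add concurrency, measure wall‑clock time for the whole batch instead of summing per‑request times to see the batching effect.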
6. Wrapping up
Once you’ve deployed an LLM at least once, you quickly realize there’s a big gap between “getting a model to respond” and “running it as a real service.” Red Hat AI Inference Server is best thought of as the enterprise‑grade operations layer that bridges that gap.
If you’ve already been running raw vLLM yourself, this is a good time to try the “packaged vLLM plus an operations‑ready platform” experience. Today we focused on a single‑server setup, but the natural next step is to take it onto OpenShift, add scaling and monitoring, and turn it into a full‑blown AI platform.