How much GPU RAM is needed to load all of the BLOOM checkpoints?
How much GPU RAM would be needed to load all the checkpoints and fine-tune the model?
How much GPU RAM is needed to load all of the BLOOM checkpoints?
You need about 352 GB of GPU RAM to load the weights in bfloat16: BLOOM has roughly 176B parameters, and each bfloat16 value takes 2 bytes.
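The 352 GB figure can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope check: the helper name `weight_memory_gb` is made up for illustration, the parameter count is rounded to 176 billion, and the result covers the weights alone (no activations, KV cache, or framework overhead).

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: BLOOM's parameter count rounded to 176 billion.
BLOOM_PARAMS = 176e9

bf16_gb = weight_memory_gb(BLOOM_PARAMS, 2)  # bfloat16: 2 bytes per parameter
fp32_gb = weight_memory_gb(BLOOM_PARAMS, 4)  # float32: 4 bytes per parameter

print(f"bfloat16 weights: {bf16_gb:.0f} GB")
print(f"float32 weights:  {fp32_gb:.0f} GB")
```

Loading the same weights in float32 would double the requirement, which is why the bfloat16 number is the one usually quoted.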
How much GPU RAM would be needed to load all the checkpoints and fine-tune the model?
You never need to load all the checkpoints at once ... If you want to fine-tune, you have to take optimizer states into account. Luckily, you can try DeepSpeed ZeRO offload, which essentially moves the memory footprint elsewhere (either CPU RAM or disk). @stas has written great documentation on how to use it in Transformers: https://huggingface.co/docs/transformers/main_classes/deepspeed
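For readers who have not seen a ZeRO offload setup, here is a minimal sketch of what such a DeepSpeed configuration might look like. It is not taken from this thread: the batch-size values are placeholders, and the choice of ZeRO stage 3 with both optimizer and parameter offload to CPU is an assumption about what a setup for a model this large would need. The linked documentation remains the authoritative reference.

```python
import json

# Hypothetical ZeRO stage 3 config with CPU offload (values are placeholders).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Keep optimizer states in CPU RAM instead of GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Keep parameters in CPU RAM and fetch them onto the GPU as needed.
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

# Serialize to the JSON form that DeepSpeed reads from a config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Saved to a file, a config like this can be handed to the Transformers `Trainer` through the `deepspeed` argument of `TrainingArguments`. Offloading to disk instead of CPU RAM uses `"device": "nvme"` together with an `nvme_path` entry.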
So what is the smallest number of A100 80 GB GPUs I need if I use DeepSpeed ZeRO offload?
The very minimum is probably going to be 1 A100. It's going to be very slow, but it's going to run. Offloading just means that it uses CPU memory or disk space as additional memory so that you don't run out of GPU memory.
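To get a feel for why a single A100 would be so slow, a rough estimate of how much training state has to live outside the GPU may help. The sketch below assumes the common mixed-precision Adam accounting of about 16 bytes per parameter (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments). That figure is an assumption, not something stated in this thread, and activations and buffers are ignored entirely. The function name `offload_budget_gb` is made up for illustration.

```python
import math


def offload_budget_gb(num_params, bytes_per_param, gpu_mem_gb, num_gpus):
    """Return (total_state_gb, offloaded_gb) under a crude upper-bound model.

    Assumes the whole GPU memory could hold training state, which is
    optimistic: activations and buffers also need room in practice.
    """
    total_gb = num_params * bytes_per_param / 1e9
    on_gpu_gb = min(total_gb, gpu_mem_gb * num_gpus)
    return total_gb, total_gb - on_gpu_gb


BLOOM_PARAMS = 176e9          # assumption: rounded parameter count
ADAM_BYTES_PER_PARAM = 16     # assumption: mixed-precision Adam accounting

total_gb, offloaded_gb = offload_budget_gb(BLOOM_PARAMS, ADAM_BYTES_PER_PARAM, 80, 1)
print(f"training state: {total_gb:.0f} GB, offloaded with 1 GPU: {offloaded_gb:.0f} GB")

# GPUs needed just to hold the bf16 weights on-device, with no offload at all.
gpus_for_weights = math.ceil(BLOOM_PARAMS * 2 / 1e9 / 80)
print(f"80 GB GPUs for bf16 weights alone: {gpus_for_weights}")
```

Under these assumptions almost all of the state sits in CPU RAM or on disk when only one GPU is available, so every step is dominated by transfers between host and device.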