DeepSpeed for Windows! + accelerate backends & other advanced Stable Diffusion optimizations

UPDATED V1.1

🌟DEEPSPEED

📦dependencies

Install the Microsoft Visual C++ Build Tools. You can download them from here: https://visualstudio.microsoft.com/visual-cpp-build-tools/

Install the NVIDIA CUDA Toolkit. You can download it from here: https://developer.nvidia.com/cuda-downloads

Install the NVIDIA cuDNN library. You can download it from here: https://developer.nvidia.com/rdp/cudnn-download

Install Git for Windows. You can download it from here: https://git-scm.com/download/win
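
💭(optional) A quick check that the toolchain is on your PATH before building — these are standard commands, nothing DeepSpeed-specific:

nvcc --version

git --version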

🌟Building & Installing DeepSpeed for Windows

pip install --upgrade pip setuptools wheel

⚙Building Deepspeed

💭Open a **CMD** prompt as administrator

cd %PYTHONPATH%/lib/site-packages/

💭Here we're cloning the DeepSpeed source into the site-packages directory, but under a different folder name than the actual deepspeed package so the two don't collide

git clone https://github.com/microsoft/deepspeed.git "%PYTHONPATH%/lib/site-packages/deepspeedrepo"

cd deepspeedrepo

pip install -r requirements\requirements.txt -r requirements\requirements-1bit-mpi.txt -r requirements\requirements-autotuning.txt -r requirements\requirements-autotuning-ml.txt -r requirements\requirements-dev.txt -r requirements\requirements-inf.txt -r requirements\requirements-readthedocs.txt -r requirements\requirements-sd.txt

💭initializing git before the build helps avoid WinError 5 (permissions)

git init

💭The next command will create a deepspeed folder in the build\lib subdirectory of your deepspeedrepo/working directory

set DS_BUILD_AIO=0&set DS_BUILD_SPARSE_ATTN=0&set DS_BUILD_TRANSFORMER_INFERENCE=0&python setup.py bdist_wheel

💭AIO (async I/O) is Linux-only, sparse attention requires AIO, and the transformer inference op had compile problems for me

💭If you get any errors about triton: Torch 2 ships its own internal triton, so you should never have it installed as a separate package

(as needed) pip uninstall triton & retry your build

⚙Migrating (installing) the DeepSpeed wheel (built file)

💭>When you use the "pip install" command to install a .whl file, the package will be installed to the default location for Python packages, which is typically the "site-packages" directory in your Python installation. On Windows, the default location for the "site-packages" directory is "%python_path%\Lib\site-packages", where "%python_path%" is the path to your Python installation. So, if you install a .whl file using pip on Windows, the package will be installed to this directory by default.

💭Let's install the wheel (.whl) for DeepSpeed that has been created in .\dist\

cd dist

💭optionally, activate your Venv

💭Install the .whl file without typing its exact name: the CMD for-loop below picks up every .whl file in the directory and passes each one to pip for installation.

for %i in (*.whl) do pip install %i
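
💭To confirm the install worked, import the package and run DeepSpeed's environment report (ds_report ships with the package and lists which ops your setup can build):

python -c "import deepspeed; print(deepspeed.__version__)"

ds_report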

⚙Accelerate configuration

💭These are the answers to give to the accelerate config prompts; this is how you activate DeepSpeed and optimize PyTorch 2. Pressing [enter] accepts the default.

Torch Dynamo optimizations: Yes

Use DeepSpeed: Yes

Configure torch compiler: Yes

Set precision: BF16 (mixed) or even FP8! (I use this)
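
💭For reference, the command sequence around those prompts (your_training_script.py is just a placeholder for whatever script you normally run):

accelerate config

accelerate env

accelerate launch your_training_script.py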

💭other compiler notes

The torch.compile function in PyTorch 2 (which accelerate can enable for you) allows you to optimize your PyTorch model by compiling it to a lower-level representation that can be executed more efficiently on your hardware. Here are some options you can customize to potentially improve the performance of your stable diffusion model:

input_shapes: This option allows you to specify the shapes of the input tensors that your model expects. By providing accurate shapes, you can help the compiler generate more efficient code. For example, if your model expects a certain input shape most of the time, you can specify that shape to help the compiler generate specialized code for that case.

operators: This option allows you to specify which PyTorch operators should be compiled. By default, the compiler will compile all operators that are used in your model, but you can exclude some of them if you know they are not performance-critical. On the other hand, you can also include additional operators that you know will benefit from compilation.

optimize_for: This option allows you to specify the target device architecture that you want to optimize for. By default, the compiler will try to generate code that is efficient on a wide range of devices, but you can specify a more specific target to get better performance. For example, if you know that your model will only run on a certain type of GPU, you can specify that GPU's architecture to get more efficient code.

precision: This option allows you to specify the precision of the computations in your model. By default, the compiler will use the same precision as your PyTorch model, but you can specify a lower precision (e.g., float16) to get faster computations with some loss of accuracy.

optimize_graph: This option allows you to optimize the computation graph of your model before compilation. By default, the compiler will optimize the graph itself, but you can perform additional optimizations beforehand to reduce the amount of work the compiler has to do.

training: This option allows you to specify whether your model will be used for training or inference. By default, the compiler will assume your model is used for inference, but you can specify that it's used for training to get more efficient code for that use case.

Keep in mind that the optimal settings for torch.compile will depend on your specific model and hardware, so it's recommended to experiment with different settings and measure the performance to find the best configuration for your use case.
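
💭For reference, a minimal sketch of calling torch.compile directly. Only the backend and mode arguments are shown here; the exact set of options accepted depends on your PyTorch 2.x version, and the tiny model below is just a stand-in for your real UNet/pipeline component:

import torch
import torch.nn as nn

# toy stand-in model; swap in your actual model
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).cuda()

# compile with an explicit backend and autotuning mode
compiled = torch.compile(model, backend="inductor", mode="max-autotune")

x = torch.randn(8, 64, device="cuda")
with torch.no_grad():
    out = compiled(x)  # first call triggers compilation, later calls are fast
print(out.shape)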

💭Choose a (faster) backend too! Other than inductor (the default), these require installation, but they're worth it. NVFuser and fx2trt are specifically good for NVIDIA users (see the snippet after this list to check which backends your install actually exposes).

**nvfuser** - nvFuser with TorchScript

**aot_nvfuser** - nvFuser with AotAutograd

**aot_cudagraphs** - cudagraphs with AotAutograd

*Inference-only backends*:

**ofi** - Uses TorchScript optimize_for_inference

**fx2trt** - Uses Nvidia TensorRT for inference optimizations

**onnxrt** - Uses ONNXRT for inference on CPU/GPU

**ipex** - Uses IPEX for inference on CPU
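
💭You can list which backends your install registers before picking one (the available names vary between PyTorch 2.x releases):

python

>>> import torch._dynamo as dynamo

>>> dynamo.list_backends()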

🌟OPTIMIZATIONS

⚙CUDNN CHECK AND BENCHMARK MODE

💭open cmd or shell

python

>>> import torch

💭is my cuda working?

>>> torch.cuda.is_available()

💭how about cudnn?

>>> torch.backends.cudnn.version()

💭optimize it

>>> import torch.backends.cudnn as cudnn

>>> cudnn.benchmark = True
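
💭In a real run you'd set this once at the top of your launch script, before any convolutions execute, rather than in the REPL. A minimal sketch:

import torch
import torch.backends.cudnn as cudnn

# benchmark mode autotunes conv algorithms; it helps most when input sizes stay fixed
cudnn.benchmark = True

print(torch.cuda.is_available(), cudnn.is_available(), cudnn.version())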

📦More packages

💭make NumPy-style code faster on the GPU (CuPy)

https://docs.cupy.dev/en/stable/install.html

💭replace 11x with 12x below if you're on CUDA 12

pip install cupy-cuda11x

python -m cupyx.tools.install_library --cuda 11.x --library cutensor
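
💭A toy check that CuPy sees your GPU (not part of the install, just a sanity test):

import cupy as cp

# NumPy-like API, but the arrays live on the GPU
x = cp.arange(1_000_000, dtype=cp.float32)
y = cp.sqrt(x).sum()
print(float(y))  # copies the scalar result back to the host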

💭add PyTorch Lightning, a high-level library for organized, fast training code

pip install pytorch-lightning

💭let's get the latest TensorFlow/TensorBoard nightlies and clean out their old counterparts

pip uninstall tensorflow tensorboard

pip install tf-nightly tb-nightly
