Where Should HedgehogLab Go: HHLAB Technology and Source Code Analysis


It is well known that HHLAB is a groundbreaking online data processing and computation framework created by Teacher Lidang.

Background of HHLAB's Birth#

We all know that installing environments, configuring environments, setting up servers, and installing drivers is painful work. What we really want is some kind of SaaS that provides an out-of-the-box data processing experience, especially since the programming skills of most data-processing people are not much better than those of ordinary users; their real strengths are data-processing algorithms and mathematics.

Thus, batch after batch of data processing frameworks or programming languages have been developed.

Currently, the mainstream ones on the market include:

  • python + tensorflow/torch + pandas + numpy
  • julia
  • MATLAB/Octave
  • Wolfram Mathematica
  • FORTRAN
  • R + everything

The original intention behind HHLAB was to build a data processing platform on the web, where opening a webpage is all you need to perform full-pipeline data processing (from development to result output). This idea is very good.

Software that can match HHLAB's functionality with some configuration#

First, it must be acknowledged that many of HHLAB's competitors can, with a bit of configuration, achieve everything HHLAB currently does:

  • First, developers rent a cloud server, optionally with a GPU.
  • Then they set up a Jupyter Server on it.
  • Then they install the software they need.
  • Then they log in from a browser.

All of the software above can deliver this experience, but with HHLAB none of the cumbersome configuration is needed; you just open it and use it.

HHLAB's True Competitors#

Services that we can enjoy just by paying, or even without paying:

The first is from the Python ecosystem, and the second is the official MMA offering. I will compare these products against the features HHLAB promotes and implements to see where the differences lie.

User-Friendliness#

Programming Languages#

Both Python and JS are simple, easy-to-learn languages, but Python is more flexible than JS. Through metaprogramming, decorators, and pervasive object customization, Python supports DSL-level development with a user experience comparable to a purpose-built language.

import numpy as np

A = np.array([1.,2.,3.,4.,5.])
B = np.ones(5)

print(A + B)

// import math.js
const A = [1,2,3,4,5]
const B = math.ones(1,5)

console.log(math.add(A,B))

From this example, it is clear that JS's flexibility at the language level cannot compete with Python's.

However, our HHLAB can achieve similar behavior to Python.

Teacher Lidang added operator overloading (which I would rather call hard-wired built-in operators) for numerical operations by writing a Babel plugin:

// https://github.com/Hedgehog-Computing/hedgehog-lab/blob/dev/packages/hedgehog-core/src/transpiler/operator-overload.ts

import template from 'babel-template';
import * as types from '@babel/types';

function invokedTemplate(op: any) {
  return template(`
    (function (LEFT_ARG, RIGHT_ARG) { 
      if (LEFT_ARG !== null && LEFT_ARG !== undefined
        && LEFT_ARG[Symbol.for("${op}")])
        return LEFT_ARG[Symbol.for("${op}")](RIGHT_ARG);
      else if (RIGHT_ARG instanceof Sym)
        return (sym(LEFT_ARG)[Symbol.for("${op}")](RIGHT_ARG));
      else if (Array.isArray(LEFT_ARG) && (RIGHT_ARG instanceof Mat))
        return (mat(LEFT_ARG)[Symbol.for("${op}")](RIGHT_ARG));
      else if (Array.isArray(LEFT_ARG) && (Array.isArray(RIGHT_ARG)))
        return (mat(LEFT_ARG)[Symbol.for("${op}")](mat(RIGHT_ARG)));
      else if (  (!isNaN(LEFT_ARG)) && (RIGHT_ARG instanceof Mat))
        return (scalar(LEFT_ARG)[Symbol.for("${op}")](RIGHT_ARG));
      else if (  (!isNaN(LEFT_ARG)) && (Array.isArray(RIGHT_ARG)))
        return (scalar(LEFT_ARG)[Symbol.for("${op}")](mat(RIGHT_ARG)));
      else if (  Array.isArray(LEFT_ARG) && (!isNaN(RIGHT_ARG)) )
        return (mat(LEFT_ARG)[Symbol.for("${op}")]((RIGHT_ARG)));

      else
        return LEFT_ARG ${op} RIGHT_ARG;
    })
  `);
}

Python implements operator overloading through its OOP protocol (the dunder methods such as __add__), so we won't elaborate further. Let's first discuss what problems the Babel approach above might bring:

  1. Special handling is needed when either operand is null or undefined.
  2. When an operator hits an error, the resulting compilation error is not user-friendly.
  3. Teacher Lidang has to enumerate every type combination on both sides manually, because no dispatch mechanism is leveraged.

Everyone makes mistakes, so the reliability of this hand-patched scheme for JS operators is questionable.
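For contrast, here is a minimal sketch (my own illustration, not HHLAB or NumPy source) of the OOP-style dispatch Python uses: each operand type declares how it combines with others, and the language runtime, rather than the library author, walks through the candidates.

# Minimal sketch of Python-style operator dispatch (illustration only).
class Mat:
    def __init__(self, val):
        self.val = val

    # Python calls this for `mat + other`, based on the left operand's type.
    def __add__(self, other):
        if isinstance(other, Mat):
            return Mat([a + b for a, b in zip(self.val, other.val)])
        if isinstance(other, (int, float)):  # scalar broadcast
            return Mat([a + other for a in self.val])
        return NotImplemented  # tells Python to try the other operand

    # Python falls back to this for `scalar + mat`.
    def __radd__(self, other):
        return self.__add__(other)

print((Mat([1, 2, 3]) + 1).val)  # [2, 3, 4]
print((1 + Mat([1, 2, 3])).val)  # [2, 3, 4], dispatched via __radd__

Note how the null/undefined case and the type enumeration from the Babel template simply do not arise: unsupported combinations fail with a standard TypeError from the language itself.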

Now let's talk about MMA. MMA has a built-in symbolic computation engine, so it can easily handle infix expressions and define syntactic sugar:

g /: f[g[x_]] := fg[x]

(*When you input*)

{f[g[2]], f[h[2]]}

(*You will get*)

{fg[2], f[h[2]]}

This is clearly much more sophisticated than both Python and JS. The symbolic computation engine can handle many non-numerical things, but HHLAB should not be targeting this user base.

Then we can mention the performance aspect of programming languages:

(benchmark chart: performance comparison across these languages)

In this regard, Julia undoubtedly takes the crown due to its excellent JIT design and very useful CUDA interoperability.
However, Julia is not a competitor of HHLAB, so we will not discuss it.

We can see that even SciPy, optimized in the Python world for many years, still cannot match commercial MATLAB and MMA in performance.
Therefore, I do not think HHLAB, which does not even use CUDA for heterogeneous acceleration, has any claim to a place on this list.

Debugging#

Python combined with Jupyter can execute code cell by cell, and when you hit an error you cannot untangle, you can set breakpoints to investigate.
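A minimal sketch of that workflow (plain standard-library debugging; in Jupyter the same effect comes from the built-in breakpoint() or the %debug magic):

def normalize(xs):
    total = sum(xs)
    # Drop into the interactive debugger just before the suspicious division:
    # inspect `total`, step line by line, then continue.
    breakpoint()
    return [x / total for x in xs]

normalize([1, 2, 3])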

MMA provides several ways to debug, but since MMA is not a procedural language, direct comparison is not easy:

  • With[{x = value}, expr]: compute expr with x replaced by value
  • Echo[expr]: display and return the value of expr
  • Monitor[expr, obj]: continually display obj during a computation
  • Sow[expr]: sow the value of expr for subsequent reaping
  • Reap[expr]: collect values sowed while expr is being computed

So far, I have not seen HHLAB produce debugging tools, nor can it run line by line.

User Functionality#

CPU Vectorization#

NumPy uses a lot of SIMD code to ensure that user data is fully vectorized on the CPU.
I won't go into MMA... first, it is closed-source, and second, its high-level language compiles according to CPU features, again with plenty of vectorization.
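To make "fully vectorized" concrete, a small sketch: both computations below produce the same sum of squares, but the second hands the entire loop to NumPy's compiled (SIMD-capable) kernel instead of the Python interpreter.

import numpy as np

xs = np.random.rand(1_000_000)

# Interpreter loop: one Python-level multiply-add per element.
total = 0.0
for x in xs:
    total += x * x

# Vectorized: a single call into NumPy's compiled, SIMD-friendly kernel.
total_vec = np.dot(xs, xs)

assert np.isclose(total, total_vec)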

Lidang's attitude is:

(screenshot of Lidang's reply)

GPU Computing#

HHLAB uses GPU as one of its primary development backends, which is also its proudest point.

First, the workloads a GPU can accelerate are very dense: lots of numerical computation with few branches, whereas the CPU is good at logic-heavy, branchy work.

This is done through GPU.js. The library contains a lot of code, so I won't quote it all; the project's own description is below:

https://github.com/gpujs/gpu.js/wiki/Quick-Concepts
1. transpiling javascript for use on the GPU
   1. read javascript to a common format, in this case a mozilla abstract syntax tree
   2. type inference from any value or derivation of any value from the parsed javascript
   3. translate from the mozilla abstract syntax tree to a string value of a language understood by the GPU, generally GLSL (a C++ subset), but likely more to come
   4. adding required utility functions and environment corrections to said translated string
   5. compiling the entire translated string, now likely in a subset of C++
2. uploading values (arguments or constants) needed to calculate the result of a kernel
3. calculating said kernel output
4. downloading value from kernel output (this step can be skipped by using the kernel setting pipeline: true)
   - this is generally regarded as the most time-consuming part of calculating values from a GPU
   - If you find yourself here, please ask yourself:
     1. "Are the values I need, really needed?"
     2. "Can I offset the values I need, to the GPU, and or return less often the values I think I need from the GPU?"

So GPU.js runs this pipeline through WebGL shaders (WebGL has no true compute shaders, so computation is expressed as fragment shaders over textures):

  • When WebGL2 is available, it is used.
  • When WebGL2 is not available, WebGL1 will be used.
  • When WebGL1 is not available, CPU will be used.

I have studied graphics APIs: WebGL is based on OpenGL ES (2.0 for WebGL1, 3.0 for WebGL2) and offers no real compute pipeline, so compared with mainstream OpenCL and CUDA it has many functional deficiencies (after all, it is meant for rendering).
Moreover, the browser's encapsulation of the low-level API and the transcoding of objects passed to and from JS add further performance loss.

Here are some benchmark data that may help you understand WebGL's computing performance.
This set of benchmarks was conducted by the TVM team, and the data aligns with my experience.

(TVM team benchmark: OpenGL vs WebGL compute performance)

Thus, we see that WebGL's performance is far below that of OpenGL.

Currently, there is an experimental API called WebGPU, which is a lower-level interface that looks very much like Vulkan and can provide better multi-core performance. Below is the WebGPU benchmark:

(benchmark chart: WebGPU performance)

So we can see that WebGL is merely a step from zero to one.

Additionally, this technology stack has a serious drawback: when the JS code is too complex, it may not be translatable to GLSL, so complex kernel functions cannot be offloaded to the GPU for computation.
Moreover, writing GLSL code is much more painful than writing CUDA.
So I am curious why Lidang did not choose the TVM framework, which already supports WebGPU.

Another issue: the tensor cores Nvidia has poured so much effort into are completely wasted here.

(slide: Nvidia's tensor cores)

Since Lidang's target users may not care much about precision, tensor cores would have been an excellent direction for optimizing the framework.
With the WebGL stack, however, no improvement in this area is possible.

Now let's discuss how the other two platforms handle these issues.

Python: Collective Wisdom, Multiple Backends#

Python can use TensorFlow and Torch for AI, and NumPy for numerical computation.
However, NumPy still runs on the CPU, so there is a newer project called cupy that supports GPU computation.
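A sketch of how close cupy stays to the NumPy API (this assumes a CUDA-capable GPU and an installed cupy):

import numpy as np
import cupy as cp

a_cpu = np.random.rand(2048, 2048)

a_gpu = cp.asarray(a_cpu)   # host -> device copy
b_gpu = a_gpu @ a_gpu       # the matmul runs on the GPU (cuBLAS underneath)
b_cpu = cp.asnumpy(b_gpu)   # device -> host copy, only when the result is needed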

MMA: Official Support#

MMA uses a very high-level language that runs directly on CUDA or OpenCL, so... no need to elaborate.

TargetDevice->"GPU"

Alright, you are now using the graphics card.

But is the graphics card really as useful as you imagine?#

When high precision is required, the graphics card may even compute slower than your CPU.

(chart: matrix multiplication numerical accuracy and performance at high precision, CUDA)

Moreover, many real-world numerical computations are either low-bandwidth but logically complex (various state machines, for example), or involve delicate precision requirements, sometimes even demanding custom numerical formats. In such cases the graphics card does not help and may only complicate matters.
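A small sketch of the precision point, CPU-only for simplicity: float32 (the format most GPU pipelines are fastest at) already loses digits relative to float64 on a plain dot product, and that gap is exactly what precision-sensitive work cannot tolerate.

import numpy as np

x = np.random.rand(10_000_000)

ref = np.dot(x.astype(np.float64), x.astype(np.float64))  # float64 reference
f32 = np.dot(x.astype(np.float32), x.astype(np.float32))  # float32 result

print(abs(ref - f32) / ref)  # relative error introduced by float32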

Matrix Operations#

Let's first take a look at the matrix algorithm code in HHLAB.

Source code for the snippets below#

Matrix multiplication

function multiply(leftMat: Mat, rightMat: Mat): Mat {
  if (leftMat.cols !== rightMat.rows)
    throw new Error('Dimension does not match for operation:muitiply');

  if (leftMat.mode === 'gpu' || rightMat.mode === 'gpu') return multiply_gpu(leftMat, rightMat);

  const m = leftMat.rows,
    n = leftMat.cols,
    p = rightMat.cols;
  const returnMatrix = new Mat().zeros(m, p);
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < p; j++) {
      let val = 0;
      for (let it = 0; it < n; it++) val += leftMat.val[i][it] * rightMat.val[it][j];
      returnMatrix.val[i][j] = val;
    }
  }
  return returnMatrix;
}

Element-wise (dot) multiplication

function dotMultiplyInPlace(leftMat: Mat, rightMat: Mat): Mat {
  if (leftMat.rows !== rightMat.rows || leftMat.cols !== rightMat.cols)
    throw new Error('Dimension does not match for operation:dot muitiply');
  for (let i = 0; i < leftMat.rows; i++) {
    for (let j = 0; j < leftMat.cols; j++) {
      leftMat.val[i][j] *= rightMat.val[i][j];
    }
  }
  return leftMat;
}

Matrix inverse and integer powers (the ^ operator)

function (rightOperand: number): Mat {
    if (this.rows !== this.cols) throw new Error('This matrix does not support ^ operator');
    //if right operand is -1, return the inverse matrix
    if (rightOperand === -1) {
      // matrix inverse with mathjs
      return new Mat(mathjs.inv(this.val));
    }

    if (!Number.isInteger(rightOperand) || rightOperand < 1)
      throw new Error('This right operand does not support ^ operator');

    const returnMatrix = this.clone();
    for (let i = 2; i <= rightOperand; i++) {
      multiplyInPlace(returnMatrix, this);
    }

    return returnMatrix;
  }

Emmm, there isn't even an eigenvalue routine... never mind, it is what it is; this code has a very leetcode flavor.
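For contrast, eigenvalues are a one-liner in the ecosystems HHLAB is competing with, e.g. via NumPy's LAPACK bindings:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # LAPACK under the hood
print(eigenvalues)  # 3 and 1 (ordering may vary)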

At least you could have split the loops in half or in quarters across cores... oh sorry, JS doesn't do multi-core, my bad.

We know that actual matrix operations come in two different situations.
Accelerating dense operations is brute force: more cores and more bandwidth simply mean faster computation.
The problems arise with sparse matrices (which Lidang did not anticipate, and which are the most common case in numerical computation, e.g. when regularizing or filtering noise).
There are several possible layouts (a short CSR sketch follows this list):

  • COO: Coordinate (this records positions, but is not commonly used anymore, it's too old)
  • CSR: Compressed Sparse Row (this compresses rows when they are relatively dispersed)
  • CSC: Compressed Sparse Column (the same, but for columns)
  • BCSR: Blocked Compressed Sparse Row (compressing blocks, anyone who has studied linear algebra should immediately recognize this algorithm)

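To make these layouts concrete, a quick sketch with scipy: CSR keeps only the non-zero values plus per-row offsets, which is where the memory and bandwidth savings come from.

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[5, 0, 0, 0],
                  [0, 8, 0, 0],
                  [0, 0, 3, 0],
                  [0, 6, 0, 0]])

sparse = csr_matrix(dense)
print(sparse.data)     # [5 8 3 6]    non-zero values only
print(sparse.indices)  # [0 1 2 1]    column index of each stored value
print(sparse.indptr)   # [0 1 2 3 4]  where each row starts inside `data`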
Mainstream scientific computing libraries now provide an auto-tuning mechanism that profiles the matrix before choosing how to compute.
So this might be a knowledge blind spot for Lidang.

Ease of Use for Library Developers#

If a tool is advanced but cannot be explained to developers, developers cannot build with it, and no ecosystem gets constructed.

Package Management Mechanism#

What is wrong with Lidang's idea of doing package management by copying links around and having developers fork branches?

Suppose we have a package A, a package B, and a package C.
B depends on A, and C depends on B.
B discovers that A is broken, so B forks A, writes a patch, and swaps its fork into every place where B uses A.
C then discovers that A is broken underneath B, but C has no way of knowing that B already fixed it, so C forks B, and forks A as well...

So now we understand the importance of package management, and I believe Lidang can recognize this point.

Python has two package management systems:

  • pip
  • anaconda

Both are very useful: pip has more packages, while conda handles scientific computing dependencies better and also provides environment isolation. I won't elaborate; everyone gets the general idea.

Platform Services#

(screenshot: Wolfram Data Drop, a universal data accumulator)

I wonder how Lidang plans to compete here...

Where Should HHLAB Go from Here?#

GitHub: bboczeng/why-you-do-not-need-hedgehog-lab

HHLAB may perfectly solve needs such as decorating homepages and assigning homework to college students.

Thus, HHLAB is a very excellent framework, and we look forward to it changing our scientific computing methods in the future, allowing us to run scientific computing code for free anytime and anywhere.
