Channel: FPGA Developer

How to accelerate a Python function with PYNQ


This video demonstrates how you would typically go about accelerating a Python function or algorithm on the Zynq-7000 with PYNQ. The function I chose to base this video on is the Finite Impulse Response (FIR) filter, because the SciPy package contains the lfilter function which can be used for this purpose, and because the Xilinx IP catalog has a free FIR filter IP core. If you instead wanted to implement the accelerator in HLS, the process would be very similar; you would just have to design your accelerator with AXI-Streaming interfaces and ensure that the TLAST signals were properly managed.

You can follow through the tutorial by copying and pasting the code from the Jupyter notebook below, or you can download the notebook and copy it to your PYNQ-Z1 board. I suggest that you generate the PYNQ overlay yourself by following the steps in the video, but I have also left a link to the files here.

The filter coefficients were generated using this free online FIR filter design calculator. I used a passband of 0-5MHz and a stopband of 10MHz to 50MHz. When you enter the coefficients into Vivado, the easiest way is to just copy them out of the Jupyter notebook below at the line “coeffs = [-255,-260,-312,…]” (only copy the numbers!).
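The software baseline that the FIR IP core replaces boils down to a single lfilter call. Here is a minimal sketch; note that the coefficients below are illustrative placeholders, not the ones produced by the design calculator:

```python
import numpy as np
from scipy import signal

# Illustrative low-pass coefficients -- NOT the calculator's output,
# just a short symmetric FIR for demonstration
coeffs = np.array([-2.0, 0.0, 9.0, 16.0, 9.0, 0.0, -2.0])

# Test input: a 5 MHz tone (passband) plus a 30 MHz tone (stopband),
# sampled at 100 MHz
fs = 100e6
n = np.arange(1000)
x = np.sin(2 * np.pi * 5e6 * n / fs) + np.sin(2 * np.pi * 30e6 * n / fs)

# This lfilter call is the software step that the FIR filter IP core accelerates
y = signal.lfilter(coeffs, 1.0, x)
```

With the overlay loaded, the notebook replaces this call with a DMA transfer to the FIR IP and a transfer back of the filtered samples.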


PYNQ Computer Vision demo: 2D filter and dilate


See what the PYNQ-Z1 and the PYNQ Computer Vision overlay are capable of doing with a 720p standard HD video stream. In the video we run a 2D filter and dilate function on the incoming video, first using the Python OpenCV functions (i.e. without hardware acceleration), then we test again with the accelerator IPs running on the FPGA. Without acceleration, we get a frame rate of 5 frames per second, and at that frame rate the flicker is obvious. When we switch to the hardware-accelerated functions, we get 60 frames per second and a very smooth video output.
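For reference, here is a NumPy-only sketch of what the two functions compute (the notebook itself uses OpenCV's cv2.filter2D and cv2.dilate; this naive version just shows the per-pixel work that makes the pure-software path slow at 720p):

```python
import numpy as np

def filter2d(img, kernel):
    """Naive 2D correlation with zero padding (what filter2D computes)."""
    kh, kw = kernel.shape
    pad = np.pad(img.astype(float), ((kh // 2,) * 2, (kw // 2,) * 2))
    out = np.empty(img.shape)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * kernel)
    return out

def dilate(img, k=3):
    """Grayscale dilation: each pixel becomes the max of its k x k neighbourhood."""
    pad = np.pad(img, k // 2, mode='edge')
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = pad[i:i + k, j:j + k].max()
    return out
```

At 1280x720 that inner loop runs close to a million times per frame, which is exactly the kind of regular, per-pixel work that streams very naturally through FPGA fabric.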

Below is the Jupyter notebook that I used for the demo:

If you’re wondering, yes it’s a video of Canadian landmarks (my adopted country), so I’m sure you’ll enjoy the video even if you have no interest in computer vision!

In the next video I’ll see what frame rates we can get while running at 1080p (full HD) resolution. According to the PYNQ docs, the PYNQ-Z1 cannot meet the official HDMI spec for 1080p, but we’ll try it anyway and see how it goes.
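The jump in bandwidth from 720p to 1080p is easy to quantify from the standard HDMI timings: 1080p60 needs a 148.5 MHz pixel clock, twice the 74.25 MHz of 720p60. A quick sanity check:

```python
# 1080p60: 1920x1080 active pixels inside a 2200x1125 total raster
# (including blanking), refreshed 60 times per second
pixel_clock_1080p60 = 2200 * 1125 * 60   # Hz

# 720p60: 1280x720 active pixels inside a 1650x750 total raster
pixel_clock_720p60 = 1650 * 750 * 60     # Hz

print(pixel_clock_1080p60 / 1e6)  # 148.5
print(pixel_clock_720p60 / 1e6)   # 74.25
```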

List of PYNQ projects and ports


PYNQ enables huge productivity gains by making it possible to program the Zynq-7000 SoC with a high-level programming language (Python) and leverage the power of FPGA hardware acceleration with ease. Xilinx first designed PYNQ to target the PYNQ-Z1 board but it wasn’t long before others saw the potential of running PYNQ on other platforms. This post is a list of open-sourced PYNQ projects and ports that run on other platforms. I’ll keep the list up-to-date but if you know of an open-sourced PYNQ project or port that I haven’t found yet, please let me know and I’ll add it to the list.

  • PYNQ-Z1
  • MicroZed 7010 and 7020
  • Zybo


Setting up the PYNQ-Z1 for the Intel Movidius Neural Compute Stick


The Intel Movidius Neural Compute Stick (NCS) is a neural network computation engine in a USB stick form factor. It's based on the Myriad-2 chip, referred to by Movidius as a VPU or Visual Processing Unit: essentially a processor that was specifically designed to accelerate neural network computations, with relatively low power requirements. The NCS is a great match for single board computers like the Raspberry Pi, the BeagleBone and especially the PYNQ-Z1. These boards can all run software-based neural networks, but not very quickly, so their potential in fast-moving applications is limited. When paired with a Movidius NCS, they can achieve tremendously better inference times by offloading all of the heavy neural network computation to the Myriad chip.

So why do I think that the NCS is an especially good fit for the PYNQ-Z1? These days AI and neural networks are finding new uses in many applications, but one of the biggest and fastest growing ones is computer vision. The PYNQ-Z1 is one of the best platforms available today for developing embedded vision applications and here's why: it's got both an HDMI input and an HDMI output, and it's got FPGA fabric that you can use to hardware accelerate image processing algorithms (among other things). So I figured that it would be a good idea to try matching these up and seeing what kind of interesting things I could demo.

One other thing before we get started… you might ask yourself: why wouldn’t you accelerate the neural network on the FPGA fabric of the Zynq? Well that would be the ideal way to do it on the PYNQ-Z1 board, and by the way Xilinx has already done it (see the QNN and BNN projects). Unfortunately, neural networks are very resource heavy and the PYNQ-Z1 has one of the lower-cost Zynq devices on it, with FPGA resources that are probably a bit limited for neural networks (that’s a very generalized statement but it obviously depends on what you want the network to do). In my opinion, a better use of the FPGA resources of the PYNQ-Z1 would be for image processing to support an externally implemented neural network, such as the NCS.

Set up the hardware

Firstly, you want to connect the Movidius NCS to the PYNQ-Z1 board, but to do this, you’re going to need a powered USB hub. The key word is “powered”, because you’ll find that the PYNQ-Z1 alone can’t quite supply enough current to the NCS. I tried it first without a USB hub, and I found that the PYNQ-Z1 wouldn’t even boot up. Yes, the image at the top of this post is misleading but it’s simpler and it got you interested, didn’t it?

Set up the SD card of the PYNQ-Z1

You’ll need to install a lot of Linux and Python packages onto your PYNQ-Z1, so I suggest that you use a separate SD card for your PYNQ-NCS projects. So with a brand new SD card, follow my previous video tutorial to write the precompiled PYNQ image to it using Win32DiskImager.

Install the dependencies

Power up the PYNQ-Z1, then when the LEDs flash, open up a web browser to access Jupyter (http://pynq:9090). Log in to Jupyter using the password "xilinx" and then select New->Terminal from the drop-down menu on the right hand side of the screen. From this Linux terminal, you will then install the dependencies using the following commands. Note that you should already be logged in as root, so you shouldn't need to use "sudo" with these commands:

apt-get install -y libprotobuf-dev libleveldb-dev libsnappy-dev
apt-get install -y  libopencv-dev libhdf5-serial-dev 
apt-get install -y protobuf-compiler byacc libgflags-dev 
apt-get install -y libgoogle-glog-dev liblmdb-dev libxslt-dev

To save you time, these commands only install the packages that are not already built into the precompiled PYNQ-Z1 image. If you’re not starting from the standard PYNQ-Z1 image, then you might be better off installing all of the dependencies as shown in the Movidius guide for the Raspberry Pi (they are the same for the PYNQ-Z1).

Install the NC SDK (in API-only mode)

The NC SDK contains a Toolkit and API. The Toolkit is used for profiling, compiling and tuning neural networks, while the API is used to connect applications with the NCS. To keep the installation light on the PYNQ-Z1, we only want to install the API. We'd normally install the Toolkit on a development PC, although I won't go into those details in this post. The terminal in Jupyter normally leaves you in the /home/xilinx directory. You can run the following commands from that directory to download the NC SDK and NC App Zoo.

mkdir workspace
cd workspace
git clone https://github.com/movidius/ncsdk
git clone https://github.com/movidius/ncappzoo

Now we move into the API source directory.

cd ncsdk/api/src

In this directory there is a Makefile for compiling and installing the API. We will need to make a small modification to it so that it installs the Python libraries to Python 3.6 and not Python 3.5. Open the Makefile for editing using the vi editor.

vi Makefile

Once in the vi editor, press “i” to start inserting text. Use the arrows to navigate down to the reference to “python3”, position the cursor to the end of it and add “.6” to the end of that reference (it should read “python3.6”). Then press ESC to get out of insert mode and type “:x” (colon then x) and press ENTER to save the file.
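If you would rather script the change than edit it in vi, the substitution is a one-liner. Here is a sketch; it assumes the Makefile references the interpreter as a bare "python3", as described above:

```python
import re

def pin_python_minor_version(makefile_text, version="3.6"):
    """Rewrite bare 'python3' references to e.g. 'python3.6',
    leaving already-versioned references (python3.5 etc.) untouched."""
    return re.sub(r"python3(?!\.\d)", "python" + version, makefile_text)
```

You could apply it with something like `Path("Makefile").write_text(pin_python_minor_version(Path("Makefile").read_text()))`, but check the result with a diff before building.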

Now we can compile and install the API.

make
make install

Install the NC App Zoo

The NC App Zoo contains lots of example applications that you can learn from. We’re going to use one of them to do a simple test with our NCS.

cd ../../../ncappzoo/apps/hello_ncs_py

Again here we are going to have to modify the Makefile in this directory and replace the reference to “python3” with “python3.6”. Open the Makefile in the vi editor and make the change.

vi Makefile

Now we can run the example.

make run

You should see the following output:

making run
python3.6 hello_ncs.py;
Hello NCS! Device opened normally.
Goodbye NCS! Device closed normally.
NCS device working.

Test the YOLO project for PYNQ-Z1

Now we can try out the YOLO project for PYNQ-Z1 and use a pre-built graph file (normally we would have to compile the graph file on our development PC with the NC Toolkit). Let’s start by cloning the YOLO for PYNQ-Z1 project.

cd /home/xilinx/jupyter_notebooks
git clone https://github.com/fpgadeveloper/pynq-ncs-yolo.git

Then we download the prebuilt graph file.

cd pynq-ncs-yolo
wget "http://fpgadeveloper.com/downloads/2018_04_19/graph"

Now we can run the YOLO single image example.

cd py_examples
python3.6 yolo_example.py ../graph ../images/dog.jpg ../images/dog_output.jpg

You should get this output:

Device 0 Address: 1.4 - VID/PID 03e7:2150
Starting wait for connect with 2000ms timeout
Found Address: 1.4 - VID/PID 03e7:2150
Found EP 0x81 : max packet size is 512 bytes
Found EP 0x01 : max packet size is 512 bytes
Found and opened device
Performing bulk write of 865724 bytes...
Successfully sent 865724 bytes of data in 211.110297 ms (3.910841 MB/s)
Boot successful, device address 1.4
Found Address: 1.4 - VID/PID 03e7:f63b
done
Booted 1.4 -> VSC
total time is " milliseconds 285.022
(768, 576)
    class : car , [x,y,w,h]=[566,131,276,128], Confidence = 0.29101133346557617
    class : bicycle , [x,y,w,h]=[384,290,455,340], Confidence = 0.24596166610717773
root@pynq:/home/xilinx/jupyter_notebooks/pynq-ncs-yolo/py_examples#

In Jupyter, you will be able to browse through to the output image and view it (/pynq-ncs-yolo/images/dog_output.jpg).

So now you should be set up and able to run the Jupyter notebooks in the YOLO for PYNQ-Z1 project. Check out a demo of one of the notebooks in this video.

In the video we get about 3fps when we don’t do any resize operation in software. When we do resizing in software the frame rate drops to about 1.5fps. If we offloaded the resize operation to the FPGA we’d get that frame rate back up to 3fps and we should be able to boost that to 6fps if we use threading and the second processor in the Zynq-7000 SoC, but I’ll leave that for another video.
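Those frame rates are easier to reason about as per-frame time budgets. A rough sketch of the arithmetic (the 3/1.5/6 fps figures are the ones quoted above; the split between inference and resize time is my approximation):

```python
infer_ms = 1000 / 3.0               # ~333 ms: NCS inference alone -> ~3 fps
resize_ms = 1000 / 1.5 - infer_ms   # ~333 ms: software resize roughly doubles the frame time

# Offloading the resize to the FPGA removes resize_ms from the critical path
fps_hw_resize = 1000 / infer_ms

# Pipelining frames across the two Cortex-A9 cores could roughly double throughput
fps_threaded = 2 * fps_hw_resize
```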

Board bring-up: MYIR MYD-Y7Z010 Dev board


In this tutorial video, I bring up the 3x Gigabit Ethernet ports on the MYD-Y7Z010 Development board from MYIR. Firstly, I create a Vivado design for this board, then I export it to the SDK and generate the echo server application for each of the 3 ports (note that the echo server application only supports one port at a time). At the end of the video, I test each of these designs on hardware and ensure that the ports are given an IP address via DHCP and that I can ping each port. I did this on the MYIR dev board, but I hope that the tutorial can be of help to people bringing up Ethernet ports on other platforms or their own custom boards.

Requirements

To go through this tutorial yourself, you’ll need:

  • the board files that I’ve placed in this Github repository: Projects for the MYD-Y7Z010 Development board
  • Vivado 2018.1 – I’ve done the tutorial with Vivado 2018.1, but you should be able to do it in future versions without too much trouble
  • an RS232 to USB converter
  • a network running DHCP
  • a CAT-5e/6 Ethernet cable

Description

The MYIR board is based on the Zynq 7010 device, so we'll make use of the two built-in GEMs of the Zynq PS and we'll use the AXI Ethernet Subsystem IP for the third port. The image below shows how the ports are connected through the Ethernet PHYs to the RJ45 connectors. All of the PHYs have an RGMII interface to the MACs.

As you can see in the block diagram, one of the PHYs is on the module (SoM) and this PHY is directly connected (via MIO pins) to GEM0 of the Zynq PS. The other two PHYs are on the carrier board and they connect to the FPGA (PL) of the Zynq. For one of these PHYs, we'll route GEM1 to the PL via EMIO, and we'll use a GMII-to-RGMII IP to convert the GMII interface to RGMII for the PHY connection. For the last PHY we will use the AXI Ethernet Subsystem IP.

UART for debug

When working with any board, a UART for debug is handy. This MYIR board has a few UART options but unfortunately none of them is a USB-UART:

  • 3.3V TTL UART to a pin header (connects to UART1 of the Zynq PS)
  • RS232 UART to the DB9 connector (connects to PL)
  • RS485 UART to the DB9 connector (connects to PL)


If you're going through this yourself, I suggest you use the UART option that is most convenient for you. In my case, I've got an RS232 to USB converter handy, so I'm using the RS232 option. The MYIR board comes with a rainbow ribbon cable to break out the DB9 connector, so I've just wired up the RS232 signals to the DB9 connector of the converter:

  • White wire (RS232 GND) to pin 5 (GND) of the DB9 of the converter
  • Grey wire (RS232 TX) to pin 2 (RX) of the DB9 of the converter
  • Purple wire (RS232 RX) to pin 3 (TX) of the DB9 of the converter

Board files

Before creating the Vivado project for this board, you should copy the board files into your Vivado installation. Find the board files in the Github repo here:

https://github.com/fpgadeveloper/myd-y7z010-projects/tree/master/Vivado/boards/board_files/MYD-Y7Z010/1.1

Respecting that directory structure, copy the board files into your Vivado installation here:

\Xilinx\Vivado\VERSION\data\boards\board_files

The next time you run Vivado, the MYIR board will be on the list of boards when creating a new project.

lwIP Library modifications

When we get to the SDK, we only have to generate the echo server application/template for each of the 3 ports. However, as is typical when working with the echo server application, we have to modify some of the code that deals with configuration of the PHY. The code already handles some Marvell and TI PHYs, but not the Microchip PHY that this MYIR board uses (KSZ9031). I’ve detailed the code modifications below, but I’ve also included the modified code in the Github repo.

The best way to deal with library modifications is to create a local copy of the original library and bump up the version number. You modify this local copy and then add its location as a repository to your SDK workspace. Then when generating the echo server application, the SDK will use your modified library instead of the original library. In the Github repo, I've already created this local copy of the library, but only containing the modified sources. To move the rest of the sources into this local copy, I've written a Tcl script that you can run. It's a very simple script and in fact, if you prefer to just copy the files over manually, you can – just make sure not to overwrite the files that are already in the repo, as they contain the required modifications.

For those who prefer to make the modifications themselves, or who want to better understand the modifications, I’ve described them below.

File to modify

Filename: xemacpsif_physpeed.c
Location: \EmbeddedSw\ThirdParty\sw_services\lwip202_v1_09\src\contrib\ports\xilinx\netif

This file contains the code for configuration of the PHY connected to the GEMs. There are two things we need to change in this file. Firstly, we need to add a function to configure the Microchip PHY; for the most part, our function is the same as the one for the Marvell PHY, except that the detection of the link speed is done through a different register. Secondly, we need to create a define for the PHY address of the GMII-to-RGMII converter, so that the code makes the necessary register changes to the core after link-up.

Add these defines for the Microchip PHY identifier and to specify the PHY address of the GMII-to-RGMII converter:

#define PHY_MICROCHIP_IDENTIFIER 0x0022
#define XPAR_GMII2RGMIICON_0N_ETH1_ADDR 8

Add this function for the configuration of the Microchip PHY (you can place it below the get_Marvell_phy_speed function):

static u32_t get_Microchip_phy_speed(XEmacPs *xemacpsp, u32_t phy_addr)
{
	u16_t temp;
	u16_t control;
	u16_t status;
	u16_t status_speed;
	u32_t timeout_counter = 0;

	xil_printf("Start PHY autonegotiation \r\n");

	XEmacPs_PhyWrite(xemacpsp,phy_addr, IEEE_PAGE_ADDRESS_REGISTER, 2);
	XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_CONTROL_REG_MAC, &control);
	control |= IEEE_RGMII_TXRX_CLOCK_DELAYED_MASK;
	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_CONTROL_REG_MAC, control);

	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_PAGE_ADDRESS_REGISTER, 0);

	XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_AUTONEGO_ADVERTISE_REG, &control);
	control |= IEEE_ASYMMETRIC_PAUSE_MASK;
	control |= IEEE_PAUSE_MASK;
	control |= ADVERTISE_100;
	control |= ADVERTISE_10;
	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_AUTONEGO_ADVERTISE_REG, control);

	XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_1000_ADVERTISE_REG_OFFSET,
					&control);
	control |= ADVERTISE_1000;
	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_1000_ADVERTISE_REG_OFFSET,
					control);

	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_PAGE_ADDRESS_REGISTER, 0);
	XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_COPPER_SPECIFIC_CONTROL_REG,&control);
	control |= (7 << 12);	/* max number of gigabit attempts */
	control |= (1 << 11);	/* enable downshift */
	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_COPPER_SPECIFIC_CONTROL_REG,control);
	XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_CONTROL_REG_OFFSET, &control);
	control |= IEEE_CTRL_AUTONEGOTIATE_ENABLE;
	control |= IEEE_STAT_AUTONEGOTIATE_RESTART;
	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_CONTROL_REG_OFFSET, control);

	XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_CONTROL_REG_OFFSET, &control);
	control |= IEEE_CTRL_RESET_MASK;
	XEmacPs_PhyWrite(xemacpsp, phy_addr, IEEE_CONTROL_REG_OFFSET, control);

	while (1) {
		XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_CONTROL_REG_OFFSET, &control);
		if (control & IEEE_CTRL_RESET_MASK)
			continue;
		else
			break;
	}

	XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_STATUS_REG_OFFSET, &status);

	xil_printf("Waiting for PHY to complete autonegotiation.\r\n");

	while ( !(status & IEEE_STAT_AUTONEGOTIATE_COMPLETE) ) {
		sleep(1);
		XEmacPs_PhyRead(xemacpsp, phy_addr,
						IEEE_COPPER_SPECIFIC_STATUS_REG_2,  &temp);
		timeout_counter++;

		if (timeout_counter == 30) {
			xil_printf("Auto negotiation error \r\n");
			return XST_FAILURE;
		}
		XEmacPs_PhyRead(xemacpsp, phy_addr, IEEE_STATUS_REG_OFFSET, &status);
	}
	xil_printf("autonegotiation complete \r\n");

  // Read from Microchip page 0, register 0x1F (PHY Control)
  // http://ww1.microchip.com/downloads/en/DeviceDoc/00002117F.pdf
	XEmacPs_PhyRead(xemacpsp, phy_addr,0x1F,&status_speed);
	if (status_speed & 0x040)
		return 1000;
	else if(status_speed & 0x020)
		return 100;
	else if(status_speed & 0x010)
		return 10;

	return XST_SUCCESS;
}

Modify the get_IEEE_phy_speed function as below so that it calls the get_Microchip_phy_speed function (above) to configure the Microchip PHY:

static u32_t get_IEEE_phy_speed(XEmacPs *xemacpsp, u32_t phy_addr)
{
	u16_t phy_identity;
	u32_t RetStatus;

	XEmacPs_PhyRead(xemacpsp, phy_addr, PHY_IDENTIFIER_1_REG,
					&phy_identity);
	if (phy_identity == PHY_TI_IDENTIFIER) {
		RetStatus = get_TI_phy_speed(xemacpsp, phy_addr);
	} else if (phy_identity == PHY_REALTEK_IDENTIFIER) {
		RetStatus = get_Realtek_phy_speed(xemacpsp, phy_addr);
	} else if (phy_identity == PHY_MICROCHIP_IDENTIFIER) {
		RetStatus = get_Microchip_phy_speed(xemacpsp, phy_addr);
	} else {
		RetStatus = get_Marvell_phy_speed(xemacpsp, phy_addr);
	}

	return RetStatus;
}

File to modify

Filename: xaxiemacif_physpeed.c
Location: \EmbeddedSw\ThirdParty\sw_services\lwip202_v1_09\src\contrib\ports\xilinx\netif

This file contains the code for configuration of the PHYs connected to AXI Ethernet Subsystem IP. There are two things we need to change in this file. Firstly, we need to add a function to configure the Microchip PHY; as described earlier, our function is mostly the same as the one for the Marvell PHY, except that the detection of the link speed is done through a different register. Secondly, because we are using AXI Ethernet Subsystem IP, we need to disable the RGMII TX clock delay that is internal to the PHY (for more information on this topic, read RGMII Timing Considerations).

Add these defines to specify the Microchip identifier and the register masks for TX and RX clock delay settings:

#define PHY_MICROCHIP_IDENTIFIER 0x0022
#define IEEE_RGMII_TX_CLOCK_DELAYED_MASK 0x0010
#define IEEE_RGMII_RX_CLOCK_DELAYED_MASK 0x0020

Add this function to configure the Microchip PHY; you can add it below the get_phy_speed_88E1116R function:

unsigned int get_phy_speed_Microchip(XAxiEthernet *xaxiemacp, u32 phy_addr)
{
	u16 phy_val;
	u16 control;
	u16 status;
	u16 partner_capabilities;

	xil_printf("Start PHY autonegotiation \r\n");

  /* RGMII with only RX internal delay enabled */
	XAxiEthernet_PhyWrite(xaxiemacp,phy_addr, IEEE_PAGE_ADDRESS_REGISTER, 2);
	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_CONTROL_REG_MAC, &control);
  control &= ~IEEE_RGMII_TX_CLOCK_DELAYED_MASK;
  control |= IEEE_RGMII_RX_CLOCK_DELAYED_MASK;
	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_CONTROL_REG_MAC, control);

	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_PAGE_ADDRESS_REGISTER, 0);

	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_AUTONEGO_ADVERTISE_REG, &control);
	control |= IEEE_ASYMMETRIC_PAUSE_MASK;
	control |= IEEE_PAUSE_MASK;
	control |= ADVERTISE_100;
	control |= ADVERTISE_10;
	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_AUTONEGO_ADVERTISE_REG, control);

	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_1000_ADVERTISE_REG_OFFSET,
				&control);
	control |= ADVERTISE_1000;
	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_1000_ADVERTISE_REG_OFFSET,
				control);

	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_PAGE_ADDRESS_REGISTER, 0);
	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_COPPER_SPECIFIC_CONTROL_REG,
				&control);
	control |= (7 << 12);	/* max number of gigabit attempts */
	control |= (1 << 11);	/* enable downshift */
	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_COPPER_SPECIFIC_CONTROL_REG,
				control);

	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_CONTROL_REG_OFFSET, &control);
	control |= IEEE_CTRL_AUTONEGOTIATE_ENABLE;
	control |= IEEE_STAT_AUTONEGOTIATE_RESTART;
	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_CONTROL_REG_OFFSET, control);

	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_CONTROL_REG_OFFSET, &control);
	control |= IEEE_CTRL_RESET_MASK;
	XAxiEthernet_PhyWrite(xaxiemacp, phy_addr, IEEE_CONTROL_REG_OFFSET, control);
	while (1) {
		XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_CONTROL_REG_OFFSET, &control);
		if (control & IEEE_CTRL_RESET_MASK)
			continue;
		else
			break;
	}

	xil_printf("Waiting for PHY to complete autonegotiation.\r\n");

	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_STATUS_REG_OFFSET, &status);
	while ( !(status & IEEE_STAT_AUTONEGOTIATE_COMPLETE) ) {
		AxiEthernetUtilPhyDelay(1);
		XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_COPPER_SPECIFIC_STATUS_REG_2,
							&phy_val);
		if (phy_val & IEEE_AUTONEG_ERROR_MASK) {
			xil_printf("Auto negotiation error \r\n");
		}
		XAxiEthernet_PhyRead(xaxiemacp, phy_addr, IEEE_STATUS_REG_OFFSET,
					&status);
	}

	xil_printf("autonegotiation complete \r\n");

  // Read from Microchip page 0, register 0x1F (PHY Control)
  // http://ww1.microchip.com/downloads/en/DeviceDoc/00002117F.pdf
	XAxiEthernet_PhyRead(xaxiemacp, phy_addr,0x1F,&status);
	if (status & 0x040)
		return 1000;
	else if(status & 0x020)
		return 100;
	else
		return 10;
}

Modify the get_IEEE_phy_speed function as below so that it calls the get_phy_speed_Microchip function (above) to configure the Microchip PHY:

unsigned get_IEEE_phy_speed(XAxiEthernet *xaxiemacp)
{
	u16 phy_identifier;
	u16 phy_model;
	u8 phytype;

#ifdef XPAR_AXIETHERNET_0_BASEADDR
	u32 phy_addr = detect_phy(xaxiemacp);

	/* Get the PHY Identifier and Model number */
	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, PHY_IDENTIFIER_1_REG, &phy_identifier);
	XAxiEthernet_PhyRead(xaxiemacp, phy_addr, PHY_IDENTIFIER_2_REG, &phy_model);

/* Depending upon what manufacturer PHY is connected, a different mask is
 * needed to determine the specific model number of the PHY. */
	if (phy_identifier == MARVEL_PHY_IDENTIFIER) {
		phy_model = phy_model & MARVEL_PHY_MODEL_NUM_MASK;

		if (phy_model == MARVEL_PHY_88E1116R_MODEL) {
			return get_phy_speed_88E1116R(xaxiemacp, phy_addr);
		} else if (phy_model == MARVEL_PHY_88E1111_MODEL) {
			return get_phy_speed_88E1111(xaxiemacp, phy_addr);
		}
	} else if (phy_identifier == PHY_MICROCHIP_IDENTIFIER) {
		return get_phy_speed_Microchip(xaxiemacp, phy_addr);
	} else if (phy_identifier == TI_PHY_IDENTIFIER) {
		phy_model = phy_model & TI_PHY_DP83867_MODEL;
		phytype = XAxiEthernet_GetPhysicalInterface(xaxiemacp);

		if (phy_model == TI_PHY_DP83867_MODEL && phytype == XAE_PHY_TYPE_SGMII) {
			return get_phy_speed_TI_DP83867_SGMII(xaxiemacp, phy_addr);
		}

		if (phy_model == TI_PHY_DP83867_MODEL) {
			return get_phy_speed_TI_DP83867(xaxiemacp, phy_addr);
		}
	}
	else {
	    LWIP_DEBUGF(NETIF_DEBUG, ("XAxiEthernet get_IEEE_phy_speed: Detected PHY with unknown identifier/model.\r\n"));
	}
#endif
#ifdef PCM_PMA_CORE_PRESENT
	return get_phy_negotiated_speed(xaxiemacp, phy_addr);
#endif
}

Booting from SD card

When testing on hardware, unfortunately the MYIR board has the legacy 100mil pitch JTAG header, and I don't have an appropriate adapter for this. So instead of programming the board via JTAG, I generate a BOOT.bin file for each of the echo server applications, then I boot the board from the SD card. To boot the board from SD card, you will have to set the SW1 boot setting to 1-OFF, 2-ON.

What next?

Now that we've validated the hardware of this board, we could make it useful by getting Linux to run on it. When I find some time in the coming weeks, I'll generate PetaLinux for this board and test it out.

By the way, board bring-up is one of the services that we offer our customers. If your company or start-up has a custom board that you would like to bring-up, please get in touch. We are particularly good with Ethernet and PCIe interfaces.

Introducing 96B Quad Ethernet Mezzanine

$
0
0

Over the last few months I’ve been really busy working on a new product and I just want to take a step back today and share some of it. Let me start with what it is and then I’ll tell you about how and why I did it.

The product

A 4-port Gigabit Ethernet mezzanine card designed for Avnet‘s Ultra96 Zynq Ultrascale+ single board computer.

The benefits. The low cost and small form factor of the overall system makes it possible (I hope) to not only prototype new products but to go to market with this solution until investment in a custom design is justified by market demand. The other major benefit is community support – the Ultra96 has a large and growing user base and is part of an ecosystem that is supported by the 96Boards community as well as the PYNQ community. If you want to develop a product with this hardware, you’re not going it alone and you can leverage a lot of work that has already been done.

The applications. So much is done with Gigabit Ethernet these days; here are some of the applications that this product will add value to: industrial controls, network security, smart NICs, network monitoring, and hardware accelerated routers/switches.

The features

The 4 Ethernet ports each connect to a separate Ethernet PHY, the DP83867IS from Texas Instruments. This is a low power, robust, high immunity gigabit Ethernet PHY with many powerful features including:

  • Extra low latency (TX < 90ns, RX < 290ns)
  • Wake-on-LAN packet detection
  • IEEE 1588 Time stamp support
  • Cable diagnostics

The mezzanine has a stacked low-speed expansion connector that brings up all of the unused low-speed I/Os of the Ultra96 and can be used to attach a second mezzanine card such as the Sensors Mezzanine. The extended I/Os include:

  • 10x I/Os of the programmable logic (PL) (can be reconfigured to GPIO, UART, I2C, SPI, or other low-speed interfaces)
  • 3x GPIOs of the processor system (PS)
  • 2x I2C buses (I2C0 and I2C1)
  • 1x SPI bus (SPI0)
  • Power and reset pushbuttons

The story

I've been closely watching the Ultra96 since Avnet launched it early last year. It's powered by the Zynq Ultrascale+ MPSoC, it has the dimensions of a credit card and it is (at the time I write this) the lowest-price ZU+ board on the market. What's more, it's designed to the 96Boards spec so it can carry almost any of the standardized mezzanine boards (add-on cards) now on the market. When they then released PYNQ support later in the year, I had to get one.

Why make an Ethernet mezzanine? The marketing around the Ultra96 is mainly focused on AI, which is natural because most of the 96Boards SBCs are designed with vision applications in mind, and the ZU+ has enormous potential in accelerating neural networks. But the Ultra96 is also a great fit for Ethernet – the ZU+ has 4 internal Gigabit Ethernet MACs, and it also has the ability to run SGMII links on its I/O pins (SGMII is a serial PHY interface that operates at 1.25Gbps). Ethernet is also a great fit for us. Through our Ethernet FMC product, we've helped literally hundreds of companies develop their own Ethernet products. So an opportunity eventually fell into my lap. One of our clients was looking for exactly this kind of solution and they were willing to share the design costs. So I knew we had at least one customer, and a good part of our expenses were covered.

The design

There were quite a few design challenges to overcome to make this board happen. The constrained PCB real estate and location of the expansion connectors severely limited my choice of parts, most notably the quad RJ45. The Marvell PHYs that we use on the Ethernet FMC were not suitable for this, so I couldn’t reuse that trusted design. The pin assignment of the high-speed expansion connector on the Ultra96 wasn’t exactly ideal for Ethernet. The low I/O supply voltage meant that I couldn’t be sure that it was all going to work until I built and tested the actual hardware.

Limited PCB real estate. The obvious challenge was overcoming the physical constraints. The credit card sized form factor is great but it doesn't provide much room for a big quad RJ45 connector and 4 Ethernet PHYs. But the limited space was not the only issue – there is almost no clearance between the bottom side of the mezzanine and the tops of the USB connectors on the Ultra96, so you can't put components under there, and you definitely can't have the legs of an RJ45 connector poking through. The other issue is that the MDI pins of the majority of RJ45 connectors would poke through at exactly the spot where the high speed expansion connector sits. I started by looking at surface-mount connectors, which are terrible for mechanical robustness, but I didn't see any other way. It turned out that there were very few surface-mount options available, and those that existed were unsuitable for various reasons. From the hundreds of thru-hole options, I ended up finding about 3 connectors that fit, but only if I squeezed two of the mounting posts between the two USB connectors. Also, because I needed the MDI pins to be well clear of the expansion connector, the RJ45 had to be especially deep, so it had to take up more PCB real estate than I would have liked to give up. Needless to say, my choice of Ethernet PHY was also strongly influenced by the limited PCB real estate – I needed the smallest package requiring the least external circuitry.

SGMII. The high-speed expansion connector of the Ultra96 has 14 differential pairs routed to it, connecting to 28 I/Os. To connect 4 Ethernet PHYs through 14 pairs, there is really only one possibility: SGMII. The Marvell PHYs that we use on the Ethernet FMC have an RGMII interface, so I couldn’t use them and I couldn’t leverage any of the Ethernet FMC‘s design.

SGMII over LVDS. SGMII is a serial gigabit interface, so one interface uses only 4 pins (2 for TX and 2 for RX). Normally, running gigabit interfaces to an FPGA requires gigabit transceivers, but the pins on the expansion connector don’t route to transceivers. So I had to use “SGMII over LVDS”. SGMII over LVDS is a method for implementing SGMII on I/O pins of the FPGA, where the transceiver (data recovery, encoder/decoder) is implemented by programmable logic in the FPGA. Xilinx Vivado includes a free IP core called Ethernet PCS/PMA or SGMII that implements SGMII over LVDS.

LVDS on a 1.2V bank. The high speed expansion connector of the Ultra96 routes to pins in an I/O bank that is powered by 1.2V. Here’s the problem: you can’t use the LVDS IO standard on pins in a bank that is powered at 1.2V. The solution turned out to be to use AC coupling (required by SGMII anyway) and what they call a “pseudo differential IO standard” called DIFF_SSTL12.

Pin assignment trouble. This was the biggest risk I faced in designing this board. If you’ve ever used the Ethernet PCS/PMA or SGMII IP on the ZU+, you’ll know that it requires careful selection of the I/O pins – you can’t just connect it to any of them. But I didn’t have that luxury; I had to use the I/O pins that were already chosen by Avnet: the ones routed to the high-speed expansion connector. And it turned out that they were not ideally suited for this IP core. No combination of those pins would allow me to implement 4x SGMII links. Sh#t. I told my customer that we weren’t going to get 4 ports out of this. I spent days playing around with the core in Vivado and reading the product guides. I had almost resigned myself to the idea of having only 3 ports on this board. Then I got an idea: I might have found a way to sidestep the requirement of the Ethernet PCS/PMA or SGMII IP that was preventing me from using those pins. But it was a gamble – I didn’t know if it was going to work, I couldn’t find anyone in the forums who had tried it before, and I couldn’t test it until I had the hardware in my hands. I had to give it a shot. I designed the board with 4 ports, knowing that we might not be able to get all of them working. Well, it turned out that I was right: I got all 4 ports working, and I’m glad that I didn’t settle for 3.

Your ideas

Now I want to know what you guys think. What would you do with this compact but powerful Ethernet development platform? What mezzanines would you stack onto it? I really hope that this mezzanine delivers value to a lot of projects – what’s yours?

Ethernet Mezzanine for Ultra96

Measuring the maximum throughput of Gigabit Ethernet on the Ultra96


In this video I use Iperf to measure the actual maximum throughput of a Gigabit Ethernet port on the Ultra96 v2 running PetaLinux. The result that I get is pretty impressive: 910-940Mbps!

To do this test yourself you’ll need to build the example design that you’ll find on our Github repo, and then prepare the SD card. Plug your SD card into your Ultra96 and fit the 96B Quad Ethernet Mezzanine on top of the Ultra96. You need to connect the Ultra96‘s USB UART to your PC and open up a terminal (115200, 8-bit data, 1-bit stop, no parity).

You’ll also need to download Iperf for Windows and copy it to some location on your hard drive. I just copy it to C:/iperf.

You’ll need to physically connect both your Windows PC and your Ultra96 to a router with a DHCP server so that both devices automatically obtain an IP address. This isn’t strictly necessary, but it simplifies things. If you prefer, you can instead connect the two directly and set them up with fixed IP addresses.

At this point you can power on the Ultra96 and let PetaLinux boot. When it finishes, log in, then run ifconfig to get the IP address of the Ultra96. Shut down the Ethernet port by running ifconfig eth0 down, reconfigure the MTU by running ifconfig eth0 mtu 9000, then bring the port back up by running ifconfig eth0 up.
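The MTU reconfiguration steps above can be put together as a small script. This is a sketch only: it assumes the Gigabit port appears as eth0, and the hypothetical run helper defaults to a dry-run preview (printing each command) so you can check the sequence before executing it for real, as root on the Ultra96, with DRY_RUN=0.

```shell
#!/bin/bash
# Jumbo-frame setup sketch (assumption: the Gigabit port is eth0).
# run() previews commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"            # dry run: print the command instead of running it
  else
    "$@"                 # real run (requires root on the Ultra96)
  fi
}

run ifconfig eth0 down      # shut the port down before changing the MTU
run ifconfig eth0 mtu 9000  # enable jumbo frames (9000-byte MTU)
run ifconfig eth0 up        # bring the port back up
```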

Now you can launch the Iperf server on the Ultra96 by running iperf -s.

Now run a command prompt on your Windows machine, cd to the location of Iperf and then run iperf -c <IP addr of Ultra96> to launch the Iperf client.

You should see output like the following on the server side. Note that in this particular capture the roles were reversed: the server (iperf3 -s) was running on the Windows machine, with the Ultra96 as the client.

C:\iperf>iperf3 -s

Server listening on 5201

Accepted connection from 192.168.2.16, port 42890
[  5] local 192.168.2.18 port 5201 connected to 192.168.2.16 port 42892
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-1.00   sec   110 MBytes   920 Mbits/sec
[  5]   1.00-2.00   sec   113 MBytes   946 Mbits/sec
[  5]   2.00-3.00   sec   113 MBytes   946 Mbits/sec
[  5]   3.00-4.00   sec   113 MBytes   945 Mbits/sec
[  5]   4.00-5.00   sec   113 MBytes   946 Mbits/sec
[  5]   5.00-6.00   sec   113 MBytes   947 Mbits/sec
[  5]   6.00-7.00   sec   113 MBytes   949 Mbits/sec
[  5]   7.00-8.00   sec   113 MBytes   945 Mbits/sec
[  5]   8.00-9.00   sec   113 MBytes   949 Mbits/sec
[  5]   9.00-10.00  sec   113 MBytes   949 Mbits/sec
[  5]  10.00-10.03  sec  2.92 MBytes   940 Mbits/sec


[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.03  sec  0.00 Bytes   0.00 bits/sec    sender
[  5]   0.00-10.03  sec  1.10 GBytes   944 Mbits/sec   receiver

Those are pretty impressive results, especially for a single board computer, and they really demonstrate the power of the Zynq US+‘s quad-core ARM Cortex-A53 clocked at 1.2GHz. And let’s not forget: this wouldn’t be possible without jumbo frames, a critical feature supported by the integrated Gigabit MACs in the Zynq US+. All of this makes the Ultra96 a great little platform for developing network applications with the Zynq Ultrascale+.
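To see why jumbo frames matter, here’s a quick back-of-the-envelope calculation. The overhead figures are my assumptions, not from the post: 38 bytes of per-frame Ethernet overhead (8 preamble + 14 header + 4 FCS + 12 interframe gap) and 40 bytes of IPv4+TCP headers per packet.

```shell
# Theoretical TCP goodput over Gigabit Ethernet vs MTU, using the
# overhead assumptions stated above.
awk 'BEGIN {
  mtus[1] = 1500; mtus[2] = 9000;
  for (i = 1; i <= 2; i++) {
    mtu = mtus[i];
    # payload = MTU minus IP+TCP headers; wire cost = MTU plus frame overhead
    goodput = 1000 * (mtu - 40) / (mtu + 38);
    printf "MTU %4d -> ~%.0f Mbits/sec max TCP goodput\n", mtu, goodput;
  }
}'
```

That works out to roughly 949 Mbits/sec at the default MTU of 1500 and roughly 991 Mbits/sec with 9000-byte jumbo frames, so the 940+ Mbits/sec that Iperf reported is close to the practical ceiling of the link.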


NVMe SSD Speed test on the ZCU106 Zynq Ultrascale+ in PetaLinux


Probably the most common question that I receive about our SSD-to-FPGA solution is: what are the maximum achievable read/write speeds? A complete answer would require a whole other post, so today I’m going to show you what speeds we can get with a simple but highly flexible setup that doesn’t use any paid IP. I’ve run some simple Linux scripts on this hardware to measure the read/write speeds of two Samsung 970 EVO M.2 NVMe SSDs. If you have our FPGA Drive FMC and a ZCU106 board, you can download the boot files and the scripts and run this on your own hardware. Let’s jump straight to the results.

Results

  • Single SSD write speed: 221 MBytes/s
  • Single SSD read speed: 627 MBytes/s
  • Dual (parallel) SSD write speed: 396 MBytes/s
  • Dual (parallel) SSD read speed: 1046 MBytes/s

In absolute terms, those are good speeds that will satisfy a lot of applications. The solution is simple and easy to work with because we’re accessing the drives from Linux. Also, the solution is reasonably priced because we’re not using any paid IP.
In relative terms, however, we need to consider that the Samsung 970 EVO SSDs are promoted as having read and write speeds of 3,500MB/s and 2,500MB/s respectively. This solution achieves only about 18% of the read performance potential and 9% of the write performance potential. The bottleneck is the way the NVMe protocol is implemented in this setup: we’re essentially using the processor to implement NVMe – the work is being done by software in the Linux kernel. If you want to speed it up, you need to offload that work to the FPGA – you need what’s called an NVMe Accelerator IP core.
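If you want to check those percentages yourself, they are just the ratio of measured to rated throughput (the 3,500/2,500 MB/s figures are Samsung’s advertised ratings, as above):

```shell
# Measured single-SSD throughput as a fraction of the 970 EVO's rated speeds
# (627 and 221 MBytes/s are the measured results from this post).
awk 'BEGIN {
  printf "Read:  %.0f%% of rated (627 / 3500)\n", 100 * 627 / 3500;
  printf "Write: %.0f%% of rated (221 / 2500)\n", 100 * 221 / 2500;
}'
```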

Try it yourself

If this simple setup is good enough for you, then try it yourself. You’ll need the ZCU106 board, the FPGA Drive FMC and 2x NVMe SSDs. Follow these instructions:

  • Download the boot files and copy them onto the SD card of the ZCU106
  • Make sure that the FPGA Drive FMC is properly attached to the ZCU106 HPC0 connector and that it has 2x SSDs installed
  • Make sure that the USB UART of the ZCU106 is connected to a PC and that a console window is open (115200 baud)
  • Make sure that the Ethernet port of the ZCU106 is connected to a network router with DHCP and an internet connection
  • Power up the ZCU106 and wait for PetaLinux to finish booting
  • Login using the username root and password root

Once logged into PetaLinux, you should notice that the SSDs have been automatically mounted to /run/media/nvme0n1p1 and /run/media/nvme1n1p1. The next step is to download some bash scripts and run them.

  • To download the scripts, type the command wget "http://fpgadeveloper.com/downloads/2019_11_28/speed_test.tar.gz"
  • To extract the scripts, type the command tar -xvzf speed_test.tar.gz
  • Make the scripts executable by typing chmod 755 *_test
  • Now run the script by typing bash speed_test

Here’s what my terminal output looks like:

root@zcu106_hpc0_dual:~# wget "http://fpgadeveloper.com/downloads/2019_11_28/speed_test.tar.gz"
Connecting to fpgadeveloper.com (172.81.116.20:80)
speed_test.tar.gz    100% |********************************|   667  0:00:00 ETA
root@zcu106_hpc0_dual:~# tar -xvzf speed_test.tar.gz
dual_read_test
dual_write_test
single_read_test
single_write_test
speed_test
root@zcu106_hpc0_dual:~# chmod 755 *_test
root@zcu106_hpc0_dual:~# bash speed_test
-----------------------------------------------------------
Speed tests for ZCU106 and FPGA Drive FMC with 2x NVMe SSDs
-----------------------------------------------------------
Single SSD Write:
- Data:  4GBytes
- Delay: 18.083 seconds
- Speed: 221.202 MBytes/s
Single SSD Read:
- Data:  4GBytes
- Delay: 6.382 seconds
- Speed: 626.763 MBytes/s
Dual SSD Parallel Write:
- Data:  8GBytes
- Delay: 20.188 seconds
- Speed: 396.275 MBytes/s
Dual SSD Parallel Read:
- Data:  8GBytes
- Delay: 7.649 seconds
- Speed: 1045.89 MBytes/s
root@zcu106_hpc0_dual:~#

The scripts

The bash scripts are shown below. There is one script per test, plus a fifth script that runs each of the tests and measures the time taken by each.
Single write test
Uses dd to create a 4GB file on a single SSD. Note that we call sync at the end because the write isn’t completely finished until sync returns.

#!/bin/bash
# Single write test

dd if=/dev/zero of=/run/media/nvme0n1p1/test.img bs=4M count=1000
sync

Single read test
Uses dd to read the 4GB file that was created by the write test.

#!/bin/bash
# Single read test

dd if=/run/media/nvme0n1p1/test.img of=/dev/null bs=4M count=1000

Dual write test
Uses dd to create a 4GB file on each SSD (total of 8GB stored). To do these writes in parallel, we create two separate processes by using the & symbol, and we use the wait command to wait until the processes have finished. We call sync at the end because the writes aren’t completely finished until sync returns.

#!/bin/bash
# Dual write test (parallel)

dd if=/dev/zero of=/run/media/nvme0n1p1/test.img bs=4M count=1000 &
dd if=/dev/zero of=/run/media/nvme1n1p1/test.img bs=4M count=1000 &
wait
sync

Dual read test
Uses dd to read the 4GB files that were created by the dual write test. We again have to use the wait command to wait until the processes have finished.

#!/bin/bash
# Dual read test (parallel)

dd if=/run/media/nvme0n1p1/test.img of=/dev/null bs=4M count=1000 &
dd if=/run/media/nvme1n1p1/test.img of=/dev/null bs=4M count=1000 &
wait

Speed test
This main script runs each of the above scripts and measures the time taken to run them. It then calculates the throughput by taking the amount of data read/written and dividing by the delay in seconds. Note that each write test ends with a call to sync to ensure that the write was truly completed, and each read test is preceded by a flush of the disk cache to ensure that true read time is measured.

#!/bin/bash

TIMEFORMAT=%R
echo "-----------------------------------------------------------"
echo "Speed tests for ZCU106 and FPGA Drive FMC with 2x NVMe SSDs"
echo "-----------------------------------------------------------"

echo "Single SSD Write:"
echo "  - Data:  4GBytes"
delay="$(time ( bash single_write_test >& /dev/null ) 2>&1 1>/dev/null )"
echo "  - Delay: $delay seconds"
echo "  - Speed: $(dc "4000 $delay / p") MBytes/s"

# Flush the disk cache before performing read test
echo 3 > /proc/sys/vm/drop_caches
echo "Single SSD Read:"
echo "  - Data:  4GBytes"
delay="$(time ( bash single_read_test >& /dev/null ) 2>&1 1>/dev/null )"
echo "  - Delay: $delay seconds"
echo "  - Speed: $(dc "4000 $delay / p") MBytes/s"

echo "Dual SSD Parallel Write:"
echo "  - Data:  8GBytes"
delay="$(time ( bash dual_write_test >& /dev/null ) 2>&1 1>/dev/null )"
echo "  - Delay: $delay seconds"
echo "  - Speed: $(dc "8000 $delay / p") MBytes/s"

# Flush the disk cache before performing read test
echo 3 > /proc/sys/vm/drop_caches
echo "Dual SSD Parallel Read:"
echo "  - Data:  8GBytes"
delay="$(time ( bash dual_read_test >& /dev/null ) 2>&1 1>/dev/null )"
echo "  - Delay: $delay seconds"
echo "  - Speed: $(dc "8000 $delay / p") MBytes/s"