Data types

When creating a Deployment or Pipeline, you have to provide the type of its input and/or output, which can be plain or structured. This page gives an overview of all the data types that exist for plain and structured fields. It also explains how to convert some well-known data structures to those data types.

Data type for plain input/output

The data type for plain input or output is simple: it can be any unstructured string, as long as it's JSON serializable. The plain data type is ideal when you're not sure exactly what the data will look like, or when it has a varying number of variables.
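As a rough sketch of what this means in practice (the `Deployment` class and `request` method follow the usual deployment layout, and the payload keys are made up for illustration), a plain field can carry any structure as long as it is serialized to a JSON string:

```python
import json

# Sketch of a deployment with plain input and output (not an official template).
# `data` arrives as a single free-form string; here we assume it holds JSON text.
class Deployment:
    def request(self, data):
        parsed = json.loads(data)
        # Build an arbitrary result with a varying set of keys
        result = {"echo": parsed, "n_keys": len(parsed)}
        # A plain output must itself be a JSON-serializable string
        return json.dumps(result)
```

Because the field is unstructured, the shape of `result` can change from request to request without changing the deployment's definition.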

Data types for structured fields

The structured deployment fields and the corresponding data formats are shown in the table below. The data format matters when making a request programmatically via a script or HTTP request. When making a request through the WebApp, it's not necessary to include the quote characters (' or ").

| Data type | Data format |
| --- | --- |
| String | 'string' or "string" |
| Integer | 9999 |
| Double precision | 9999.99 |
| Boolean | True/False |
| Array of integers | [88, 99, ... ] |
| Array of doubles | [88.88, 99.99, ... ] |
| Array of strings | ['string', "string", ... ] |
| Blob (file) | '0186e35b-233e-45ce-ad01-3e1d204a0bc0' (a blob uuid) |

The allowed data types are the same for R and Python deployments. UbiOps will handle all data in Python data types, but converts the data to R data types for R deployments.
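As an illustration (the field names below are hypothetical), a structured request payload built in Python maps onto the table as follows:

```python
import json

# Hypothetical structured input: one field per data type from the table above.
request_data = {
    "name": "sample",               # String
    "count": 9999,                  # Integer
    "score": 9999.99,               # Double precision
    "valid": True,                  # Boolean
    "int_list": [88, 99],           # Array of integers
    "double_list": [88.88, 99.99],  # Array of doubles
    "labels": ["first", "second"],  # Array of strings
    "blob": "0186e35b-233e-45ce-ad01-3e1d204a0bc0",  # Blob (file): a blob uuid
}

# The whole payload must be JSON serializable to send it via a script or HTTP
payload = json.dumps(request_data)
```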

Converting common data structures

Pipelines allow you to connect the output fields of one deployment to the input fields of the next. Sometimes the data you want to pass on is in a specific data structure for which there is no data type. Below are examples of how to convert some common Python data structures to data types that can be passed between deployments.

Pandas DataFrame

When developing data science applications in Python, tabular data is often stored in a Pandas DataFrame¹. A DataFrame itself cannot be output by a deployment directly, but there are ways to work around this.

import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                  index=['row 1', 'row 2'],
                  columns=['col 1', 'col 2'])
  • Convert to JSON string: To output a DataFrame as a JSON string you can use the following snippet in the deployment code:

    output_df = df.to_json(orient='split')
    return {
        "df": output_df
    }

    To read the data back in as a DataFrame in the next deployment, use:

    df = pd.read_json(data['df'], orient='split')

  • Output as Blob (file): You can also write the DataFrame to a CSV, pickle, Excel, or any other file and pass it to the next deployment as a Blob (file). The example shows how to output a .csv file:

    df.to_csv('output.csv')
    return {
        "df": "output.csv"
    }

    To read the data back in as a DataFrame in the next deployment, use:

    df = pd.read_csv(data['df'])
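The JSON route above can be exercised end to end to confirm the round trip is lossless. A sketch (`StringIO` is used so the snippet also works on recent pandas versions, which deprecate passing a literal JSON string to `read_json`):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                  index=['row 1', 'row 2'],
                  columns=['col 1', 'col 2'])

# What the first deployment would return in its "df" output field
payload = df.to_json(orient='split')

# What the next deployment would do with data['df']
restored = pd.read_json(StringIO(payload), orient='split')
```

The 'split' orient preserves the index, columns, and values separately, which is why the restored DataFrame matches the original.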

Numpy Array

Another common data structure in data science is the NumPy array². A 1-dimensional array can be passed on as a string, but a multi-dimensional array should be passed as a Blob (file).

  • Convert to string: The NumPy method tostring() is deprecated from NumPy 1.19.0 onwards. Instead, use the following snippet in the deployment code to output an array as a string:

    import numpy as np

    arr = np.array([1, 2, 3, 4, 5, 6, 7])
    # First convert the array to bytes
    new_arr = arr.tobytes()
    # Then convert the bytes to a string, specifying the encoding
    decoded_array = new_arr.decode('utf-8')
    return {
        "decoded_array": decoded_array
    }

    To read the data back in as an Array in the next deployment, use:

    # First encode the string input to bytes, specifying the encoding
    bytes_array = data['decoded_array'].encode('utf-8')
    # Then load from buffer, specifying the data type of the original Array
    latest_arr = np.frombuffer(bytes_array, dtype=np.int64)

  • Output as Blob (file): The best way to pass a multi-dimensional array between deployments is to output it as either a NumPy binary file or a text file. The example shows how to output a NumPy binary file:

    np.save('data.npy', np.array([[1, 2, 3], [4, 5, 6]]))
    return {
        "np_file": "data.npy"
    }

    To read the data back in as an Array in the next deployment, use:

    # Load the array
    arr = np.load(data['np_file'])
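Both routes can be checked end to end. A sketch (a temp file stands in for the blob handling; note that the string route only works when the raw bytes happen to be valid UTF-8, and that `frombuffer` needs the original array's dtype to reconstruct it correctly):

```python
import os
import tempfile

import numpy as np

# String route: 1-D array -> bytes -> string, then string -> bytes -> array.
# dtype is fixed explicitly so both sides agree on it.
arr = np.array([1, 2, 3, 4, 5, 6, 7], dtype=np.int64)
decoded_array = arr.tobytes().decode('utf-8')
restored = np.frombuffer(decoded_array.encode('utf-8'), dtype=np.int64)

# Blob route: multi-dimensional array -> .npy file -> array.
# np.save/np.load preserve shape and dtype, so no extra bookkeeping is needed.
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
path = os.path.join(tempfile.mkdtemp(), 'data.npy')
np.save(path, arr2d)
restored2d = np.load(path)
```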

  1. Code examples taken from the Pandas Documentation 

  2. Code examples taken from the NumPy Documentation