Working with Strings in Solidity
July 06, 2018
This is the first in a series of posts we’re going to bring to you directly from the trenches, going into the nitty-gritty technical detail of what we’re doing with the Protocol at the moment.
Today’s article comes from Alex Pinto, a recent addition to our blockchain engineering team who’s been spending the past few weeks getting up to speed on using Solidity, and will take us through some of the challenges and particularities of the language.
Today I give you a post about programming for the Ethereum blockchain using the Solidity language. I won’t follow any plan in doing this: my objective is only to write about my obstacles in learning this language and the practical difficulties I encounter in my daily work.
I want the freedom to write about any topic without having first to introduce preliminary material, as I’d have to do if I were writing a textbook. If you notice me talking about things I have not explained before, that is by design. Leave me a comment below and I’ll come back to them in a later post.
Solidity has the type string and the related type bytes. Both of these are dynamic array types, which means that they can store data of an arbitrary size. Each element of a variable of type bytes is, unsurprisingly, a single byte. Each element of a variable of type string is a character of the string. So far so good, but first looks are deceiving. Someone who comes from other languages might expect the string type to provide several useful functions, like:
- determining the string’s length
- reading or changing the character at a given location in the string
- joining two strings
- extracting part of a string
Bad news: Solidity’s string does none of this! If we need any of the above, we have to do it manually.
So, let’s explore some of these difficulties and see what we can do about them. I open Remix and type the following code in a new file called string.sol.
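A minimal sketch of a contract that fits the description in this post: a state variable store of type string, with a setter and a getter. I am assuming the 0.4.x compiler syntax that the error messages quoted later belong to, and the initial literal is only a placeholder. From 0.5.0 onwards the string parameters and return values would need an explicit memory keyword.

```solidity
pragma solidity ^0.4.24;

contract String {
    // Placeholder initial value (an assumption, not taken from the post).
    string store = "abcdef";

    // Returns the current contents of the state variable.
    function getStore() public view returns (string) {
        return store;
    }

    // Overwrites the state variable with the given string.
    function setStore(string _value) public {
        store = _value;
    }
}
```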
The right side of the screen, in Remix, is taken by the developer’s area. In the Compile tab, I check the Auto-Compile option, so that Remix will notify me of errors and code-analysis warnings as I write my code. The static code-analysis is controlled by the options in the tab Analysis, and I usually have all options selected.
In the current case, Remix will report two warnings of the same kind: the methods I have written can potentially have a high-to-infinite gas cost. I will ignore that in this post.
The above contract is very minimal. It defines a state variable store of type string, a method to set it and a method to get it. Let’s test it.
In the Run tab, I hit Deploy and if there are no problems with the contract, a new area will appear below that button with the address where the contract is located and the functions that are available.
Below the working area, Remix shows a detailed record of the transaction’s result. Initially, it shows only a line indicating the account that deployed the contract, the contract and method that was called, i.e. String.(constructor), and how much Ether was passed to the execution (initially this is shown in Wei, which is the smallest unit of Ether, corresponding to 10^-18 Ether). We can expand it by clicking on the header, revealing logs, execution and transaction costs, available gas, final result, etc.
At this point, I just want to press the button getStore on the right, and notice how that shows the initial value of store beneath it. Likewise, there is a new transaction log on the left, and by clicking it we can see the same string in the decoded output. All is well.
Now, I type “0123456789” in the textbox to the right of setStore and hit that button. Then I call getStore again and I receive that string. Thumbs up, we can do basic storage/retrieval with strings!
Let’s now go for more interesting things.
Creating new strings: data location
So far, I have accessed a literal string and we have seen how we can change it by assigning to it. But that is only a very coarse way of dealing with strings. Let us create a string character by character. This will introduce us to one peculiarity of Solidity programming: data location.
I create a new method that only returns a new string with three specific characters: “Abc”.
This is a well-intentioned effort, but it does not work. Remix is kind enough to immediately point out 4 errors and 1 warning:
Two of these are on the same line:
string newString = new string(3);
- Warning: Variable is declared as a storage pointer. Use an explicit “storage” keyword to silence this warning.
- TypeError: Type string memory is not implicitly convertible to expected type string storage pointer
The other three occur in the following lines, e.g. newString[0] = "A"; and are all of the same type:
- TypeError: Index access for string is not possible.
To understand the first issue, I have to tell you about data location. Writing to the blockchain is very expensive. Every node that runs the transaction has to do the same writing, which makes the transaction more expensive and the blockchain bigger. When a node downloads a block containing this transaction, it will incur larger storage costs because of this writing. In Ethereum, every transaction has an associated cost, paid in gas, which gives programmers an incentive to be as economical as possible.
When writing a contract, authors have a choice of where their data lives. memory is cheap: it costs relatively little gas, but the data are volatile and lost after a function finishes executing. storage is the most expensive, and is absolutely needed for contract state, which must persist from function call to function call. There is also a calldata location, a read-only, non-persistent area that holds the arguments of an external function call. This is the cheapest location to use, but its contents cannot be modified. Separately, the values a function works with while it executes sit on the EVM stack, and only a limited number of stack slots can be reached at a time. In particular, that means that functions may be limited in their number of arguments.
Every data type has a default location. This is from the Solidity documentation:
Forced data location:
- parameters (not return) of external functions: calldata
- state variables: storage
Default data location:
- parameters (also return) of functions: memory
- all other local variables: storage
Notice the subtlety: function parameters are by default stored in memory, except if the function is external, in which case they stay in calldata. This means that a function that is perfectly alright when public can suddenly have too many arguments when made external, since calldata arguments can take up more stack slots than memory references.
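The rules quoted above can be sketched in a small contract. This is only an illustration under the 0.4.x defaults described in this post; the contract and function names are hypothetical, and newer compilers require the locations to be written out explicitly.

```solidity
pragma solidity ^0.4.24;

contract Locations {
    // State variable: always in storage.
    uint[] numbers;

    // External function: _values is forced to calldata and read in place.
    function sumExternal(uint[] _values) external pure returns (uint total) {
        for (uint i = 0; i < _values.length; i++) {
            total += _values[i];
        }
    }

    // Public function: _values defaults to memory, so it is copied there first.
    function sumPublic(uint[] _values) public pure returns (uint total) {
        for (uint i = 0; i < _values.length; i++) {
            total += _values[i];
        }
    }

    // Local variable with an explicit storage location: a pointer to state.
    function append(uint _value) public {
        uint[] storage ref = numbers;
        ref.push(_value); // writes through to the state variable
    }
}
```

The two sum functions have identical bodies. The only difference is where the compiler places the argument.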
Now, let’s come back to our code and examine the line
string newString = new string(3);
This is a local variable inside the function, and so by default it is in storage. The new keyword is used to specify the initial size of a memory dynamic array. Memory arrays cannot be resized. On the other hand, we can change the size of a storage dynamic array by changing its length property, but can’t use new with them.
This is the source of our error. In this case, all we want to do with this string is create it and return it to the outside. Let the outside world decide what to do with it, and whether it is temporary only or important enough to persist on the blockchain. In this example, the storage is not important, and the string will be created in memory. To do that, we add the memory keyword in the declaration, like this:
string memory newString = new string(3);
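To make the contrast between new and length concrete, here is a small sketch, again assuming the 0.4.x compiler. The names are hypothetical. Assigning to length is a feature of these older compilers and was removed in later versions of the language.

```solidity
pragma solidity ^0.4.24;

contract Resizing {
    // Storage dynamic array: it persists between calls.
    uint[] stored;

    // Storage arrays are resized through their length member (0.4.x behaviour).
    function grow(uint _newLength) public {
        stored.length = _newLength;
    }

    function storedLength() public view returns (uint) {
        return stored.length;
    }

    // Memory arrays get their size from new and keep it for their lifetime.
    function makeTemp() public pure returns (uint) {
        uint[] memory temp = new uint[](3);
        temp[2] = 7;           // index access is fine
        return temp.length;    // the size given to new
    }

    // The same applies to strings: new string(3) reserves 3 bytes in memory.
    function makeString() public pure returns (string) {
        string memory newString = new string(3);
        return newString;
    }
}
```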
Direct access to strings: equivalence with bytes
Let’s see the second sort of errors now. This is simple and unavoidable: Solidity does not currently allow index access to strings. From the FAQ:
string is basically identical to bytes only that it is assumed to hold the UTF-8 encoding of a real string. Since string stores the data in UTF-8 encoding it is quite expensive to compute the number of characters in the string (the encoding of some characters takes more than a single byte). Because of that, string s; s.length is not yet supported and not even index access s[2].
The alternative is to first transform the string into bytes, and then access it directly. This works because string is an array type, albeit with some restrictions.
But there is a trap to watch out for. bytes stores raw data; string stores UTF-8 characters. The following code does not always return the number of characters in the input string _s:
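A sketch of the kind of function meant here, assuming the 0.4.x compiler. The contract and function names are hypothetical; only the parameter name _s comes from the text.

```solidity
pragma solidity ^0.4.24;

contract StringLength {
    // Returns the number of BYTES in the UTF-8 encoding of _s.
    // This equals the number of characters only when every character
    // fits in a single byte, as in plain ASCII.
    function getLength(string _s) public pure returns (uint) {
        bytes memory b = bytes(_s); // reinterpret the string as raw bytes
        return b.length;
    }
}
```

For "abc" this gives 3, as expected. For the single character "€", whose UTF-8 encoding is the three bytes E2 82 AC, it also gives 3.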
The problem here occurs if _s contains any character that takes more than 1 byte to represent in UTF-8. In that case, the function returns the length of the byte representation of the input string, which will be greater than the number of characters.
This also has an impact when trying to address a particular character of the string, as we cannot predict at which location the character’s bytes will be. We have to parse the string linearly, identifying any multi-byte characters, or else make sure we restrict our input to characters of fixed length. If we work exclusively with ASCII strings, for example, we’ll be safe.
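The linear parse mentioned above could look something like the following sketch. It relies on a property of UTF-8: every byte after the first one in a multi-byte character has the bit pattern 10xxxxxx, so counting the bytes that do not match that pattern counts the characters. It assumes the input is valid UTF-8 and does no validation. The names are hypothetical.

```solidity
pragma solidity ^0.4.24;

contract Utf8Length {
    // Counts code points in _s by skipping UTF-8 continuation bytes.
    // Assumes _s is well-formed UTF-8; malformed input is not detected.
    function countChars(string _s) public pure returns (uint count) {
        bytes memory b = bytes(_s);
        for (uint i = 0; i < b.length; i++) {
            // 0xC0 masks the top two bits; 0x80 means "10xxxxxx".
            if ((uint8(b[i]) & 0xC0) != 0x80) {
                count++;
            }
        }
    }
}
```

The loop visits every byte, so the gas cost grows with the length of the string. That is the price of not restricting the input to single-byte characters.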
Returning to our previous function, this works:
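A sketch of a version that compiles, assuming the 0.4.x compiler. It keeps the memory string from before and writes the characters through a bytes view of it. The contract and function names are hypothetical.

```solidity
pragma solidity ^0.4.24;

contract Abc {
    // Builds the string "Abc" one byte at a time.
    function getAbc() public pure returns (string) {
        string memory newString = new string(3);
        bytes memory byteString = bytes(newString); // index access is allowed on bytes
        byteString[0] = "A";
        byteString[1] = "b";
        byteString[2] = "c";
        return string(byteString);
    }
}
```

All three characters are ASCII, so each one occupies exactly one byte and the indices line up with the character positions.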
But, for example, the following code, which tries to set the third character of a string to X, will fail when it receives multi-byte characters.
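A sketch of such a function, assuming the 0.4.x compiler. The names are hypothetical, and the length check is my own addition so that short inputs revert with a message.

```solidity
pragma solidity ^0.4.24;

contract ThirdChar {
    // Overwrites the byte at index 2 with "X".
    // This is the third CHARACTER only if the first two characters
    // are single-byte; otherwise it lands inside a multi-byte sequence.
    function setThirdChar(string _s) public pure returns (string) {
        bytes memory b = bytes(_s);
        require(b.length >= 3, "input shorter than 3 bytes");
        b[2] = "X";
        return string(b);
    }
}
```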
This returns “AbXdef” for an input of “Abcdef”, but returns “XbÁnç!” for an input of “€bÁnç!”
There are still many more things that can be said about this topic, but this is a long enough post already, so I’ll wrap up. The key concept regarding the type string is that it is an array of bytes holding UTF-8 encoded characters, and can be seamlessly converted to bytes. This conversion is the only way of manipulating the string at all. But it is important to note that UTF-8 characters do not map one-to-one onto bytes. The conversion in either direction is lossless, but there is no immediate relation between a byte index and the corresponding character index.
For most things, there may be an advantage in representing the string directly as the type bytes (avoiding conversions) and in being very careful when using characters that UTF-8 encodes in more than one byte.
That’s enough for now. See you another day, with more steps in this coding adventure.